Site Reliability Engineer (SRE)

Recruiting From Scratch

Job Location:

San Francisco, CA - USA

Salary: $170,000 - $250,000

Posted on: 20 hours ago

Vacancies: 1 Vacancy

Job Summary

Site Reliability Engineer (SRE)

Location: San Francisco, CA / Palo Alto, CA
Company Stage of Funding: Growth-Stage AI Infrastructure Company ($80M Raised)
Office Type: Onsite (4 Days Per Week)
Salary: $170,000-$250,000 plus Competitive Equity

Company Description

We're representing a rapidly growing AI infrastructure company building a next-generation GPU cloud platform for enterprises, startups, and AI researchers. Their platform provides flexible access to GPU compute through intelligent reservation, marketplace, and consumption models that help customers optimize performance, availability, and cost.

Backed by Sequoia Capital and Lightspeed with more than $80 million in funding, the company has achieved 6x revenue growth over the past year. As demand for AI infrastructure accelerates, they're investing heavily in reliability engineering to build the automation, observability, and platform infrastructure that powers their multi-cloud GPU marketplace at scale.

What You Will Do

Design, build, and own the observability platform supporting a large-scale multi-cloud GPU infrastructure.
Develop monitoring, distributed tracing, dashboards, and alerting systems using modern observability tooling.
Define and implement SLIs, SLOs, and operational metrics across customer-facing APIs and internal platform services.
Build automation that eliminates repetitive operational work and improves platform reliability.
Develop production tooling in Python or Go for infrastructure management, health checks, reconciliation, and capacity optimization.
Design and maintain Infrastructure-as-Code using Terraform, Pulumi, and Kubernetes.
Improve platform resiliency through incident response, root cause analysis, and long-term reliability improvements.
Partner closely with Platform, Product, and Engineering teams to ensure new services are designed for operational excellence.
Help establish infrastructure engineering standards, reliability practices, and operational processes as the company scales.
Participate in production on-call rotations while continuously reducing operational burden through automation.

Ideal Background

3-10 years of experience in Site Reliability Engineering, Production Engineering, Infrastructure Engineering, or Platform Engineering.
Strong experience building production automation and operational tooling rather than solely responding to incidents.
Proven experience designing and operating large-scale Kubernetes environments.
Strong cloud infrastructure experience across AWS, GCP, Azure, or multi-cloud environments.
Experience designing distributed systems with a strong understanding of networking fundamentals.
Proficiency with Python and/or Go for building production-grade infrastructure tooling.
Experience implementing observability platforms using Prometheus, Grafana, OpenTelemetry, or similar technologies.
Strong understanding of Linux systems, containers, Docker, and production operations.
Excellent communication skills with the ability to collaborate across engineering teams.

Preferred

Experience supporting AI infrastructure, GPU clusters, machine learning platforms, or accelerated compute environments.
Familiarity with Terraform, Pulumi, Infrastructure-as-Code, and cloud automation.
Experience designing reliability standards, operational playbooks, and incident management processes.
Background at high-growth startups or major cloud infrastructure organizations.
Strong understanding of distributed systems, capacity planning, and performance optimization.
Experience building greenfield infrastructure rather than maintaining legacy systems.
Passion for automation, reducing operational toil, and continuously improving developer experience.
Ability to thrive in fast-paced startup environments with significant ownership and autonomy.

Compensation and Benefits

Base salary: $170,000-$250,000.
Competitive equity package.
Visa transfer sponsorship available.
Four-day onsite schedule across San Francisco and Palo Alto offices (all engineers collaborate in Palo Alto on Mondays).
Opportunity to help define the reliability and operational foundation of one of the fastest-growing AI infrastructure platforms.
Significant ownership over observability, automation, and production infrastructure.
Work alongside experienced engineers solving large-scale distributed systems and cloud infrastructure challenges.
Join a high-growth, venture-backed company building the infrastructure powering the next generation of AI applications.

About Company

Recruiting From Scratch

Senior software engineering jobs at top AI-native startups. Recruiting from Scratch advocates for candidates: 300+ placements, a 29-day average time to hire, and a 90+ NPS.
