Site Reliability Engineer (SRE)


Job Location:

San Francisco, CA - USA

Salary: $170,000 - $250,000
Posted on: 20 hours ago
Vacancies: 1 Vacancy

Job Summary

Location: San Francisco, CA / Palo Alto, CA
Company Stage of Funding: Growth-Stage AI Infrastructure Company ($80M Raised)
Office Type: Onsite (4 Days Per Week)
Salary: $170,000–$250,000 + Competitive Equity

Company Description

We're representing a rapidly growing AI infrastructure company building a next-generation GPU cloud platform for enterprises, startups, and AI researchers. Their platform provides flexible access to GPU compute through intelligent reservation, marketplace, and consumption models that help customers optimize performance, availability, and cost.

Backed by Sequoia Capital and Lightspeed with more than $80 million in funding, the company has achieved 6x revenue growth over the past year. As demand for AI infrastructure accelerates, they're investing heavily in reliability engineering to build the automation, observability, and platform infrastructure that powers their multi-cloud GPU marketplace at scale.

What You Will Do

  • Design, build, and own the observability platform supporting a large-scale, multi-cloud GPU infrastructure.
  • Develop monitoring, distributed tracing, dashboards, and alerting systems using modern observability tooling.
  • Define and implement SLIs, SLOs, and operational metrics across customer-facing APIs and internal platform services.
  • Build automation that eliminates repetitive operational work and improves platform reliability.
  • Develop production tooling in Python or Go for infrastructure management, health checks, reconciliation, and capacity optimization.
  • Design and maintain Infrastructure-as-Code using Terraform, Pulumi, and Kubernetes.
  • Improve platform resiliency through incident response, root cause analysis, and long-term reliability improvements.
  • Partner closely with Platform, Product, and Engineering teams to ensure new services are designed for operational excellence.
  • Help establish infrastructure engineering standards, reliability practices, and operational processes as the company scales.
  • Participate in production on-call rotations while continuously reducing operational burden through automation.

Ideal Background

  • 3–10 years of experience in Site Reliability Engineering, Production Engineering, Infrastructure Engineering, or Platform Engineering.
  • Strong experience building production automation and operational tooling rather than solely responding to incidents.
  • Proven experience designing and operating large-scale Kubernetes environments.
  • Strong cloud infrastructure experience across AWS, GCP, Azure, or multi-cloud environments.
  • Experience designing distributed systems with a strong understanding of networking fundamentals.
  • Proficiency with Python and/or Go for building production-grade infrastructure tooling.
  • Experience implementing observability platforms using Prometheus, Grafana, OpenTelemetry, or similar technologies.
  • Strong understanding of Linux systems, containers, Docker, and production operations.
  • Excellent communication skills with the ability to collaborate across engineering teams.

Preferred

  • Experience supporting AI infrastructure, GPU clusters, machine learning platforms, or accelerated compute environments.
  • Familiarity with Terraform, Pulumi, Infrastructure-as-Code, and cloud automation.
  • Experience designing reliability standards, operational playbooks, and incident management processes.
  • Background at high-growth startups or major cloud infrastructure organizations.
  • Strong understanding of distributed systems, capacity planning, and performance optimization.
  • Experience building greenfield infrastructure rather than maintaining legacy systems.
  • Passion for automation, reducing operational toil, and continuously improving developer experience.
  • Ability to thrive in fast-paced startup environments with significant ownership and autonomy.

Compensation and Benefits

  • Base salary: $170,000–$250,000.
  • Competitive equity package.
  • Visa transfer sponsorship available.
  • Four-day onsite schedule across San Francisco and Palo Alto offices (all engineers collaborate in Palo Alto on Mondays).
  • Opportunity to help define the reliability and operational foundation of one of the fastest-growing AI infrastructure platforms.
  • Significant ownership over observability, automation, and production infrastructure.
  • Work alongside experienced engineers solving large-scale distributed systems and cloud infrastructure challenges.
  • Join a high-growth, venture-backed company building the infrastructure powering the next generation of AI applications.

Required Experience:

IC

About Company

Senior software engineering jobs at top AI-native startups. Recruiting from Scratch advocates for candidates — 300+ placements, 29-day avg time to hire, 90+ NPS. Browse open roles.