Member of Technical Staff, AI Reliability & Monitoring Engineering Lead

Job Location: San Francisco, CA - USA

Base Salary: $256,000 - $276,000
Posted on: 30+ days ago
Vacancies: 1 Vacancy

Job Summary

The Opportunity

Postman is seeking an experienced AI Systems Reliability Engineer to help define, build, and maintain the infrastructure and processes that ensure the reliability, scalability, and performance of Postman's AI-powered API and agentic systems in production. This role focuses on monitoring, availability, incident response, and automation to support AI services and tools trusted by millions of developers globally.

What You'll Do

  • Develop and manage reliability metrics (SLOs) for AI-driven API services and agentic AI platform features

  • Implement comprehensive observability and monitoring systems for real-time performance and fault detection

  • Design and drive automated failover, recovery, and incident response strategies for high-availability AI infrastructure

  • Optimize resource utilization, particularly GPU/accelerator efficiency, to keep AI system operation cost-effective

  • Collaborate closely with engineering, platform, and product teams to align reliability efforts with broader organizational goals

  • Lead efforts to build internal tooling and automation focused on AI system stability and operational excellence

  • Drive continuous improvement in deployment practices, monitoring approaches, and incident management processes

About You

  • Have a strong background in AI reliability engineering, SRE, or DevOps for distributed systems

  • Understand the unique challenges of maintaining large-scale AI systems and integrating AI-specific metrics into reliability frameworks

  • Are experienced with cloud platforms, monitoring tools, and incident response automation

  • Are comfortable collaborating across teams to influence best practices for AI system reliability and operational health

  • Thrive in dynamic, fast-paced environments, focusing on delivering reliable, safe AI-powered services

Bonus Skills and Experiences

  • Hands-on experience with AI/ML infrastructure, including GPU/xPU optimization and scaling

  • Familiarity with API platform operations and large-scale distributed services

  • Prior experience building or operating observability tools tailored for AI and agentic systems

  • Contribution to open-source projects or reliability engineering thought leadership

The reasonably estimated base salary for this role ranges from $256,000 to $276,000, plus a competitive equity package. Actual compensation is based on the candidate's skills, qualifications, and experience.


Required Experience:

IC

Key Skills

  • ICT
  • ASP.NET
  • Gas
  • Field

About Company

Accelerate API development with Postman's all-in-one platform. Streamline collaboration and simplify the API lifecycle for faster, better results.