Site Reliability Engineer III (AIML SRE)

Jersey City - USA

Monthly Salary: $133,000 - $185,000

Posted on: 30+ days ago

Vacancies: 1 Vacancy

Job Summary

Description

Are you looking for an exciting opportunity to join a dynamic and growing team in a fast-paced and challenging area? This is a unique opportunity for you to work in our team to partner with the Business to provide a comprehensive view.

As a Senior AI Reliability Engineer at JPMorgan Chase within the Technology and Operations division, you will join our dynamic team of innovators and technologists. Your mission will be to enhance the reliability and resilience of AI systems that revolutionize how the Bank services and advises clients. You will focus on ensuring the robustness and availability of AI models, deepening client engagements, and promoting process transformation. We seek team members passionate about leveraging advanced reliability engineering practices, AI observability, and incident response strategies to solve complex business challenges through high-quality, cloud-centric software delivery.

Job Responsibilities:

Define and refine Service Level Objectives (SLOs) for large language model serving and training systems using metrics like accuracy, fairness, latency, drift targets, TTFT, and TPOT, while balancing reliability and development velocity.
Design, implement, and continuously improve monitoring systems to track availability, latency, drift, and other key metrics for robust observability and rapid issue detection.
Collaborate in the design and deployment of high-availability language model serving infrastructure that supports high-traffic internal workloads across multiple regions and cloud providers.
Champion site reliability engineering practices, providing technical leadership and fostering a culture of reliability, resilience, and continuous improvement across teams.
Develop and manage automated failover and recovery systems for model serving deployments, ensuring seamless operation and rapid recovery from failures.
Create and lead AI-specific incident response playbooks for issues like model drift or bias spikes, including automated rollbacks, circuit breakers, and systematic post-incident improvements.
Build and maintain cost optimization systems for large-scale AI infrastructure, leveraging load balancing, caching, optimized GPU scheduling, and AI Gateways to ensure efficient, secure, and scalable operations.

Required qualifications, capabilities, and skills:

Formal training or certification on AI reliability concepts and 3 years of applied experience.
Demonstrate a strong sense of curiosity and a passion for continuous learning, especially in the rapidly evolving field of AI reliability.
Show proficiency in reliability, scalability, performance, security, enterprise system architecture, toil reduction, and other site reliability best practices.
Possess deep knowledge and experience in observability, including white-box and black-box monitoring, SLO alerting, and telemetry collection, using tools such as Grafana, Dynatrace, Prometheus, Datadog, and Splunk.
Be proficient with continuous integration and delivery tools like Jenkins, GitLab, or Terraform, as well as container and orchestration technologies such as ECS, Kubernetes, and Docker.
Have experience troubleshooting common networking technologies and issues, and understand the unique challenges of operating AI infrastructure, including model serving, batch inference, and training pipelines.
Communicate effectively and bridge the gap between ML engineers and infrastructure teams, with proven experience implementing and maintaining SLO/SLA frameworks for business-critical services and working with both traditional and AI-specific metrics.

Preferred qualifications, capabilities, and skills:

Experience with AI-specific observability tools and platforms such as OpenTelemetry and OpenInference.
Familiarity with AI incident response strategies, including automated rollbacks and AI circuit breakers.
Knowledge of AI-centric SLOs/SLAs, including metrics like accuracy, fairness, drift targets, TTFT (Time To First Token), and TPOT (Time Per Output Token).
Expertise in engineering for scale and security, including load balancing, caching, optimized GPU scheduling, and AI Gateways.
Experience with continuous evaluation processes, including pre-deployment, pre-release, and post-deployment monitoring for drift and degradation.
Understand ML model deployment strategies and their reliability implications.
Have contributed to open-source infrastructure or ML tooling.
Have experience with chaos engineering and systematic resilience testing.

Key Skills

Kubernetes
FMEA
Continuous Improvement
Elasticsearch
Go
Root Cause Analysis
Maximo
CMMS
Maintenance
Mechanical Engineering
Manufacturing
Troubleshooting

About Company

JPMorganChase

JPMorganChase, one of the oldest financial institutions, offers innovative financial solutions to millions of consumers, small businesses and many of the world’s most prominent corporate, institutional and government clients under the J.P. Morgan and Chase brands. Our history spans ov ...
