Site Reliability Engineer AIML Associate

Employer Active

1 Vacancy

Job Location

Bengaluru - India

Monthly Salary

Not Disclosed

Job Description

Are you looking for an exciting opportunity to join a dynamic and growing team in a fast-paced and challenging area? This is a unique opportunity for you to work in our team and partner with the Business to provide a comprehensive view.

As a Senior AI Reliability Engineer at JPMorgan Chase within the Technology and Operations division, you will join our dynamic team of innovators and technologists. Your mission will be to enhance the reliability and resilience of AI systems that revolutionize how the Bank services and advises clients. You will focus on ensuring the robustness and availability of AI models, deepening client engagements, and promoting process transformation. We seek team members passionate about leveraging advanced reliability engineering practices, AI observability, and incident response strategies to solve complex business challenges through high-quality, cloud-centric software delivery.

Job Responsibilities:

  • Develop and refine Service Level Objectives (including metrics like accuracy, fairness, latency, drift targets, TTFT (Time To First Token), and TPOT (Time Per Output Token)) for large language model serving and training systems, balancing availability/latency with development velocity

  • Design, implement, and continuously improve monitoring systems, including availability, latency, and other salient metrics

  • Collaborate in the design and implementation of high-availability language model serving infrastructure capable of handling the needs of high-traffic internal workloads

  • Champion site reliability culture and practices, providing technical leadership and influence within your team and across teams to foster a culture of reliability and resilience

  • Develop and manage automated failover and recovery systems for model serving deployments across multiple regions and cloud providers

  • Develop AI Incident Response playbooks for AI-specific failures like sudden drift or bias spikes, including automated rollbacks and AI circuit breakers

  • Lead incident response for critical AI services, ensuring rapid recovery and systematic improvements from each incident

  • Build and maintain cost optimization systems for large-scale AI infrastructure, ensuring efficient resource utilization without compromising performance

  • Engineer for Scale and Security, leveraging techniques like load balancing, caching, optimized GPU scheduling, and AI Gateways for managing traffic and security

  • Collaborate with ML engineers to ensure seamless integration and operation of AI infrastructure, bridging the gap between development and operations

  • Implement Continuous Evaluation, including pre-deployment, pre-release, and continuous post-deployment monitoring for drift and degradation

Required qualifications, capabilities, and skills:

  • Demonstrated proficiency in reliability, scalability, performance, security, enterprise system architecture, toil reduction, and other site reliability best practices

  • Proficient knowledge and experience in observability, such as white-box and black-box monitoring, service level objective alerting, and telemetry collection, using tools such as Grafana, Dynatrace, Prometheus, Datadog, Splunk, and others

  • Proficient with continuous integration and continuous delivery tools like Jenkins, GitLab, or Terraform

  • Proficient with containers and container orchestration (ECS, Kubernetes, Docker)

  • Experience with troubleshooting common networking technologies and issues

  • Understand the unique challenges of operating AI infrastructure, including model serving, batch inference, and training pipelines

  • Have proven experience implementing and maintaining SLO/SLA frameworks for business-critical services

  • Comfortable working with both traditional metrics (latency, availability) and AI-specific metrics (model performance, training convergence)

  • Can effectively bridge the gap between ML engineers and infrastructure teams

  • Have excellent communication skills

Preferred qualifications, capabilities, and skills:

  • Experience with AI-specific observability tools and platforms such as OpenTelemetry and OpenInference

  • Familiarity with AI incident response strategies, including automated rollbacks and AI circuit breakers

  • Knowledge of AI-centric SLOs/SLAs, including metrics like accuracy, fairness, drift targets, TTFT (Time To First Token), and TPOT (Time Per Output Token)

  • Expertise in engineering for scale and security, including load balancing, caching, optimized GPU scheduling, and AI Gateways

  • Experience with continuous evaluation processes, including pre-deployment, pre-release, and post-deployment monitoring for drift and degradation

  • Understand ML model deployment strategies and their reliability implications

  • Have contributed to open-source infrastructure or ML tooling

  • Have experience with chaos engineering and systematic resilience testing





Required Experience:

IC

Employment Type

Full-Time
