Site Reliability Engineer on AI Platform, Director

Bengaluru - India

Monthly Salary: Not Disclosed

Posted on: 30+ days ago

Vacancies: 1 Vacancy

Job Summary


We're seeking someone to join our AI Platform team as a Site Reliability Engineer on the AI Platform to help support, scale, and harden the infrastructure that powers our AI/ML systems. You will collaborate closely with infrastructure engineering, cloud engineering, data engineering, and security teams to ensure the availability, reliability, performance, and security of production AI workloads (training, inference, data pipelines) in a regulated, high-stakes financial environment. As an SRE on the AI platform, you will bring deep operations, automation, and systems engineering skills to enable our models and pipelines to run reliably at scale while balancing cost, security, and compliance constraints.

The ideal candidate will have strong hands-on experience supporting software platforms on any combination of the following: Kubernetes, cloud (AWS, Azure, and/or Google), API-based development, REST frameworks, data engineering, and large-scale API gateway environments. Knowledge of AI/ML and hands-on experience implementing solutions using Generative AI are also preferable. The candidate will have great communication skills, a team-based mentality, and a strong passion for using AI to increase productivity as well as to help generate new ideas for product and technical improvements.

Our mission is to develop a firmwide Artificial Intelligence (AI) Development Platform that aligns with the firm's Technology principles, drives efficiency and consistency, controls, security, and strong governance, and promotes innovation, enabling teams to build applications that leverage AI capabilities and accelerate the adoption of AI across our businesses.

In the Technology division, we leverage innovation to build the connections and capabilities that power our Firm, enabling our clients and colleagues to redefine markets and shape the future of our communities. This is an SRE on the AI platform position at Director level, which is part of the Infrastructure Production Management & Reliability Engineering job family that maintains the stability and reliability of the organization's infrastructure systems, ensuring optimal performance and availability to support business operations.

Morgan Stanley is an industry leader in financial services, known for mobilizing capital to help governments, corporations, institutions, and individuals around the world achieve their financial goals.

What you'll do in the role:

Operate, monitor, and maintain the infrastructure supporting GenAI applications (training, inference, feature store, data ingestion, model serving).
Design and build automation for core platform capabilities, reducing manual toil.
Develop and maintain infrastructure-as-code (IaC) for provisioning and managing compute, storage, network, GPU clusters, Kubernetes / container orchestration, etc.
Establish, monitor, and enforce SLOs/SLIs/SLAs, error budgets, alerting, and dashboards.
Build and maintain Grafana dashboards for the metrics scraped by Prometheus.
Lead incident response, root cause analysis (RCA), postmortems, and systemic remediation.
Perform capacity planning, scaling strategies, workload scheduling, and resource forecasting.
Optimize cost vs. performance tradeoffs in large-scale compute environments.
Harden systems for security, compliance, auditability, and data governance.
Collaborate across teams (cloud engineers, data engineers, infrastructure, security) to ensure safe deployment, rollout, rollback, and integration of new systems.
Define disaster recovery (DR) strategies, backup/restore practices, and fault tolerance mechanisms.
Maintain runbooks, operational playbooks, documentation, and training materials.
Participate in on-call rotations and respond to production incidents 24/7 as needed.
Continuously evaluate and integrate new tools, frameworks, or technologies to enhance platform reliability.

What you'll bring to the role:

At least 6 years of relevant experience is generally expected to develop the skills required for this role.
Production experience in SRE / infrastructure / ops for large-scale systems
Strong programming/scripting skills (Python, Go, Java, or equivalent)
Deep experience with containerization (Docker) and orchestration (Kubernetes, etc.)
Experience with monitoring / observability / logging / alerting tools (Prometheus, Grafana, ELK / EFK, Datadog, PagerDuty, etc.)

Nice to have

Understanding of SRE techniques.

Infrastructure-as-code (Terraform, Helm, CloudFormation, Ansible, etc.)
Familiarity with GPU / AI compute clusters, high-performance data storage, and distributed architectures
Networking & systems engineering knowledge (TCP/IP, DNS, routing, load balancing, distributed storage)
Solid experience in capacity planning, performance tuning, scaling, and incident response
Demonstrated ability to lead RCAs, deploy fixes, and drive reliability improvements
Experience in regulated environments (financial services, compliance, audit, security) is a strong plus
Excellent communication, documentation, and cross-team collaboration skills
Proven track record of reducing operational toil via automation
Proficiency with OpenTelemetry and related observability tools, including Grafana, Loki, Prometheus, and Cortex.
Good knowledge of microservice-based architecture and industry standards for both public and private cloud.
Knowledge of data pipeline technologies (Kafka, Spark, Flink, etc.)
Good knowledge of various DB engines (SQL, Redis, Kafka, Snowflake, etc.) for cloud app storage.
Experience working with Generative AI development, embeddings, and fine-tuning of Generative AI models
Experience in high-performance computing (HPC) and distributed GPU cluster scheduling (e.g. Slurm, Kubernetes GPU scheduling)
Understanding of ModelOps / MLOps / LLMOps.
Experience with chaos engineering, canary deployments, and blue/green rollouts

WHAT YOU CAN EXPECT FROM MORGAN STANLEY:

We are committed to maintaining the first-class service and high standard of excellence that have defined Morgan Stanley for over 89 years. Our values - putting clients first, doing the right thing, leading with exceptional ideas, committing to diversity and inclusion, and giving back - aren't just beliefs; they guide the decisions we make every day to do what's best for our clients, communities, and more than 80,000 employees in 1,200 offices across 42 countries. At Morgan Stanley, you'll find an opportunity to work alongside the best and the brightest in an environment where you are supported and empowered. Our teams are relentless collaborators and creative thinkers, fueled by their diverse backgrounds and experiences. We are proud to support our employees and their families at every point along their work-life journey, offering some of the most attractive and comprehensive employee benefits and perks in the industry. There's also ample opportunity to move about the business for those who show passion and grit in their work.

To learn more about our offices across the globe please copy and paste into your browser.

Morgan Stanley is an equal opportunities employer. We work to provide a supportive and inclusive environment where all individuals can maximize their full potential. Our skilled and creative workforce is comprised of individuals drawn from a broad cross section of the global communities in which we operate and who reflect a variety of backgrounds, talents, perspectives, and experiences. Our strong commitment to a culture of inclusion is evident through our constant focus on recruiting, developing, and advancing individuals based on their skills and talents.

Required Experience:

Director

Key Skills

Kubernetes
FMEA
Continuous Improvement
Elasticsearch
Go
Root cause Analysis
Maximo
CMMS
Maintenance
Mechanical Engineering
Manufacturing
Troubleshooting

About Company

Morgan Stanley

Morgan Stanley is a leading global financial services firm providing a wide range of investment banking, securities, investment management and wealth management services. The Firm's employees serve clients worldwide including corporations, governments and individuals from more than 1, ...
