About the Role
Join the AI Initiatives team as a Site Reliability Engineer and help operate, scale, and evolve the foundation that powers AI products across the company.
This role sits within the AI team and focuses on ensuring that AI platforms, services, and agent-based systems are reliable, scalable, observable, and secure in production.
This is not a pure operations role. The position requires strong software engineering skills combined with deep experience in cloud infrastructure, DevOps practices, and system reliability. A genuine interest in AI systems and how they behave in real-world production environments is essential.
Responsibilities
As a Site Reliability Engineer - AI Platform, you will play a critical role in enabling the reliable delivery and operation of AI-powered products and platforms used across the organization.
- Build and Operate Reliable AI Infrastructure
  - Design, deploy, and operate cloud-native infrastructure supporting AI workloads, including LLM services, RAG pipelines, agent-based systems, and internal AI platforms.
- Full-Stack DevOps & Engineering
  - Develop automation, tooling, and services to support CI/CD, deployment, configuration, and lifecycle management of AI systems. Balance hands-on development work with infrastructure ownership and operational responsibilities.
- Infrastructure as Code & Automation
  - Define and manage infrastructure using Infrastructure as Code (e.g. Terraform, CloudFormation) and build automation for provisioning, scaling, recovery, and routine operations.
- Observability & Incident Management
  - Design and maintain observability solutions (monitoring, logging, tracing, alerting) to ensure high availability, fast detection of issues, and effective incident response for AI services.
- System Architecture & Reliability
  - Partner with AI engineers and product teams to review system designs, identify reliability risks, define SLOs/SLIs, and improve the fault tolerance, scalability, and resilience of AI-powered systems.
- Cloud Native Delivery
  - Operate and evolve containerized platforms using Docker and Kubernetes; support safe and frequent deployments through robust CI/CD pipelines.
- AI-Aware Operations
  - Develop an understanding of AI-specific operational challenges such as model serving, LLM latency, rate limits, cost control, caching, retries, fallbacks, and data pipeline reliability.
- Cross-Team Collaboration
  - Work closely with AI engineers, software engineers, and product teams to ensure that reliability, operability, and scalability are first-class concerns throughout the product lifecycle.
Qualifications:
We are seeking a seasoned professional with a strong technical background and a passion for building world-class AI applications.
Must-Have Qualifications:
- 8 years of experience in software engineering, SRE, DevOps, or platform engineering roles.
- Strong programming skills (e.g. Python, Go, JavaScript, or similar) with experience building internal tools and automation.
- Solid experience with cloud platforms (AWS, GCP, or Azure) and cloud-native architectures.
- Hands-on experience with DevOps practices, CI/CD pipelines, and container orchestration (Docker, Kubernetes).
- Strong knowledge of Infrastructure as Code (Terraform, CloudFormation, or equivalent).
- Experience designing and operating observability systems (monitoring, logging, alerting).
- Strong understanding of system architecture, reliability engineering, and production operations.
- Passion for AI technologies and curiosity about how AI systems behave at scale.
Nice-to-Have Qualifications:
- Experience supporting AI or data-intensive systems in production environments.
- Familiarity with AI/ML workloads such as model serving, RAG pipelines, or agent-based systems.
- Understanding of reliability challenges specific to AI systems (latency, cost control, scaling, failure modes).
- Experience operating enterprise-grade platforms with high availability, security, and compliance requirements.
- Familiarity with AI service platforms, e.g. AWS Bedrock or Azure Foundry.
- Experience with AI agents and Model Context Protocol (MCP), including operating, integrating, or supporting agent-based systems in production environments.
Additional Information:
Growth Opportunities
Joining our team as a Site Reliability Engineer - AI Platform offers a unique chance to work on industry-leading projects that shape the future of AI technology. You will have the opportunity to:
- Engage in continuous learning and professional development to stay at the forefront of AI advancements.
- Take on increased responsibilities, influence the strategic direction of our AI product offerings, and drive impactful innovation.
Remote Work:
No
Employment Type:
Full-time