Lead AI Engineer - DevOps
ROLE SUMMARY
As a Lead DevOps Engineer, you will lead the design and operation of the build, release, infrastructure, observability and runtime engineering practices that enable product teams to ship and operate secure, scalable and reliable digital solutions, including AI-enabled and agentic AI products. This role is not limited to infrastructure automation; it also requires a strong full-stack engineering orientation, with the ability to understand how frontend, backend, APIs, data services and AI services come together in production systems.
You will work closely with software engineers, AI engineers, integration engineers, Team Leads, architects and platform teams to ensure systems are deployable, observable, supportable and cost-aware. You will guide CI/CD design, environment standardization, release automation, platform reliability, cloud-native deployment practices and engineering enablement across the SDLC.
The ideal candidate brings deep experience in DevOps, cloud platforms, automation, platform engineering, containerization, release engineering and observability, along with a strong practical understanding of application architecture, full-stack delivery patterns and production support expectations.
KEY RESPONSIBILITIES
- Lead the design, implementation and continuous improvement of CI/CD pipelines, deployment workflows, environment strategies and release automation for digital and AI-enabled products
- Build and operate cloud-native infrastructure and runtime platforms that support backend services, APIs, UI applications, integrations and AI workloads
- Partner with engineering teams to improve deployability, testability, scalability, observability and operational resilience across the full product stack
- Design and maintain infrastructure-as-code, environment provisioning, secrets management, access control and deployment consistency across development, test and production environments
- Support delivery of containerized services, microservices, web applications, event-driven systems and AI-enabled application components
- Contribute to architecture and delivery discussions by bringing a strong understanding of backend services, APIs, frontend deployment needs, runtime dependencies and full-stack production patterns
- Implement and optimize observability using logs, metrics, distributed tracing, dashboards, alerts and cost / capacity signals
- Support AI and ML workloads by enabling deployment environments, model-serving patterns, runtime monitoring, cost visibility and release controls
- Drive operational readiness practices such as runbooks, deployment validation, rollback mechanisms, incident response, root-cause analysis and post-incident improvement
- Standardize engineering practices for build automation, release quality, environment hygiene, dependency control and operational support
- Collaborate with security, architecture and platform teams to ensure solutions meet requirements for security, reliability, compliance, supportability and scale
- Mentor engineers on DevOps and platform engineering best practices and contribute reusable accelerators for delivery teams
REQUIRED QUALIFICATIONS
- 6 to 8 years of experience in DevOps, platform engineering, site reliability engineering, cloud engineering or software engineering, including strong hands-on experience operating production systems in enterprise environments
- Proven experience building and operating CI/CD pipelines, cloud-native deployment platforms, containerized workloads, infrastructure automation and release engineering frameworks
- Strong hands-on experience with Azure DevOps, GitHub, GitHub Actions, Terraform, Bicep, Docker, Kubernetes / AKS, Azure Container Registry, Azure Functions, Azure Container Apps or equivalent DevOps and cloud platforms
- Strong understanding of cloud-native application delivery, including backend APIs, event-driven services, authentication flows, runtime dependencies, deployment pipelines and production support models
- Practical experience with full-stack application delivery patterns, including operational understanding of React-based frontends, services, Python / backend APIs, REST services, microservices, containerized applications and modern web deployment architectures
- Familiarity with frontend and backend build pipelines, static asset deployment, service configuration, environment variables, API gateway integration and full-stack runtime troubleshooting
- Strong experience with observability, logging, tracing and platform diagnostics using tools such as Azure Monitor, Application Insights, OpenTelemetry, Log Analytics, Datadog, New Relic, Grafana, Prometheus or equivalent monitoring and reliability platforms
- Experience implementing infrastructure as code, secrets and identity management, environment standardization, deployment controls, rollback strategies and operational governance practices
- Familiarity with AI- and ML-enabled workloads, including runtime support for Azure OpenAI, Azure AI Studio, PromptFlow, Azure Machine Learning or equivalent platforms, from a deployment, monitoring and operational readiness standpoint
- Understanding of CI/CD, test automation, release quality, incident response, root-cause analysis and continuous reliability improvement across the SDLC
- Ability to work closely with software engineers, AI engineers, architects and Team Leads to enable fast, secure and maintainable delivery
- Proven ability to reduce operational toil, improve engineering productivity and standardize delivery through automation and platform improvements
- Strong communication and technical leadership skills, including mentoring engineers and influencing engineering standards across teams
PREFERRED QUALIFICATIONS
- Experience supporting or enabling AI / GenAI / agentic AI products in production environments
- Familiarity with Azure OpenAI, Azure AI Studio, PromptFlow, Azure Machine Learning or equivalent platforms from a deployment, monitoring and operational support perspective
- Experience designing deployment and runtime patterns for LLM-powered services, agent orchestration services, vector-enabled retrieval and API-integrated AI systems
- Familiarity with Model Context Protocol (MCP), asynchronous workflows, long-running agents or other runtime patterns relevant to agentic AI systems
- Experience enabling secure delivery of products with integrations into SAP, ServiceNow, API gateways, workflow platforms and event-driven enterprise systems
- Hands-on experience with performance tuning, caching strategies, request tracing, service dependency analysis and runtime diagnostics in full-stack production systems
- Experience contributing platform accelerators, reusable IaC modules, DevOps templates, shared dashboards or internal engineering enablement toolkits
- Familiarity with cost optimization / FinOps, capacity planning and scaling strategies for cloud-native and AI-heavy workloads
- Experience in a build-own-operate product organization where engineering teams are responsible for long-term supportability and operational excellence
- Ability to influence architecture, platform choices and delivery patterns across multiple teams without losing hands-on technical depth
Required Experience: IC
About Company
Ecolab is the global leader in water, hygiene and energy technologies and services. Every day, we help make the world cleaner, safer and healthier – protecting people and vital resources.