Company Overview
Group/Division
Job Description/Preferred Qualifications
We are seeking a hands-on AI/ML Engineer specializing in MLOps and Site Reliability Engineering (SRE) to build, operate, and continuously improve production-grade machine learning systems. In this role, you will partner with data scientists, data engineers, and software teams to standardize the ML lifecycle, improve reliability and performance, and enable rapid, safe delivery of models and AI services at scale.
Key Responsibilities
Production ML Platform & Tooling
Design and implement reusable MLOps platform capabilities for training, deployment, and monitoring of ML/LLM systems.
Build standardized pipelines for data validation, feature generation, training, evaluation, model packaging, and release.
Own model registry, artifact storage, and metadata lineage to ensure reproducibility and auditability.
Deployment Engineering & Release Safety
Deploy models and AI services using containers and orchestration (e.g., Kubernetes) with robust rollout strategies (blue/green, canary, A/B).
Create CI/CD workflows for ML code and pipelines, including automated tests, quality gates, and approval controls.
Harden inference services for low latency and high throughput using caching, batching, autoscaling, and efficient model serving patterns.
Reliability, Observability & Incident Response (SRE)
Define and track service-level indicators (SLIs) and service-level objectives (SLOs) for ML services, pipelines, and data dependencies.
Implement end-to-end observability: structured logging, metrics, tracing, dashboards, and alerting for both infrastructure and model behavior.
Lead incident response and post-incident reviews; drive systemic fixes through runbooks, automation, and reliability engineering practices.
Model & Data Monitoring
Implement monitoring for model quality and data health: drift, bias, performance degradation, and data pipeline anomalies.
Build automated feedback loops to trigger investigations, retraining workflows, and safe rollback when quality thresholds are breached.
Security, Compliance & Governance
Integrate security best practices: secrets management, least-privilege access (RBAC), network controls, and vulnerability scanning.
Support compliance and governance requirements for model usage, data access, retention, and responsible AI practices.
Collaboration & Enablement
Partner with data science and engineering teams to translate business requirements into reliable, scalable ML solutions.
Create developer-friendly documentation, templates, and internal best practices; mentor teams on MLOps and reliability standards.
Required Qualifications
Bachelor's degree in Computer Science, Engineering, Data Science, or a related field with 5 years of relevant experience; OR a Master's/PhD with 3 years of relevant experience.
Proven experience deploying and operating ML models or AI services in production environments.
Strong programming skills in Python and experience with common ML libraries and frameworks (e.g., PyTorch, TensorFlow, scikit-learn).
Hands-on DevOps/SRE experience: CI/CD, infrastructure as code, containerization, and operational excellence.
Experience with cloud platforms and managed services (Azure, AWS, or GCP) and building scalable, secure systems.
Experience with Kubernetes and modern model serving patterns (REST/gRPC, async workers, batch/stream inference).
Strong understanding of monitoring and observability (metrics, logs, traces) and incident management processes.
Ability to communicate clearly with both technical and non-technical stakeholders and to operate effectively in cross-functional teams.
Preferred Qualifications
Experience with ML platform tools such as MLflow, Kubeflow, Airflow, SageMaker, Vertex AI, or Azure Machine Learning.
Experience with feature stores, data quality frameworks, and dataset/versioning tools (e.g., Feast, Great Expectations, DVC).
Experience with distributed systems performance tuning (autoscaling, queueing, caching, load shedding).
Experience implementing model monitoring for drift, bias, and quality (e.g., Evidently, whylogs, custom statistical checks).
Knowledge of security and compliance patterns for enterprise AI (data classification, encryption, audit logging).
Contributions to open-source projects, publications, or demonstrated technical leadership through talks/blogs.
What Success Looks Like (First 6-12 Months)
Standardized CI/CD and deployment patterns for ML services that reduce time-to-production while improving safety and reliability.
Clear SLOs, dashboards, and alerts for critical AI services, with measurable improvements in uptime, latency, and incident response.
Automated monitoring and quality checks that detect drift and data issues early, with repeatable remediation workflows.
Improved reproducibility and governance through consistent artifact tracking, lineage, and documentation.
Note: Technology choices may vary by team needs; candidates should be comfortable learning and adapting to new tools.
Minimum Qualifications
Doctorate (Academic) Degree and 0 years related work experience; Master's Level Degree and related work experience of 3 years; Bachelor's Level Degree and related work experience of 5 years
We offer a competitive, family-friendly total rewards package. We design our programs to reflect our commitment to an inclusive environment while ensuring we provide benefits that meet the diverse needs of our employees.
KLA is proud to be an equal opportunity employer
Be aware of potentially fraudulent job postings or suspicious recruiting activity by persons who are currently posing as KLA employees. KLA never asks for any financial compensation to be considered for an interview, to become an employee, or for equipment. Further, KLA does not work with any recruiters or third parties who charge such fees, either directly or on behalf of KLA. Please ensure that you have searched KLA's Careers website for legitimate job postings. KLA follows a recruiting process that involves multiple interviews, in person or on video conferencing, with our hiring managers. If you are concerned that a communication, an interview, an offer of employment, or an employee is not legitimate, please send an email to to confirm the person you are communicating with is an employee. We take your privacy very seriously and confidentially handle your information.
Required Experience:
IC
Calling the adventurers ready to join a company that's pushing the limits of nanotechnology to keep the digital revolution rolling. At KLA, we're making technology advancements that are bigger (and tinier) than the world has ever seen. Who are we? We research, develop, and manufacture t ...