ML Ops & Observability Engineer
Job Summary
Use Your Power for Purpose
At Pfizer, technology drives everything we do. You will play a pivotal role in implementing impactful and innovative technology solutions across all functions, from research to manufacturing. Whether you are digitizing drug discovery and development, identifying innovative solutions, or streamlining our processes, you will be making a significant impact on countless lives.
What You Will Achieve
MLOps Platform Execution & Model Operations
Lead the design, implementation, and operation of MLOps platforms supporting model development, deployment, monitoring, and lifecycle management.
Own production workflows for:
Model packaging and deployment
Versioning and rollback
Promotion across environments (dev/test/prod)
Implement standardized CI/CD pipelines for ML workloads, integrating with enterprise DevOps and infrastructure platforms.
Partner with infrastructure and DataOps teams to ensure ML workloads run on secure, scalable, and cost-effective cloud-native environments (AWS/Azure).
Translate Director-level AI platform strategy into reliable, repeatable ML operational capabilities.
Model, Data & System Observability
Own end-to-end observability for ML systems spanning:
Model performance and behavior
Data quality and drift
Pipeline health and system reliability
Implement and operate observability tooling using:
OpenTelemetry for distributed tracing
Metrics and dashboards (Prometheus, Grafana)
Logs and analytics (ELK or equivalent)
Define and track ML-specific reliability signals such as:
Model performance degradation
Data drift and feature anomalies
Prediction latency and failure rates
Establish SLOs and alerting strategies for ML services in production.
Testing, Validation & Responsible AI Enablement
Ensure testing and validation are embedded throughout the ML lifecycle, including:
Model validation and regression testing
Data and feature consistency checks
Deployment verification and rollback testing
Integrate automated ML testing and quality gates into CI/CD pipelines.
Support non-functional testing for ML systems, including:
Performance and scalability testing
Reliability and resilience testing
Security and access validation
Partner with AI, data, and compliance teams to support responsible and compliant AI operations, including auditability, traceability, and explainability hooks (where required).
AI Platform Enablement & Cross-Team Collaboration
Enable data scientists and ML engineers to move models from experimentation to production efficiently and safely.
Provide reusable tooling, templates, and paved paths for:
Experiment tracking
Model registry usage
Deployment and monitoring patterns
Collaborate closely with:
Infrastructure Engineering (runtime, scaling, security)
DataOps Engineering (data pipelines, feature stores, data quality)
Product and analytics leaders to align ML capabilities to business outcomes.
Reliability, Incident Management & Continuous Improvement
Own operational reliability for ML platforms and services.
Lead response to ML-related production incidents, including:
Model failures or degradations
Data drift-driven issues
Pipeline or inference outages
Conduct post-incident reviews and drive systemic improvements.
Continuously improve MLOps maturity using SRE-inspired practices and metrics.
People Leadership & Engineering Ways of Working
Set clear expectations for operational ownership, quality, and delivery.
Coach engineers on:
MLOps best practices
Observability and reliability mindset
Secure and compliant AI operations
Establish strong engineering discipline through design reviews, runbooks, documentation, and continuous learning.
Act as the primary execution partner to the Director-level Commercial AI Analytics Solutions & Engineering Lead for ML operations and observability.
Here Is What You Need (Minimum Requirements)
8 years of experience in ML engineering, MLOps, platform engineering, or related roles, with 3 years of people leadership.
Strong hands-on experience operationalizing ML systems in AWS or Azure environments.
Proven expertise in:
MLOps pipelines and tooling (experiment tracking, model registry, deployment, monitoring)
CI/CD for ML workloads (e.g., GitHub Actions or equivalent)
Containerized and cloud-native ML runtimes
Solid understanding of testing and validation for ML systems, including:
Model regression and performance testing
Data and feature validation
Deployment and rollback verification
Strong experience implementing observability and reliability practices using tools such as OpenTelemetry, Prometheus, Grafana, and ELK.
Demonstrated experience with DevSecOps and secure SDLC for AI/ML systems, including secrets management and access controls.
Proficiency in programming and scripting (e.g., Python, Bash, SQL; familiarity with ML frameworks).
Strong communication and collaboration skills; ability to deliver outcomes through teams and influence cross-functionally.
Bonus Points If You Have (Preferred Requirements)
Master's degree in Computer Science, Data Science, AI/ML, or related field.
Experience with MLOps platforms and tools (e.g., MLflow, Kubeflow, feature stores).
Background in data drift detection, model monitoring, and ML reliability engineering.
Familiarity with responsible AI governance or regulated environments.
Relevant certifications:
AWS/Azure Professional
Kubernetes (CKA/CKAD)
Cloud security or data/AI platform certifications
Work Location Assignment: Hybrid
Pfizer is an equal opportunity employer and complies with all applicable equal employment opportunity legislation in each jurisdiction in which it operates.
Information & Business Tech
Required Experience: IC
About Company
Learn more about us as a research-based and manufacturing pharmaceutical company: from our contribution to medical progress to sustainable production.