ML Ops Architect
Mode: Full-time only
Work Location: Columbus, Ohio
Work Type: Hybrid, 3 days a week in office
- Customer-facing ML Ops roles
- AWS (SageMaker, Glue, Lambda, CloudWatch)
- Azure DevOps (Repos and Pipelines)
- Terraform for IaC
- Model deployment, monitoring, and support across multiple LOBs
- Familiarity with ServiceNow for incident and change management
We are seeking a highly skilled, hands-on ML Ops Architect to be stationed onsite and work closely with customer stakeholders. The ideal candidate will be responsible for defining and standardizing ML Ops frameworks, supporting the deployment and monitoring of productionized models, and enabling the productionization of new models across multiple Lines of Business (LOBs). The architect must also ensure end-to-end automation, robust observability, and compliance with enterprise standards.
Key Responsibilities:
- Customer Engagement:
  - Serve as the primary technical point of contact for ML Ops discussions with customer stakeholders.
  - Collaborate with data science, platform, and operations teams across LOBs to align on model deployment strategy.
  - Gather and refine non-functional requirements (security, scalability, reliability, etc.) from the customer.
- ML Ops Framework and Architecture:
  - Define, document, and evolve ML Ops architecture patterns for model lifecycle management.
  - Design robust, reusable, and secure CI/CD pipelines for ML models using Azure DevOps (Repos, Pipelines).
  - Ensure reproducibility, auditability, and traceability for model training and deployment.
- Model Deployment and Support:
  - Oversee productionization of new ML models across various LOBs.
  - Provide technical guidance and support for existing productionized models.
  - Manage model versioning, rollback strategies, and the model registry using SageMaker.
- Infrastructure & Automation:
  - Implement Infrastructure as Code using Terraform to provision and manage resources.
  - Leverage AWS Glue, Lambda, Step Functions, and SNS for data and model pipeline automation.
  - Maintain and optimize scheduler workflows using EventBridge.
- Monitoring and Observability:
  - Develop and maintain CloudWatch dashboards for model health and system metrics.
  - Integrate EvidentlyAI for data drift and model performance monitoring.
  - Ensure end-to-end observability, including logging, metrics, and alerting.
- Operations and Support:
  - Maintain documentation for model support procedures, troubleshooting guides, and deployment checklists.
  - Work with ServiceNow for incident, change, and problem management processes.
  - Support L1/L2 teams by enabling efficient monitoring and resolution mechanisms.
Required Skills & Experience:
- 10 years of IT experience, including 3 years in ML Ops or ML Engineering roles.
- Strong hands-on experience with:
  - Azure DevOps (Azure Repos, Pipelines)
  - AWS ML stack: SageMaker, Glue, Lambda, Step Functions, SNS, S3, Athena
  - Terraform for IaC
  - CloudWatch and EvidentlyAI for monitoring
  - Docker and ECR for image management
- Deep understanding of ML model lifecycle management and CI/CD practices.
- Proven ability to define enterprise-scale ML Ops frameworks and governance models.
- Prior experience working with ServiceNow for operational support workflows.
- Strong communication and stakeholder management skills.