Job Title: ML Ops Platform Engineer
Location: Reston, VA (Hybrid, 3 days onsite)
Duration: 12-Month Contract
Interview Process: Face-to-Face Interview
Position Overview
We are seeking a highly skilled ML Ops Platform Engineer to support enterprise-scale machine learning platform engineering and operations. This role focuses on building, managing, and optimizing MLOps infrastructure across AWS and Kubernetes (EKS) environments to enable scalable model training, deployment, and monitoring across development and production ecosystems.
Key Responsibilities
Platform Engineering & Operations
- Engineer, manage, and support MLOps platform components across AWS and EKS environments.
- Oversee infrastructure used for ML training, batch inference, and real-time model serving.
- Ensure high availability, resilience, and performance across dev, test, and production.
- Implement RBAC, network policies, and scalable namespace architecture within EKS clusters.
Model Deployment & CI/CD Automation
- Build and manage CI/CD pipelines (GitLab) for model packaging, container builds, vulnerability scanning, and automated deployments.
- Enable standardized release processes, including versioning, environment promotion, and rollback workflows.
- Integrate CI/CD with ML frameworks, model registries, and runtime environments.
Container & Kubernetes Workloads
- Design and manage containerized ML workloads within AWS EKS.
- Implement auto-scaling, resource quotas, workload isolation, and cluster optimization.
- Support GPU- and CPU-based ML training and inference workloads.
Monitoring, Observability & Optimization
- Implement logging, monitoring, and alerting for ML pipelines and endpoints.
- Optimize compute, storage, and data transfer usage for cost efficiency.
- Lead incident management, root cause analysis, and long-term remediation planning.
Collaboration & Enablement
- Partner with Data Scientists and ML Engineers to operationalize ML solutions.
- Provide guidance on ML lifecycle management and scalable deployment architectures.
- Contribute to documentation, runbooks, and knowledge-sharing initiatives.
Required Qualifications
- 3 years of hands-on AWS experience (EKS, EC2, S3, IAM, CloudWatch, ECR).
- Strong Kubernetes operations experience (preferably AWS EKS).
- Proficiency with Docker and container orchestration.
- Strong scripting/programming skills in Python and Bash.
- Experience building CI/CD pipelines (GitLab or equivalent).
- Familiarity with ML workflows (training, inference, monitoring).
- Experience with Infrastructure-as-Code (Terraform or CloudFormation).
- Experience supporting production platforms, including incident response and root cause analysis.
Preferred Qualifications
- Experience with data analytics platforms (e.g., Domino, SageMaker).
- Familiarity with ML lifecycle tools (e.g., MLflow).
- Experience supporting GPU-based workloads or distributed training.
- Understanding of enterprise MLOps architectures (batch, real-time, microservices).
- Knowledge of data processing frameworks and feature engineering pipelines.
Note: Momento USA is an Equal Opportunity/Affirmative Action Employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, pregnancy, sexual orientation, gender identity, national origin, age, protected veteran status, or disability status.