We are seeking a highly skilled Lead DevOps Engineer with strong on-premise infrastructure expertise to join our team and drive the end-to-end deployment, scalability, and operationalization of machine learning models in production. You will collaborate closely with data scientists, data engineers, and DevOps teams to ensure seamless CI/CD, reproducibility, monitoring, and governance of ML pipelines.
Key Responsibilities
Design, implement, and maintain CI/CD pipelines for deploying and monitoring microservices efficiently in on-premise environments.
Manage infrastructure as code using Terraform (or equivalent on-prem solutions) for repeatable and scalable provisioning.
Deploy and optimize containerized applications using Docker across on-premise environments, integrating with systems such as Harbor (or other private registries), Vault, and on-prem messaging/file storage solutions.
Apply best practices for securing Docker images, including vulnerability scanning, reducing image size, and optimizing build efficiency.
Implement and maintain centralized logging, monitoring, and alerting systems (e.g., Prometheus, Grafana, ELK stack) to ensure system reliability and observability (a minimal instrumentation sketch follows this list).
Ensure security best practices across on-prem environments, including secrets management, access control, and compliance with organizational policies.
(Nice to have) Design and manage multi-client architectures within shared pipelines and storage solutions (e.g., NFS, Object Storage).
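As a flavor of the monitoring work above, here is a minimal sketch of how a Python microservice might expose metrics for an on-prem Prometheus/Grafana stack using the prometheus_client library. The metric names, labels, and port are illustrative assumptions, not project conventions.

```python
# Illustrative only: metric names, labels, and the port are assumptions, not project standards.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter(
    "inference_requests_total",
    "Total inference requests handled by the service",
    ["model", "status"],
)
LATENCY = Histogram(
    "inference_latency_seconds",
    "Inference request latency in seconds",
    ["model"],
)

def handle_request(model_name: str) -> None:
    """Simulate one request and record metrics for Prometheus to scrape."""
    start = time.perf_counter()
    try:
        time.sleep(random.uniform(0.01, 0.1))  # stand-in for real inference work
        REQUESTS.labels(model=model_name, status="ok").inc()
    except Exception:
        REQUESTS.labels(model=model_name, status="error").inc()
        raise
    finally:
        LATENCY.labels(model=model_name).observe(time.perf_counter() - start)

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for the Prometheus scraper
    while True:
        handle_request("example-model")
```

Alerting rules and Grafana dashboards would then be built on top of metrics exposed this way.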
Qualifications:
6 years of experience in DevOps or MLOps with a strong focus on production-grade ML solutions in on-premise infrastructure.
Strong expertise in CI/CD tooling, container orchestration (Docker, Kubernetes on-prem clusters), and on-premise infrastructure security.
Proficiency in Terraform (or Ansible, Puppet, or similar tools) for infrastructure automation.
Deep understanding of Docker, including best practices for securing, optimizing, and managing images.
Experience implementing centralized logging and monitoring using on-prem tools (e.g., ELK, Prometheus, Grafana).
Experience with security best practices, including secrets management, role-based access control, and compliance in an on-premise environment (a short secrets-retrieval sketch follows this list).
Experience with Docker Compose for local development and multi-container orchestration.
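For the secrets-management qualification above, here is a minimal sketch of reading a secret from an on-prem Vault with the hvac Python client. The Vault address, secret path, and key layout are hypothetical, and the token is assumed to arrive via the environment rather than being hard-coded.

```python
# Illustrative only: the Vault URL, secret path, and key names are assumptions.
import os

import hvac

def get_database_credentials() -> dict:
    """Fetch DB credentials from a KV v2 secrets engine in an on-prem Vault."""
    client = hvac.Client(
        url=os.environ.get("VAULT_ADDR", "https://vault.internal:8200"),
        token=os.environ["VAULT_TOKEN"],  # never hard-code tokens in images or code
    )
    if not client.is_authenticated():
        raise RuntimeError("Vault authentication failed")
    secret = client.secrets.kv.v2.read_secret_version(path="myapp/database")
    return secret["data"]["data"]  # KV v2 nests the payload under data.data

if __name__ == "__main__":
    creds = get_database_credentials()
    print(sorted(creds.keys()))  # avoid printing secret values in logs
```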
Additional Information:
Experience with Databricks on private cloud or equivalent on-prem data processing platforms.
Experience deploying securing and managing vector databases.
Hands-on experience with MLflow for model tracking and deployment (a brief tracking sketch follows this list).
Familiarity with best practices for multi-client architecture in shared on-prem pipelines and storage.
Python experience for microservices development, for those interested in contributing to application code.
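On the MLflow point above, a brief sketch of logging parameters, metrics, and a model to a self-hosted tracking server; the tracking URI, experiment name, and logged values are placeholders, not real project settings.

```python
# Illustrative only: the tracking URI, experiment name, and values are placeholders.
import mlflow

mlflow.set_tracking_uri("http://mlflow.internal:5000")  # assumed on-prem tracking server
mlflow.set_experiment("demo-experiment")

with mlflow.start_run(run_name="baseline"):
    mlflow.log_param("n_estimators", 100)
    mlflow.log_metric("rmse", 0.42)
    # For a real model, the flavor matching your framework
    # (e.g., mlflow.sklearn.log_model(model, "model")) would log the artifact as well.
```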
Remote Work:
No
Employment Type:
Full-time