Senior ML Infrastructure Engineer

Ho Chi Minh City - Vietnam

Monthly Salary: Not Disclosed

Posted on: 30+ days ago

Vacancies: 1 Vacancy

Job Summary

Our client is seeking a Senior ML Infrastructure Engineer to build the data and model pipelines that power our AI development lifecycle. This role focuses on creating reliable, automated processes for model training, evaluation, and deployment, enabling ML Engineers and Data Scientists to move from experiment to production with speed and confidence.

You will work closely with the platform team (handling Terraform, clusters, observability, and CI/CD) to ensure seamless integration between process-level automation and infrastructure reliability.

Responsibilities:

Design and implement reproducible ML pipelines for data ingestion, feature engineering, model training, evaluation, and deployment.
Develop internal tooling and workflows for experiment tracking, model registry, and automated evaluation.
Develop automated model evaluation and validation pipelines integrating statistical tests, fairness checks, and performance regression checks.
Optimize training workflows for distributed GPUs and cloud resource efficiency.
Collaborate with ML Engineers to deploy and monitor models in production using modern CI/CD practices, and to abstract complex infrastructure into easy-to-use templates or SDKs that improve team productivity.
Collaborate with Data Scientists to standardize experiment setup, metadata, and artifact management.
Establish observability standards for ML systems (logging, metrics, tracing, alerts).
Partner with the US-based platform and data engineering teams to align pipeline needs with platform capabilities (e.g. storage, compute, orchestration, secrets management).

Qualifications:

Bachelor's or Master's degree in Computer Science, Engineering, or a related technical field.
3 years of experience in software engineering or ML infrastructure roles.
Proficiency in Python, SQL, and Shell scripting.
Hands-on experience with cloud services (AWS/GCP/Azure) and container orchestration (Docker, Kubernetes).
Experience building data or model pipelines using tools like Airflow, Kubeflow, or MLflow.
Familiarity with model deployment frameworks (TensorFlow Serving, Triton, vLLM, FastAPI).
Experience implementing CI/CD for ML systems and infrastructure as code (e.g. Terraform).
Fluent English communication to collaborate with US-based teams.

Preferred Qualifications:

Experience with GPU workload optimization and distributed training.
Experience setting up feature stores, vector databases, or retrieval infrastructure for LLMs.
Familiarity with model monitoring and evaluation in production environments.
Strong understanding of security and compliance in ML infrastructure.