Site Reliability Engineer (SRE) – ML Platform

HG Solutions

Job Location: Austin - USA

Monthly Salary: Not Disclosed
Posted on: 30+ days ago
Vacancies: 1

Job Summary

Title: Site Reliability Engineer (SRE), ML Platform

Location: Austin, TX or Sunnyvale, CA

Type: FTE/FTC

Responsibilities:

  • Continuous deployment using GitHub Actions, Flux, and Kustomize
  • Design and implement cloud solutions and build MLOps pipelines on AWS
  • Containerize and deploy data science models using Docker, vLLM, and Kubernetes
  • Communicate with a team of data scientists, data engineers, and architects, and document processes
  • Develop and deploy scalable tools and services for our clients to handle machine learning training and inference.
  • Knowledge of ML models and LLMs

Qualifications:

  • 6 years of experience in MLOps with strong knowledge of Kubernetes, Python, MongoDB, and AWS.
  • Good understanding of Apache Solr.
  • Proficient with Linux administration.
  • Knowledge of ML models and LLMs.
  • Ability to understand tools used by data scientists, and experience with software development and test automation
  • Ability to design and implement cloud solutions and to build MLOps pipelines on the cloud (AWS)
  • Experience working with cloud computing and database systems
  • Experience building custom integrations between cloud-based systems using APIs
  • Experience developing and maintaining ML systems built with open-source tools
  • Experience with MLOps frameworks such as Kubeflow, MLflow, DataRobot, and Airflow, plus experience with Docker and Kubernetes
  • Experience developing containers and working with Kubernetes in cloud computing environments
  • Familiarity with one or more data-oriented workflow orchestration frameworks (Kubeflow, Airflow, Argo, etc.)
  • Ability to translate business needs to technical requirements
  • Strong understanding of software testing, benchmarking, and continuous integration
  • Exposure to machine learning methodology and best practices
  • Good communication skills and ability to work in a team

Note: The role focus is 60% SRE and 40% MLOps.

Key Skills

  • Kubernetes
  • FMEA
  • Continuous Improvement
  • Elasticsearch
  • Go
  • Root cause Analysis
  • Maximo
  • CMMS
  • Maintenance
  • Mechanical Engineering
  • Manufacturing
  • Troubleshooting