Site Reliability Engineer (SRE) – ML Platform

Job Location:

Sunnyvale, CA - USA

Monthly Salary: Not Disclosed
Posted on: 30+ days ago
Vacancies: 1

Job Summary

Title: Site Reliability Engineer (SRE), ML Platform

Location: Austin, TX or Sunnyvale, CA (Onsite)

Type: Direct full-time or 12-month contract

Responsibilities:

  • Continuous deployment using GitHub Actions, Flux, and Kustomize
  • Design and implement cloud solutions and build MLOps pipelines on AWS
  • Containerize and deploy data science models using Docker, vLLM, and Kubernetes
  • Communicate with a team of data scientists, data engineers, and architects, and document the processes
  • Develop and deploy scalable tools and services for our clients to handle machine learning training and inference
  • Knowledge of ML models and LLMs

Qualifications:

  • 6 years of experience in MLOps with strong knowledge of Kubernetes, Python, MongoDB, and AWS.
  • Good understanding of Apache Solr.
  • Proficient with Linux administration.
  • Knowledge of ML models and LLMs.
  • Ability to understand tools used by data scientists, and experience with software development and test automation
  • Ability to design and implement cloud solutions and to build MLOps pipelines on cloud platforms (AWS)
  • Experience working with cloud computing and database systems
  • Experience building custom integrations between cloud-based systems using APIs
  • Experience developing and maintaining ML systems built with open-source tools
  • Experience with MLOps frameworks such as Kubeflow, MLflow, DataRobot, and Airflow, as well as with Docker and Kubernetes
  • Experience developing containers and working with Kubernetes in cloud computing environments
  • Familiarity with one or more data-oriented workflow orchestration frameworks (Kubeflow, Airflow, Argo, etc.)
  • Ability to translate business needs to technical requirements
  • Strong understanding of software testing, benchmarking, and continuous integration
  • Exposure to machine learning methodology and best practices
  • Good communication skills and ability to work in a team

Note: The role's focus is 60% SRE and 40% MLOps.

With Regards

Chanakya Bhadrachalam

Sr. IT Recruiter

Key Skills

  • Kubernetes
  • FMEA
  • Continuous Improvement
  • Elasticsearch
  • Go
  • Root cause Analysis
  • Maximo
  • CMMS
  • Maintenance
  • Mechanical Engineering
  • Manufacturing
  • Troubleshooting