Title: Site Reliability Engineer (SRE), ML Platform
Location: Austin, TX or Sunnyvale, CA
Employment type: W2 only
Responsibilities:
- Continuous deployment using GitHub Actions, Flux, and Kustomize
- Design and implement cloud solutions and build MLOps pipelines on AWS
- Containerize and deploy data science models using Docker, vLLM, and Kubernetes
- Communicate with a team of data scientists, data engineers, and architects, and document the processes
- Develop and deploy scalable tools and services for our clients to handle machine learning training and inference.
Qualifications:
- 6 years of experience in MLOps with strong knowledge of Kubernetes, Python, MongoDB, and AWS.
- Good understanding of Apache Solr.
- Proficient with Linux administration.
- Knowledge of ML models and LLMs.
- Ability to understand the tools used by data scientists, plus experience with software development and test automation
- Ability to design and implement cloud solutions and to build MLOps pipelines on AWS
- Experience working with cloud computing and database systems
- Experience building custom integrations between cloud-based systems using APIs
- Experience developing and maintaining ML systems built with open-source tools
- Experience with MLOps frameworks such as Kubeflow, MLflow, DataRobot, and Airflow, as well as with Docker and Kubernetes
- Experience developing containers and working with Kubernetes in cloud computing environments
- Familiarity with one or more data-oriented workflow orchestration frameworks (Kubeflow, Airflow, Argo, etc.)
- Ability to translate business needs to technical requirements
- Strong understanding of software testing, benchmarking, and continuous integration
- Exposure to machine learning methodology and best practices
- Good communication skills and ability to work in a team