Senior Machine Learning Engineer - I (MLOps/LLMOps)
As a Senior Machine Learning Engineer - MLOps/LLMOps, you will design, build, and scale production-grade infrastructure and platforms that enable the full lifecycle of ML and LLM systems. You'll architect robust pipelines for model training, evaluation, deployment, and monitoring while ensuring reliability, observability, and efficiency at scale. This role collaborates closely with ML Engineers, Data Scientists, and Product teams to operationalize AI/ML solutions from prototype to production.
Responsibilities
Platform & Infrastructure
- Design and implement scalable MLOps/LLMOps platforms supporting the full ML lifecycle: data versioning, model training, evaluation, deployment, and monitoring
- Build and maintain CI/CD pipelines for ML models and LLM applications with automated testing, validation, and rollback capabilities
- Develop infrastructure-as-code (IaC) for reproducible, version-controlled ML environments
- Architect model serving infrastructure with auto-scaling, A/B testing, and canary deployment capabilities
LLM Operations
- Build platforms for LLM fine-tuning, prompt management, and experimentation at scale
- Implement evaluation frameworks for LLM performance, quality, safety, and cost optimization
- Design and deploy enterprise-grade AI agents and copilots with robust monitoring and guardrails
- Establish LLM observability: token usage tracking, latency monitoring, prompt/response logging, and cost attribution
Operational Excellence
- Own uptime, reliability, and performance of ML/LLM services (SLIs/SLOs)
- Implement comprehensive monitoring, alerting, and incident response for ML systems
- Participate in on-call rotations and drive post-incident reviews to improve system resilience
- Build automation and tooling to reduce toil and accelerate ML development velocity
Collaboration & Leadership
- Partner with ML Engineers and Data Scientists to translate research into production-ready systems
- Collaborate with platform and infrastructure teams on cloud architecture and resource optimization
- Mentor team members on MLOps best practices, production ML patterns, and operational excellence
- Drive technical decisions on tooling, frameworks, and architectural patterns
Required Qualifications and Skills
- Education: B.S./M.S./Ph.D. in Computer Science, Engineering, or a related technical field
- Experience: 4 years of software engineering experience, with 2 years focused on MLOps/LLMOps
- MLOps Expertise:
  - Production experience with ML model serving frameworks (e.g., TensorFlow Serving, TorchServe, Triton)
  - Hands-on experience with ML experiment tracking and model registry tools (MLflow, Weights & Biases, Kubeflow)
  - Proficiency in workflow orchestration (Airflow, Prefect, Kubeflow Pipelines, Metaflow)
- LLMOps Expertise:
  - Experience with LLM deployment, fine-tuning, and evaluation frameworks (e.g., vLLM, LangChain, LlamaIndex)
  - Knowledge of prompt engineering, RAG architectures, and LLM application patterns
  - Familiarity with LLM observability tools (e.g., LangSmith, Arize, WhyLabs)
- Cloud & Infrastructure:
  - Strong experience with major cloud providers (AWS, GCP, or Azure) and ML-specific services (SageMaker, Vertex AI, Azure ML, Bedrock)
  - Proficiency in containerization (Docker, Kubernetes) and infrastructure-as-code (Terraform, CloudFormation, Pulumi)
  - Experience with microservices architecture and API development (REST, gRPC)
- Software Engineering:
  - Strong programming skills in Python, Terraform, and Helm; familiarity with Go, Java, or Rust is a plus
  - Deep understanding of CI/CD practices and tools (GitHub Actions, GitLab CI, Jenkins, ArgoCD)
  - Experience with monitoring and observability stacks (Prometheus, Grafana, Datadog, ELK)
- Operational Excellence:
  - Track record of managing production systems with defined SLIs/SLOs
  - Experience with on-call rotations, incident management, and reliability engineering practices
Desired Qualifications and Skills
- Experience building internal ML platforms or developer tooling used by multiple teams
- Hands-on experience with distributed training frameworks (Ray, Horovod, DeepSpeed)
- Knowledge of model optimization techniques (quantization, distillation, pruning)
- Familiarity with feature stores (Feast, Tecton) and data versioning tools (DVC, LakeFS)
- Understanding of ML security best practices, model governance, and compliance requirements
- Experience with cost optimization and resource management for large-scale ML workloads
- Contributions to open-source MLOps/LLMOps projects
- Background in applied ML or data science with practical model development experience
About Us
Sumo Logic, Inc. helps make the digital world secure, fast, and reliable by unifying critical security and operational data through its Intelligent Operations Platform. Built to address the increasing complexity of modern cybersecurity and cloud operations challenges, we empower digital teams to move from reaction to readiness, combining agentic AI-powered SIEM and log analytics into a single platform to detect, investigate, and resolve modern challenges. Customers around the world rely on Sumo Logic for trusted insights to protect against security threats, ensure reliability, and gain powerful insights into their digital environments. For more information visit.
Sumo Logic Privacy Policy. Employees will be responsible for complying with applicable federal privacy laws and regulations as well as organizational policies related to data protection.
The expected annual base salary range for this position is $158,000 - $185,000. Compensation varies based on a variety of factors, which include (but aren't limited to) role level, skills and competencies, qualifications, knowledge, and location. In addition to base pay, certain roles are eligible to participate in our bonus or commission plans, as well as our benefits offerings and equity awards.
Must be authorized to work in the United States at the time of hire and for the duration of employment. At this time, we are not able to offer nonimmigrant visa sponsorship for this position.
Required Experience:
Senior IC