Senior ML Ops Engineer

Mimecast

Job Location: Columbus, OH - USA

Posted on: 30+ days ago

Vacancies: 1 Vacancy

Job Summary

Overview

As a Senior ML Ops Engineer at Mimecast, you will be a technical leader on the AI Enablement Platform (AIP) team, responsible for ensuring that machine learning models and AI agents are deployed, scaled, observed, and maintained reliably across production environments. The AI Enablement Platform serves billions of requests per month across multiple regions, powering AI-driven capabilities in email security, insider risk, data loss prevention, and collaboration security for Mimecast's Human Risk Management platform.

This role sits at the intersection of infrastructure engineering and machine learning. You will own the design and implementation of self-service deployment tooling, platform resilience and scaling infrastructure, and operational best practices that enable ML Engineers and Data Scientists to ship models and agents independently and with confidence. You will also be responsible for building and maintaining the developer platform that accelerates the work of ML practitioners across the organization.

This is a senior individual contributor role. You are expected to drive architectural decisions, mentor other engineers, define standards, and operate with a high degree of autonomy. You will collaborate closely with ML Engineers, Software Engineers, SRE, and Cloud Platform teams.

AI-First Engineering at Mimecast

Mimecast is an AI-First engineering organization. Our teams actively leverage AI-powered development tools across all facets of engineering, from code development to testing, documentation, and operations. We're looking for leaders who don't just use AI tools but champion their adoption and establish new ways of working.

Our AI leadership extends beyond how we build to what we build. Our Mihra AI agent delivers 7x faster threat response for customers, and we're recognized as Agents of Change in Human Risk Management. Engineers here work at the intersection of cutting-edge AI tooling and AI-powered security products that protect organizations worldwide.

What You'll Do:

Self-Service Deployment Tooling: Design and build config-driven, validated workflows that enable ML Engineers to deploy models to AIP infrastructure without requiring hands-on ML Ops involvement for each release. This includes automated validation pipelines, standardized configuration schemas, endpoint provisioning, and derisked rollout patterns (canary, blue-green, rollback).
Platform Resilience and Scaling: Own the reliability and scalability of ML inference infrastructure. Design and tune autoscaling policies against real production traffic patterns, implement rate limiting and backpressure mechanisms (HTTP 429, Retry-After) at the API layer, and build request prioritization frameworks (real-time vs. batch) so the platform protects itself under load without manual intervention or consumer-side changes.
Observability and Monitoring: Develop and maintain the platform's observability stack (metrics, logging, tracing, alerting) so that monitoring is wired in by default for every deployed model and agent. Continuously monitor model performance, data drift, latency, error rates, and system health. Build dashboards and alerting that give both the AIP team and consuming teams visibility into their workloads. All team members, including leadership, participate in an on-call rotation (12-hour shifts).
CI/CD and Automation: Design, implement, and maintain robust CI/CD pipelines for ML model and infrastructure deployments. Automate testing (functional, integration, performance) as pre-deployment gates that ML Engineers can trigger themselves, with clear pass/fail criteria.
Infrastructure as Code: Manage all AIP infrastructure through Terraform and configuration management tooling. Maintain multi-region deployment capabilities and ensure infrastructure changes are reviewable, repeatable, and auditable.
Cost Optimization: Implement and enforce cost tagging and allocation at deployment time. Optimize ML inference endpoints for cost-effectiveness, including right-sizing instance types, managing reserved capacity, and providing opinionated endpoint configuration recommendations based on model characteristics.
Agent and LLM Operations: Support the deployment and operational management of AI agents and LLM-based capabilities within the AIP's templatized agent framework. This includes infrastructure for agent hosting, tool access configuration, and observability for agentic workloads.
Security and Compliance: Ensure ML systems adhere to security best practices, including input validation, authentication, network exposure controls, and automated security scanning for model configurations. Support compliance with regulatory requirements relevant to AI systems in the cybersecurity domain.
Technical Leadership: Mentor ML Ops and ML Engineers on operational best practices. Participate in architectural reviews, contribute to platform governance, and drive engineering standards through documentation, code reviews, and design discussions. Represent ML Ops in cross-functional planning with SRE, Cloud Platform, and consuming product teams.

What You'll Bring:

Strong experience with AWS, particularly SageMaker (endpoints), EC2, ECS/EKS, SQS, S3, CloudWatch, and IAM.
Proficiency in Python, Java (Spring Boot), and Bash scripting.
Deep expertise with Infrastructure as Code (Terraform) and containerization (Docker, Kubernetes), including scaling policies, ConfigMaps, and resource management.
Strong experience designing and maintaining CI/CD pipelines (Jenkins or GitHub Actions preferred), including lifecycle management, automated testing, and deployment gates for ML workflows.
Demonstrated ability to build and tune autoscaling, rate limiting, and traffic management systems for high-throughput, latency-sensitive services.
Solid understanding of observability tooling and practices (Grafana, CloudWatch, OpenTelemetry) and experience building monitoring for ML model performance in production.
Familiarity with ML frameworks (PyTorch), ML lifecycle tools (MLflow, SageMaker), and model serving patterns (real-time inference, batch transform, async processing). Experience with Triton and ONNX preferred.
Experience working in multi-region, production-grade environments handling high request volumes.
Experience and enthusiasm using AI-assisted development tooling to accelerate your own work and the work of ML Engineers.
Comfortable operating as a technical authority across ML Engineering, Software Engineering, SRE, and product teams, influencing outcomes through expertise and trust, not org chart position.

Preferred Qualifications

Experience building self-service developer platforms or internal tooling for ML/data teams.
Exposure to LLM serving infrastructure (model hosting, prompt management, token-level observability) and agentic AI deployment patterns.
Experience with cost allocation, FinOps, or cloud cost optimization for ML workloads.
Background in cybersecurity or experience operating AI systems in regulated or security-sensitive environments.
Familiarity with the Model Context Protocol (MCP) or similar agent-tool integration standards.

What We Bring:

Join our AI Enablement Platform (AIP) team to accelerate your career journey, working with cutting-edge technologies and contributing to projects that have real customer impact. You will be immersed in a dynamic environment that recognizes and celebrates your achievements.

Mimecast offers formal and on-the-job learning opportunities, maintains a comprehensive benefits package that helps our employees and their family members sustain a healthy lifestyle, and, importantly, gives you the chance to work in cross-functional teams to build your knowledge!

Our Hybrid Model: We provide you with the flexibility to live balanced, healthy lives through our hybrid working model that champions both collaborative teamwork and individual flexibility. Employees are expected to come to the office at least two days per week because working together in person:

Fosters a culture of collaboration, communication, performance, and learning.
Drives innovation and creativity within and between teams.
Introduces employees to priorities outside of their immediate realm.
Ensures important interpersonal relationships and connections with one another and our community!

The base salary range for this position is $148,000 to $222,000, plus benefits. This range represents the minimum and maximum new hire compensation for this role. The position may also be eligible for incentive plans and additional benefits in accordance with company policy and local regulations. Our salary ranges are determined by role, level, and location, with individual compensation also dependent on factors such as qualifications, experience, and skills. Final offers will reflect these considerations and may vary accordingly.

Belonging at Mimecast

Cybersecurity is a community effort. That's why we're committed to building an inclusive, diverse community that celebrates and welcomes everyone, unless they're a cybercriminal, of course.

We're proud to be an Equal Opportunity and Affirmative Action Employer, and we'd encourage you to join us whatever your background. We particularly welcome applicants from traditionally underrepresented groups.

We consider everyone equally: your race, age, religion, sexual orientation, gender identity, ability, marital status, nationality, or any other protected characteristic won't affect your application.

If you require any adjustments or accommodations due to a disability or any other reason that may help you in your interview process please let us know by emailing

Due to certain obligations to our customers, an offer of employment will be subject to your successful completion of applicable background checks conducted in accordance with local law.

It is unlawful in Massachusetts to require or administer a lie detector test as a condition of employment or continued employment.

Required Experience:

Senior IC
