Position Overview
The Site Reliability Engineer (SRE) ensures the reliability, scalability, and performance of complex distributed systems deployed across private, public, and hybrid cloud environments. This is a hands-on technical leadership role that combines deep infrastructure knowledge with software engineering expertise to build systems that are automated, observable, and operationally sustainable.
As a Site Reliability Engineer, you will play a central role in evolving the reliability and sustainability of the company's core platform. Your work will directly shape the resilience of mission-critical systems deployed at customer premises, influencing not only internal engineering excellence but also the long-term trust and satisfaction of enterprise clients.
The successful candidate will work within a DevIntegration team, integrating multiple layers of the product stack to enable automated Kubernetes-based GPU workload provisioning using the Cluster API framework. You will contribute both strategically and tactically - shaping architectural direction while also leading by example in implementation, troubleshooting, and mentorship.
Key Responsibilities
1. Reliability and Infrastructure Engineering
- Design, deploy, and maintain highly available, fault-tolerant systems running on Kubernetes and bare metal infrastructure.
- Define and manage Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Error Budgets to balance innovation velocity with operational stability.
- Lead system reliability initiatives, ensuring that uptime and performance targets are consistently achieved.
2. System Integration and Automation
- Work within the DevIntegration team to integrate diverse components of the product stack, enabling end-to-end cluster provisioning and management.
- Build automation pipelines using Infrastructure as Code (IaC) and CI/CD frameworks to ensure consistent, repeatable deployments.
- Develop scripts, frameworks, and tools to eliminate manual interventions and improve system resilience.
3. Architecture and Design Leadership
- Participate in and lead architectural discussions, from high-level design to low-level implementation, to ensure alignment with reliability, security, and scalability goals.
- Collaborate with development and product teams to address functional gaps and propose sustainable technical solutions in a fast-paced environment.
4. Operational Excellence
- Ensure long-term operational sustainability of the deployed product, including updates, incident management, and integration with third-party enterprise systems such as PKI, IAM, and SIEM.
- Conduct performance optimization, capacity planning, and root cause analysis to maintain system health.
- Champion automation of day-2 operations such as monitoring, scaling, patching, and recovery.
5. Leadership and Mentorship
- Take ownership beyond engineering scope when needed - leading planning, coordination, and execution activities with an end-to-end accountability mindset.
- Mentor and support team members, sharing deep expertise in reliability engineering, infrastructure design, and troubleshooting best practices.
- Actively contribute to defining and refining SRE standards and processes across the organization.
Qualifications :
Required Qualifications
8 years of hands-on experience managing mission-critical, high-availability production environments.
Proven background in Site Reliability Engineering, DevOps, or Infrastructure Engineering.
Strong understanding of cloud infrastructure (AWS, GCP, Azure) and private clouds; experience with bare metal environments is a plus.
Proficiency in at least one general-purpose programming language (Python or Go preferred).
Solid grasp of Infrastructure as Code principles and modern deployment methodologies (Terraform, Ansible, Helm, ArgoCD, or similar).
Expertise in containerization and orchestration technologies (Docker, Kubernetes, Cluster API).
Demonstrated experience with scalable distributed systems and building high-availability architectures.
Deep understanding of Linux systems and networking fundamentals (TCP/IP, DNS, routing, firewalls); experience in a network provider or ISP environment is a plus.
Strong knowledge of modern observability stacks (Prometheus, VictoriaMetrics, ClickHouse, OpenSearch/Elasticsearch) and root-cause analysis techniques.
Familiarity with security and compliance frameworks such as OWASP, ISO 27001, CSA, and PCI DSS.
Exceptional analytical, problem-solving, and debugging abilities.
Proven experience working effectively in distributed teams, fostering collaboration across multiple functions.
A mindset of extreme ownership - driving continuous improvement, accountability, and operational excellence.
Preferred Qualifications
Experience in GPU-based workload orchestration and performance optimization.
Familiarity with chaos engineering and proactive reliability testing.
Experience contributing to or leading incident response frameworks and on-call rotations.
Exposure to edge computing, AI/ML infrastructure, or data-intensive systems.
Prior experience mentoring engineers and influencing technical direction across teams.
Key Competencies
Systemic thinking and architectural foresight
Proactive automation and continuous improvement mindset
Advanced troubleshooting and observability orientation
Ability to operate effectively across multiple technical domains
Leadership presence with strong communication and mentoring skills
Additional Information :
What does Mirantis offer you?
Work with an established Silicon Valley leader in the cloud infrastructure industry.
Work with exceptionally passionate, talented, and engaging colleagues, helping Fortune 500 and Global 2000 customers implement next-generation cloud technologies.
Be a part of cutting-edge open-source innovation.
Thrive in the high-energy environment of a young company where openness, collaboration, risk-taking, and continuous growth are valued.
Professional development and training.
Attend conferences and working groups.
Customized workstation (macOS, Windows).
A competitive compensation package with a strong benefits plan and stock options.
It is understood that Mirantis Inc. may use automated decision-making technology (ADMT) for specific employment-related decisions. Opting out of ADMT use is requested for decisions about evaluation and review connected with the specific employment decision for the position applied for. You also have the right to appeal any decisions made by ADMT by sending your request to
By submitting your resume, you consent to the processing and storage of your personal data, in accordance with applicable data protection laws, for the purposes of considering your application for current and future job opportunities.
We are a Leader for Container Management in G2 (#2 after AWS)!
Remote Work :
Yes
Employment Type :
Full-time