The Red Hat IT Automation & Intelligence Evolution (AIE) team is seeking a senior site reliability engineer to drive our strategic shift from traditional operations to intelligent automation. In this pivotal role, you will serve as a technical lead for reliability and a strategic consultant to the wider organization.
You will design and implement self-service platforms, develop AI-driven operational workflows, and spearhead our alert-noise reduction campaigns. You will act as a technical leader, mentoring junior engineers and partnering with internal teams to identify high-ROI automation opportunities. Your goal is not just to resolve issues, but to permanently remove the hidden tax of toil and interruptions through engineering and AI adoption.
Reliability Engineering & Standards
Define and Enforce SLOs: Lead the definition of Service Level Indicators (SLIs) and Service Level Objectives (SLOs) for critical services, managing Error Budgets to balance feature velocity with system stability.
AIOps Implementation: Drive the adoption of AIOps solutions (including Anomaly Detection and Predictive Alerting) to reduce incident volume and improve Mean Time to Resolution (MTTR).
Resilience Engineering: Design and lead Chaos Engineering experiments (e.g., fault injection) to validate system recovery and uncover weaknesses before they impact production.
Automation & Efficiency
Eliminate Toil: Identify manual, repetitive work patterns and engineer complex automation solutions to eliminate them, aiming to boost overall team capacity.
Intelligent Workflows: Move beyond basic scripting to build intelligent agents and workflows using tools like the Model Context Protocol (MCP) and LLM integrations to automate decision-making processes.
Infrastructure as Code: Maintain and evolve the Infrastructure as Code ecosystem, ensuring robust configuration management and version control standards are applied across the environment.
Enablement & Leadership
Internal Consulting: Act as a subject matter expert, engaging with other engineering teams to scope their automation needs and help them build and manage their own workflows.
Incident Command: Lead high-severity incident response efforts, serving as the Incident Commander when necessary.
Root Cause Analysis: Facilitate blameless post-mortems, focusing on systemic root causes (Graph Algorithm Design, Blast Radius Analysis) rather than human error to prevent recurrence.
Mentorship: Mentor junior SREs, conducting code reviews and guiding them through complex troubleshooting and systems engineering principles.
Technical Competency:
Programming: Proficiency in Python or Go, with experience in building modular, scalable software.
Automation: Proficiency with Ansible for configuration management, orchestration, and automation workflows.
Observability Stack: Expert-level knowledge of monitoring ecosystems, specifically the TIGK Stack (Telegraf, InfluxDB, Grafana, Kapacitor) and Prometheus.
Cloud & Containerization: Deep understanding of Linux environments, Kubernetes/OpenShift, and public cloud infrastructure (AWS/Azure/GCP).
SRE Methodology:
Demonstrated experience designing and implementing SLIs, SLOs, and Error Budgets.
Proven track record of Toil Reduction strategies and implementation.
Experience with Incident Management lifecycles (escalation policies, paging, and post-mortems).
Soft Skills:
Growth Mindset: Open-minded approach to problem-solving and a demonstrated willingness to learn and adopt new technologies.
Strategic Thinking: Ability to translate business goals into technical roadmaps.
Communication: Strong ability to explain complex reliability concepts to non-SRE teams and leadership.
Automation Platforms: Experience with Ansible Automation Platform (AAP) or similar configuration management tools for enterprise-scale environments.
AI/LLM Integration: Experience with Model Context Protocol (MCP), Claude Plugin development, or integrating LLMs into operational workflows.
Data Science for Ops: Experience with regression data or algorithms for predictive alerting.
Security: Experience with hardening systems (Bastion Hosts) and managing security policies within automation workflows.
About Red Hat
Red Hat is the world's leading provider of enterprise open source software solutions, using a community-powered approach to deliver high-performing Linux, cloud, container, and Kubernetes technologies. Spread across 40 countries, our associates work flexibly across work environments, from in-office to office-flex to fully remote, depending on the requirements of their role. Red Hatters are encouraged to bring their best ideas, no matter their title or tenure. We're a leader in open source because of our open and inclusive environment. We hire creative, passionate people ready to contribute their ideas, help solve complex problems, and make an impact.
Inclusion at Red Hat
Red Hat's culture is built on the open source principles of transparency, collaboration, and inclusion, where the best ideas can come from anywhere and anyone. When this is realized, it empowers people from different backgrounds, perspectives, and experiences to come together to share ideas, challenge the status quo, and drive innovation. Our aspiration is that everyone experiences this culture with equal opportunity and access, and that all voices are not only heard but also celebrated. We hope you will join our celebration, and we welcome and encourage applicants from all the beautiful dimensions that compose our global village.
Equal Opportunity Policy (EEO)
Red Hat is proud to be an equal opportunity workplace and an affirmative action employer. We review applications for employment without regard to their race, color, religion, sex, sexual orientation, gender identity, national origin, ancestry, citizenship, age, veteran status, genetic information, physical or mental disability, medical condition, marital status, or any other basis prohibited by law.
Required Experience:
Senior IC
We revolutionized the operating system with Red Hat® Enterprise Linux®. Now, we have a broad portfolio, including hybrid cloud infrastructure, middleware, agile integration, cloud-native application development, and management and automation solutions. With Red Hat technologies, compa ...