Senior Site Reliability Engineer

Johannesburg - South Africa

Monthly Salary: Not Disclosed

Posted on: 30+ days ago

Vacancies: 1 Vacancy

Job Summary

Position Specifications

Position Title: Senior Site Reliability Engineer (RHEL Specialist)

Primary Location: Remote

Minimum Experience: 5 years in Systems Engineering, DevOps, or SRE roles

Reporting Structure: Reports to the Head of Infrastructure & Platform Engineering

Language Requirements: Portuguese and English

1.3 Role Summary

The Senior Site Reliability Engineer (RHEL Specialist) is a critical technical leadership role responsible for ensuring that our production environments are resilient, performant, and highly automated. Unlike traditional systems administration, this role treats infrastructure as a software problem. You will be the primary custodian of our Red Hat Enterprise Linux (RHEL) ecosystem, applying advanced engineering practices to manage thousands of nodes across on-premise virtualization and public cloud platforms.

Your mission is to bridge the gap between software development and systems operations by designing self-healing systems and robust Ansible-based automation frameworks. You will be expected to proactively identify system inefficiencies, optimize kernel performance, and architect CI/CD pipelines that empower development teams while maintaining strict production stability.

Core Mission Statement:

To engineer a world-class RHEL environment where manual intervention is the exception, not the rule. Through advanced automation and deep observability, you will ensure our services achieve 99.99% availability while enabling rapid, low-risk software delivery.

1.4 Ideal Candidate Profile

The ideal candidate is a proactive problem-solver with a software-first approach to infrastructure. We are looking for an individual who:

Possesses deep expertise in the RHEL kernel, system internals, and performance tuning.

Views Ansible and Python as their primary tools for managing complexity at scale.

Demonstrates a proven track record of managing Docker and Kubernetes workloads in high-traffic production settings.

Is naturally curious and proactive, often identifying and resolving system bottlenecks before they trigger an alert.

Thrives in a collaborative DevOps culture and is comfortable navigating the complexities of hybrid-cloud environments (AWS, Azure, or GCP).

2. Key Responsibilities

The Senior Site Reliability Engineer (RHEL Specialist) is responsible for the availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning of our enterprise Linux services. This role demands a unique blend of systems engineering expertise and software development skills to build and run large-scale, distributed, fault-tolerant systems.

2.1 Automation & Infrastructure Orchestration

Ansible Framework Design: Architect, implement, and maintain enterprise-grade automation solutions using Ansible for our Red Hat Enterprise Linux (RHEL) fleet. This includes developing custom Ansible roles, modules, and playbooks to automate system provisioning, configuration management, and patching.

Standard Operating Environment (SOE): Maintain and evolve the RHEL SOE across hybrid-cloud environments, ensuring consistency between on-premise virtualization (VMware/KVM) and public cloud instances.

Infrastructure as Code (IaC): Transform manual infrastructure workflows into automated, code-based processes, ensuring that every component of the RHEL environment is version-controlled and reproducible.

2.2 Development & Toil Reduction

Scripting & Tooling: Develop advanced scripts in Python and Bash to automate repetitive operational tasks (toil). You will be expected to build internal tools that enhance the productivity of the entire engineering organization.

System Integration: Write code to integrate infrastructure components with internal APIs, monitoring tools, and service management platforms to create seamless, end-to-end automated workflows.

Kernel & OS Optimization: Leverage deep Linux knowledge to tune system parameters and develop automated checks for system health and performance bottlenecks.

2.3 CI/CD & Release Engineering

Pipeline Construction: Build and optimize robust CI/CD pipelines using Jenkins or GitLab CI to automate the testing and deployment of infrastructure code and application workloads.

Deployment Strategies: Implement modern deployment methodologies such as Blue-Green or Canary releases to minimize downtime and risk during system updates.

Artifact Management: Oversee the lifecycle of container images and software packages, ensuring they are securely built, scanned, and distributed across the RHEL landscape.

Focus on Proactive Optimization:

A primary accountability of this role is the continuous identification of system inefficiencies. You will not just react to failures; you will actively hunt for potential failure modes, performance regressions, and architectural weaknesses, proposing and implementing automated fixes before they impact the business.

2.4 Service Level Management & Observability

Defining SLOs/SLIs: Collaborate with product and development teams to define meaningful Service Level Indicators (SLIs) and Service Level Objectives (SLOs) that accurately represent the user experience.

Monitoring & Alerting: Design and implement comprehensive monitoring frameworks that provide deep visibility into RHEL system health and application performance, ensuring alerts are actionable and signal-rich.

Capacity Planning: Utilize performance data to forecast future resource requirements and automate scaling actions across the hybrid-cloud estate.

2.5 Incident Response & Reliability Leadership

Incident Management: Participate in a tiered on-call rotation and lead the response for critical production incidents. You will be responsible for stabilizing systems under pressure while maintaining clear communication with stakeholders.

Post-Mortem Analysis: Conduct blameless post-mortems to identify the root causes of incidents and drive the implementation of automated preventative measures.

SRE Evangelism: Act as a champion for reliability engineering. Mentor other engineers on SRE best practices, RHEL optimization, and the cultural shift toward everything-as-code.

3. Required Technical Skills & Qualifications

The Senior Site Reliability Engineer (RHEL Specialist) must possess a deep technical foundation in enterprise Linux systems and modern DevOps methodologies. We are seeking a candidate who has transitioned from traditional systems administration to an automation-first engineering mindset, capable of managing complex, distributed environments.

Mandatory Requirement:

A minimum of 5 years of professional experience in Site Reliability Engineering, DevOps, or Senior Linux Systems Administration is required. Candidates must demonstrate high-level proficiency in managing production RHEL environments at scale.

3.1 Enterprise Linux Mastery (RHEL Focus)

Red Hat Enterprise Linux is our primary operating system environment, so deep expertise in it is non-negotiable. Requirements include:

System Administration: Expert knowledge of RHEL 7, 8, and 9 internals, including LVM, systemd, GRUB, and advanced storage management.

Performance Tuning: Proficiency in kernel-level tuning using tools such as tuned, top, iostat, vmstat, and perf to optimize workloads.

Security Hardening: Advanced implementation of SELinux (policy management, context troubleshooting) and system security tools (firewalld, iptables, OpenSCAP).

Lifecycle Management: Proven experience with Red Hat Satellite, including Capsule management, Content Views, and automated patch scheduling via Subscription Manager.

3.2 Automation & Configuration Management

Ansible Expertise: Expert-level proficiency with Ansible Core and Ansible Automation Platform (Tower). This includes architecting complex roles, utilizing Jinja2 templating, managing dynamic inventories, and developing custom Ansible modules to meet specific infrastructure needs.

Infrastructure as Code (IaC): Strong familiarity with managing stateful infrastructure via code, focusing on idempotency and drift detection.

3.3 Programming & Scripting

The ability to read, write, and debug code is essential for eliminating toil and integrating systems.

Python: Strong proficiency for developing automation tools, API integrations, and monitoring scripts.

Bash: Expert shell scripting skills for system-level automation, boot-time scripts, and complex command-line tasks.

3.4 Cloud, Containers & Virtualization

Containerization: Hands-on experience with Docker/Podman and enterprise orchestration via Kubernetes or Red Hat OpenShift.

Virtualization: Experience managing RHEL workloads on VMware vSphere and/or KVM (libvirt/QEMU) environments.

Cloud Platforms: Practical experience managing Linux infrastructure in at least one major cloud provider: AWS, Azure, or GCP.

CI/CD Tooling: Proficiency in building and managing deployment pipelines using Jenkins or GitLab CI.

3.5 Certifications & Education

Education: Bachelor's degree in Computer Science, Information Technology, or a related field (or equivalent professional experience).

Preferred Certifications: Red Hat Certified Engineer (RHCE), Red Hat Certified Architect (RHCA), or Certified Kubernetes Administrator (CKA) certifications are highly desirable and will be given preference.

Professional Knowledge: Familiarity with SRE principles (based on the Google SRE workbook) and ITIL frameworks.

4. Desired Soft Skills

While technical proficiency in RHEL and automation is paramount, the effectiveness of a Site Reliability Engineer depends heavily on their ability to navigate complex organizational dynamics and maintain a calm, analytical mindset during system failures. We are seeking a candidate who embodies the cultural tenets of SRE: a blend of high empathy, radical transparency, and a relentless drive for efficiency.

4.1 Proactive Problem-Solving & Strategic Thinking

The ideal candidate does not wait for an alert to trigger before acting. You should possess the ability to look at complex systems holistically and identify latent defects or architectural weaknesses. We value engineers who think strategically about toil and can justify automation efforts based on long-term reliability gains rather than short-term fixes.

4.2 Radical Ownership & Accountability

Sense of Ownership: You treat the production environment as your own, taking personal pride in its uptime and performance.

Accountability: When systems fail, you focus on remediation and long-term prevention. You are comfortable standing by your technical decisions while remaining open to constructive feedback during blameless post-mortems.

4.3 Communication & Cross-Functional Collaboration

SREs sit at the intersection of Development and Operations. Success in this role requires:

Bridging the Gap: The ability to translate deep technical RHEL issues into business impact for stakeholders and actionable requirements for developers.

Active Listening: Working closely with software engineers to understand their deployment pain points and designing automation that solves real-world development hurdles.

Documentation: A commitment to maintaining clear, concise, and accurate runbooks and architectural diagrams to ensure knowledge is shared across the team.

The SRE Mindset:

We are looking for an individual who remains calm under fire. The ability to maintain focus and lead a technical team through a high-pressure P1 incident is just as important as the code you write.

4.4 Methodical Troubleshooting & Crisis Management

When faced with an unknown-unknown error, you employ a scientific and methodical approach. You should be able to:

Isolate variables systematically to identify root causes in distributed systems.

Utilize observability data (logs, metrics, traces) to build a narrative of system failure.

Prioritize restoration of service over finding the perfect fix during an active outage.

4.5 Passion for Continuous Improvement & Learning

The technology landscape, especially within the Red Hat and Cloud Native ecosystems, is constantly evolving. We value candidates who:

Are self-driven learners who keep pace with new Ansible modules, RHEL features, and SRE methodologies.

Actively seek out ways to improve not just the systems but also the team's internal processes and culture.

Enjoy mentoring junior engineers and fostering an environment of technical excellence.

5. Reporting Structure

The Senior Site Reliability Engineer (RHEL Specialist) occupies a pivotal position within the technology organization, serving as a bridge between core infrastructure management and modern software delivery. This role is designed for a technical leader who can influence architectural decisions while remaining deeply embedded in daily operational excellence.

5.1 Direct Reporting Line

This position reports directly to the Head of Site Reliability Engineering (or the Director of Cloud & Platform Infrastructure). This reporting line is structured to provide the SRE team with the necessary mandate to prioritize reliability, automation, and performance tuning over feature velocity when system stability is at risk.

Regular engagements with leadership will involve:

Weekly strategic updates on infrastructure health and Error Budget consumption.

Participation in quarterly planning sessions to align automation roadmaps with business objectives.

Direct involvement in Post-Mortem reviews with executive stakeholders for high-priority incidents.

5.2 Team Environment & The Squad Model

While reporting into the centralized SRE leadership, the candidate will function within a cross-functional Squad or Chapter environment. This model emphasizes high-touch collaboration across various technical disciplines:

Software Development Teams: Collaborating as a Reliability Consultant to help developers architect RHEL-friendly applications, optimize container workloads, and refine deployment strategies.

Quality Assurance (QA): Working together to integrate automated reliability testing and performance benchmarking into the CI/CD pipeline.

Cybersecurity: Partnering with the SecOps team to automate RHEL hardening, manage SELinux policies, and ensure rapid, automated vulnerability patching.

Cloud Architects: Assisting in the design of hybrid-cloud solutions that ensure RHEL workloads move seamlessly between on-premise virtualization and public cloud platforms.

Collaboration Philosophy:

Our organization operates on a "You Build It, You Run It" (with SRE support) philosophy. As a Senior SRE, you are not a siloed gatekeeper; you are an enabler who empowers other teams through the creation of self-service automation and reliable platform abstractions.

5.3 Internal & External Stakeholders

Junior/Mid-level SREs: Mentorship, code review of Ansible playbooks, and guiding technical troubleshooting.

Product Owners: Negotiating SLOs and providing data-driven insights on how system reliability impacts user experience.

External Partners (e.g. Red Hat): Managing technical support escalations and staying informed on upcoming RHEL roadmap features.

About Company

Future Fit
