Site Reliability Engineer (SRE) Data Centers

STRAGO

Job Location:

Riyadh - Saudi Arabia

Monthly Salary: Not Disclosed

Posted on: 30+ days ago

Vacancies: 1 Vacancy

Job Summary

Site Reliability Engineer (SRE) - Data Centres

Location: Riyadh / NEOM, Kingdom of Saudi Arabia (KSA)

Sector: Hyperscale Cloud Operations & AI Infrastructure

Role Type: Full-Time / Permanent

Role Objective

As the Kingdom moves toward becoming a global hub for AI and Cloud technology, the stability of our physical and virtual infrastructure is paramount. We are looking for a Site Reliability Engineer (SRE) to apply a software engineering mindset to system administration. You will be the bridge between our massive physical hardware footprint and the automated software layers that run on it, ensuring our services are fast, reliable, and scalable.

Key Responsibilities

1. Infrastructure Automation & IaC

Replace manual operational tasks with automated workflows using Python, Go, or Bash.
Deploy and manage infrastructure using Terraform, Ansible, or Pulumi (Infrastructure as Code).
Maintain and scale Kubernetes (K8s) clusters across multiple availability zones within KSA.

2. Monitoring & Incident Management

Design and implement robust observability stacks using Prometheus, Grafana, and ELK.
Participate in a 24/7 on-call rotation to manage high-severity incidents, and produce thorough Post-Mortem (RCA) reports to help ensure issues do not recur.
Define and track Service Level Indicators (SLIs) and Service Level Objectives (SLOs).

3. Performance Tuning & Scaling

Optimize the interaction between high-performance AI workloads (GPUs) and the underlying Linux kernel/network stack.
Collaborate with hardware teams to ensure efficient thermal management and power consumption during peak loads.

4. Security & Sovereignty

Implement security protocols in alignment with SDAIA (Saudi Data & AI Authority) and NCA (National Cybersecurity Authority) regulations.
Ensure data residency requirements are met within the Kingdom's borders.

Required Qualifications & Skills

Education: Bachelor's degree in Computer Science, Software Engineering, or a related field.
Technical Stack:
- Strong proficiency in Linux/Unix administration.
- Experience with containerization (Docker/Kubernetes).
- Hands-on experience with at least one major cloud provider (AWS, GCP, Azure, or Oracle Cloud).
Experience: 3 years in an SRE, DevOps, or Systems Engineering role, preferably in a high-scale environment.
Local Standing: Valid registration with the Saudi Council of Engineers (SCE).

Preferred Attributes

Experience with distributed storage systems (Ceph, GlusterFS).
Knowledge of networking protocols (BGP, OSPF) in a data center context.
Familiarity with the unique infrastructure challenges of NEOM's cognitive city framework.

Application Notice

STRAGO, an equal employment opportunity employer, is recruiting on behalf of our client. If your application matches the required profile, you will be contacted to proceed with the selection process.
