Site Reliability Engineer (SRE) Data Centers
Riyadh - Saudi Arabia
Job Summary
Site Reliability Engineer (SRE) - Data Centers
Location: Riyadh / NEOM, Kingdom of Saudi Arabia (KSA)
Sector: Hyperscale Cloud Operations & AI Infrastructure
Role Type: Full-Time / Permanent
Role Objective
As the Kingdom moves toward becoming a global hub for AI and Cloud technology, the stability of our physical and virtual infrastructure is paramount. We are looking for a Site Reliability Engineer (SRE) to apply a software engineering mindset to system administration. You will be the bridge between our massive physical hardware footprint and the automated software layers that power it, ensuring our services are fast, reliable, and scalable.
Key Responsibilities
1. Infrastructure Automation & IaC
Replace manual operational tasks with automated workflows using Python, Go, or Bash.
Deploy and manage infrastructure as code (IaC) using Terraform, Ansible, or Pulumi.
Maintain and scale Kubernetes (K8s) clusters across multiple availability zones within KSA.
2. Monitoring & Incident Management
Design and implement robust observability stacks using Prometheus, Grafana, and ELK.
Participate in a 24/7 on-call rotation to manage high-severity incidents, and write thorough post-mortem (RCA) reports so that issues do not recur.
Define and track Service Level Indicators (SLIs) and Service Level Objectives (SLOs).
3. Performance Tuning & Scaling
Optimize the interaction between high-performance AI workloads (GPUs) and the underlying Linux kernel/network stack.
Collaborate with hardware teams to ensure efficient thermal management and power consumption during peak loads.
4. Security & Sovereignty
Implement security protocols in alignment with SDAIA (Saudi Data & AI Authority) and NCA (National Cybersecurity Authority) regulations.
Ensure data residency requirements are met within the Kingdom's borders.
Required Qualifications & Skills
Education: Bachelor's degree in Computer Science, Software Engineering, or a related field.
Technical Stack:
Strong proficiency in Linux/Unix administration.
Experience with containerization (Docker/Kubernetes).
Hands-on experience with at least one major cloud provider (AWS, GCP, Azure, or Oracle Cloud).
Experience: 3 years in an SRE, DevOps, or Systems Engineering role, preferably in a high-scale environment.
Local Standing: Valid registration with the Saudi Council of Engineers (SCE).
Preferred Attributes
Experience with distributed storage systems (Ceph, GlusterFS).
Knowledge of networking protocols (BGP, OSPF) in a data center context.
Familiarity with the unique infrastructure challenges of NEOM's cognitive city framework.
Application Notice
STRAGO, an equal employment opportunity employer, is recruiting on behalf of our client. If your application matches the required profile, you will be contacted to proceed with the selection process.