Site Reliability/DevOps Engineer (Contract) - Gauteng/Hybrid - ISB5303336

iSanqa Resourcing

Job Location:

Midrand - South Africa

Monthly Salary: Not Disclosed

Posted on: 30+ days ago

Vacancies: 1 Vacancy

Job Summary

Our client is seeking an Expert Reliability Engineer to take complete ownership of the availability, performance and scalability of Retail Sales Services.

This product is the backbone for dealers, agents and NSCs globally, facilitating vehicle configuration, contract management and stock searches for all brands.

Key Highlights

Expert Level: 10 to 15 years of hands-on experience in DevOps, Infrastructure and SRE.

Infrastructure Mastery: Deep expertise in AWS, Terraform, Kubernetes and CI/CD.

Operational Excellence: High-stakes responsibility for the reliability of global retail applications.

Work-Life Balance: High flexibility with a 1960-hour annual model and hybrid work options.

Position Details

Contract Duration: 01 June 2026 to 31 December 2028.
Location: Hybrid Midrand / Menlyn / Rosslyn / Home Office rotation.
Role Level: Expert.
Experience Required: 10 to 15 years.

Qualifications & Experience

Degree in Computer Science, IT or equivalent practical experience.
10 to 15 years of hands-on experience in SRE/DevOps and cloud operations.
Relevant Certifications: AWS Solutions Architect, AWS DevOps Engineer, CKA/CKAD, Terraform Associate (Highly Advantageous).

Essential Skills & Technologies

Cloud Infrastructure & Operations
Deep hands-on experience with AWS (EC2, ECS/EKS, RDS, S3, VPC, IAM, CloudFront, Lambda)
Proven ability to design, build and operate scalable, highly available cloud infrastructure
Strong experience with infrastructure-as-code (Terraform preferred, or CloudFormation)
Solid experience with containerization and orchestration (Docker, Kubernetes)
Hands-on experience with cloud cost optimization - you know how to cut waste and right-size resources
Experience with capacity planning, scaling strategies and performance tuning
CI/CD & Automation
Deep experience building and maintaining CI/CD pipelines (GitHub Actions preferred, or Jenkins)
Strong automation mindset - you automate everything that can be automated
Experience with build tools and artifact management (Maven, Gradle, GitHub Packages, ECR)
Proficiency in scripting and tooling (Bash, Python) to solve real operational problems
Experience with GitOps workflows and deployment automation
Security & Compliance
Hands-on experience implementing security controls in CI/CD pipelines (SAST, DAST, dependency scanning)
Knowledge of container security best practices (image scanning, runtime security, least privilege)
Experience with IAM policies, network security and secrets management (AWS Secrets Manager, Vault)
Familiarity with compliance frameworks and ability to translate security requirements into implementations
Pragmatic approach to security - you find the right balance between security and velocity
Monitoring, Observability & Incident Response
Experience with monitoring and alerting solutions (Prometheus, Grafana, CloudWatch, ELK/OpenSearch)
Ability to build meaningful dashboards that provide real operational insight
Strong troubleshooting and incident response skills - you stay calm under pressure and fix things fast
Experience with post-incident root cause analysis and driving corrective actions
Mindset & Way of Working
Pragmatic doer - you bias towards action and delivering results over endless discussions
Comfortable working in agile teams (Scrum/SAFe) alongside developers, architects and product owners
Strong sense of ownership - if it's broken, it's your problem until it's fixed
Ability to prioritize ruthlessly - you know what matters and focus on high-impact work
Clear communicator who can explain technical decisions to non-technical stakeholders

Advantageous Skills:

Experience with ITSM processes (Incident, Problem, Change) and tools like ServiceNow
Experience with database operations and performance tuning (PostgreSQL, MySQL, MongoDB)
Knowledge of service mesh technologies (e.g. Istio)
Experience with chaos engineering or resilience testing
Familiarity with FinOps practices and cloud cost governance at scale
Experience with Technical Lifecycle Management (TLM) - upgrades, deprecations and migrations
Knowledge of AI-assisted DevOps tools and willingness to adopt AI4DevOps practices
Familiarity with Jira and Confluence for tracking and documentation
Experience in automotive or enterprise-scale environments

Key Responsibilities

Infrastructure & Cloud (35%)
Design, build and maintain scalable, secure and cost-efficient cloud infrastructure on AWS
Manage and evolve Kubernetes clusters, including upgrades, capacity planning and cluster health
Build and maintain infrastructure-as-code modules for repeatable, auditable deployments
Drive cloud cost optimization - identify waste, right-size resources and implement savings plans
Ensure infrastructure meets non-functional requirements: performance, scalability, availability and disaster recovery
CI/CD & Automation (25%)
Build, operate and continuously improve CI/CD pipelines for fast, safe and reliable delivery
Automate repetitive operational tasks and reduce toil through tooling and runbooks
Maintain and improve deployment automation - zero-touch deployments are the goal
Drive adoption of best practices across development teams
Own deployment runbooks and ensure they are up to date and tested
Security Implementation (20%)
Implement and maintain security scanning in CI/CD pipelines (SAST, DAST, container image scanning)
Harden container and cloud infrastructure security (network policies, IAM, secrets, encryption)
Translate security audit findings into concrete technical actions and execute them
Drive vulnerability remediation - track, prioritize and fix security issues with urgency
Ensure compliance with Group IT security standards and policies
Monitoring, Reliability & Incident Response (15%)
Implement and own monitoring, logging and alerting for proactive issue detection
Build dashboards that give real-time visibility into system health and performance
Lead incident response for infrastructure-related issues - diagnose fast, fix fast
Conduct post-incident reviews and drive corrective actions to prevent recurrence
Continuously improve system reliability, uptime and mean time to recovery (MTTR)
Technical Optimization & Lifecycle Management (5%)
Drive Technical Lifecycle Management (TLM) - plan and execute upgrades and migrations
Identify and implement technical optimizations across the stack
Contribute to technical strategy and roadmap for platform engineering
Actively use and promote AI4DevOps tools and practices where they add real value
What Does Success Look Like?
Infrastructure: Reliable, scalable, cost-optimized - no surprises
CI/CD: Fast, safe pipelines - developers ship with confidence
Security: Vulnerabilities found early, fixed fast - no excuses
Incidents: Quick response, thorough root cause analysis, and things get better over time
Automation: If you did it twice manually, the third time it's automated
Delivery: You ship improvements continuously - not just plans, but results
We don't need someone who writes documents about how things should be done. We need someone who rolls up their sleeves and makes things better - every single day.

NB:

South African citizens/residents are preferred.
Applicants with valid work permits will also be considered.
By applying, you consent to being added to the database and receiving updates until you unsubscribe.
If you do not receive a response within 2 weeks, please consider your application unsuccessful.

#iSanqa #Group #ReliabilityEngineer #SRE #DevOps #AWS #Kubernetes #Terraform #InfrastructureEngineering #GautengJobs #HybridWork #HiringNow

iSanqa is your trusted Level 2 BEE recruitment partner, dedicated to continuous improvement in delivering exceptional service. Specializing in seamless placements for permanent staff and temporary resources, as well as efficient contract management and billing facilitation, iSanqa Resourcing is powered by a team of professionals with an outstanding track record. With over 100 years of combined experience, we are committed to evolving our practices to ensure ongoing excellence.