We are seeking a skilled and passionate Engineer to join our team to build and operate a Whole-of-Government (WoG) runtime platform.
As a Site Reliability Engineer, you will be responsible for designing and operating the GitLab, AWS, and Kubernetes-based infrastructure and solutions that power our runtime platform, ensuring its stability, scalability, and performance.
Responsibilities:
As a Site Reliability Engineer, you will be responsible for:
Toil Reduction & Automation
Identify repetitive tasks and develop automation via CI/CD pipelines, ensuring integration with cross-functional teams to reduce manual intervention and improve operational efficiency.
Observability & System Health
Implement comprehensive observability solutions (logs, metrics, traces, alerts) around the four Golden Signals (latency, traffic, errors, saturation), and build automation for proactive system health assessments and self-remediation.
Production Support & Incident Management
Participate in on-call rotations, respond promptly to incidents to minimize MTTR, and conduct thorough post-incident reviews to implement preventive measures and improve system resilience.
Security & Compliance
Design and implement solutions that are secure and compliant by collaborating with dedicated security teams, conducting regular audits, and integrating advanced vulnerability scanning tools.
Maintenance, Optimisation & Performance
Identify and resolve performance bottlenecks and operational issues, define and track KPIs (e.g. MTTR, system uptime, cost efficiency), and drive ongoing optimisation efforts.
Strategic Customer Engagement
Act as a technical advisor for tenants, guiding them on containerization and best practices for cloud-native deployments, and participate in strategic initiatives to enhance platform scalability and performance.
Knowledge Sharing & Documentation
Develop and maintain detailed playbooks, runbooks, and documentation to facilitate team-wide knowledge sharing, streamline incident response, and ensure that critical processes are well understood across the team.
Continuous Learning & Innovation
Stay current with the latest AWS, Kubernetes, and industry developments, and proactively recommend improvements and innovative solutions to maintain a competitive and reliable platform.
Requirements:
Bachelor's degree or Diploma in Computer Science, Engineering, or a related field (or equivalent experience).
Proven experience as a Site Reliability Engineer or in a similar role, with a strong background in containerization, orchestration, and cloud-native technologies.
Proven ability to troubleshoot and resolve complex technical issues in containerized applications.
Demonstrated experience with incident management, including post-incident reviews and continuous improvement.
Strong documentation skills and experience in knowledge sharing across teams.
Deep understanding of AWS, Kubernetes (including AWS EKS), and operational best practices, with familiarity in multi-cloud or hybrid environments.
Solid grasp of networking, security, and storage in both AWS and Kubernetes contexts.
Experience integrating Kubernetes with AWS cloud technologies (e.g. Secrets Manager, Load Balancers) and using infrastructure-as-code (Terraform or similar).
Hands-on experience with containerization tools (Kubernetes, Kustomize, Helm) and automation scripting (Go, Python, Bash, or equivalent).
Ability to write and maintain automated tests or conduct thorough manual testing for automation scripts, ensuring the reliability and effectiveness of automated solutions.
Familiarity with CI/CD tools (GitLab CI/CD, ArgoCD) and version control systems (Git).
Experience with observability/monitoring tools (Prometheus, Grafana, ELK Stack) and defining SLOs and Error Budgets.
Certifications such as Certified Kubernetes Administrator (CKA) or Certified Kubernetes Application Developer (CKAD) are a plus.
Experience with developing Kubernetes operators using Go, service mesh technologies, and Chaos Engineering is a plus.
Soft skills:
Proactive in identifying problems and recommending strategic solutions.
Excellent problem-solving skills with a robust analytical mindset.
Clear, concise, and effective communication skills; adept at collaborating with cross-functional teams, including development, security, and customer-facing groups.
Ability to remain calm and effective under pressure, especially during incident response.
Adaptability to rapid change, with a continuous learning mindset and a willingness to share knowledge to foster team growth.
Customer-focused, with the ability to translate technical insights into understandable, actionable guidance.
Leadership and mentoring capabilities that contribute to the development of a resilient and collaborative team environment are a plus.
Full Time