Experience: 6-8 years
Salary Not Disclosed
1 Vacancy
As a Site Reliability Engineer, you will:
Design and implement strategies that ensure high availability, reliability, and performance of critical systems and services.
Define and track key metrics, including Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets.
Respond to incidents, troubleshoot production issues, and lead blameless postmortems to drive continuous improvement.
Create and maintain runbooks, automate recovery processes, and support seamless incident management.
Collaborate closely with developers and operations teams to build scalable, supportable systems with best practices in CI/CD and infrastructure design.
Develop and maintain Infrastructure as Code (IaC) using tools like Terraform.
Set up observability stacks, monitoring tools (preferably Splunk), alerts, and dashboards to monitor system health and performance.
Automate operational tasks such as deployments, scaling, backups, and failovers.
What You Bring to the Table:
6 to 8 years of IT experience with strong expertise in Site Reliability Engineering or related roles.
Proficiency in programming and scripting languages such as Python, Bash, or Ruby.
Strong background in cloud platforms, especially Google Cloud Platform (GCP) and Azure, including cloud-native services, networking, and security.
Deep understanding of Linux/Unix systems and Windows, including performance tuning and system internals.
Solid hands-on experience with container technologies like Docker and orchestration tools like Kubernetes.
Familiarity with CI/CD and automation tools such as Jenkins, GitHub Actions, ArgoCD, or equivalent.
Experience with observability stacks and a strong focus on monitoring, logging, and incident response.
Awareness of security and compliance practices, including access control, secrets management, and audit logging.
You should possess the ability to:
Build tools to automate repetitive tasks and reduce manual intervention in operational processes.
Debug complex production issues effectively under pressure and take ownership during high-stress incidents.
Coach and mentor junior team members, fostering a collaborative and knowledge-sharing environment.
Take initiative with a proactive problem-solving mindset and an analytical approach to system reliability.
What We Bring to the Table:
The opportunity to work in a fast-paced environment supporting critical systems in real-world production environments.
Exposure to cutting-edge technologies across cloud, DevOps, observability, and automation.
A collaborative culture that encourages innovation, continuous learning, and technical leadership.
A supportive team environment with opportunities for growth and skill development in SRE best practices.
Full Time