SRE Lead

Kuala Lumpur - Malaysia

Monthly Salary: Not Disclosed

Posted on: 30+ days ago

Vacancies: 1

Department: Engineering

Job Summary

Description

Our Platforms Team is at the forefront of innovation, creating technology solutions that empower multiple business lines across the organization. We are looking for a senior SRE to support our applications deployed across the globe.

As an SRE practitioner, you will work to improve the reliability, availability, and performance of systems and services. You will collaborate with development and operations teams to design, implement, and maintain scalable and resilient infrastructure. Your role will involve automating processes, monitoring systems, and responding to incidents to ensure seamless user experiences.

Key Responsibilities:

System Reliability and Performance:

Design, build, and maintain scalable and reliable systems.
Monitor system performance and proactively address bottlenecks or issues.
Implement strategies to increase system uptime and minimize downtime.

Automation and Tooling:

Develop and maintain automation tools for deployment, monitoring, and incident response.
Create scripts and workflows to reduce manual intervention and improve efficiency.

Incident Management:

Respond to system outages and incidents, performing root cause analysis and implementing fixes.
Develop and maintain runbooks and documentation for incident response.

Monitoring and Observability:

Set up and maintain monitoring tools to track system health and performance.
Define and measure Service Level Indicators (SLIs) and Service Level Objectives (SLOs).

Collaboration and Communication:

Work closely with development teams to ensure systems are designed with reliability in mind.
Collaborate with operations teams to improve deployment processes and system management.

Capacity Planning and Scaling:

Analyze system usage and plan for future capacity needs.
Implement solutions to handle traffic spikes and ensure scalability.

Continuous Improvement:

Identify areas for improvement in system architecture and processes.
Advocate for best practices in reliability engineering and DevOps.

Qualifications

Strong knowledge of Linux/Unix systems and networking.
Proficiency in programming languages and automation tools such as Python, Java, and Ansible.
Experience with cloud platforms (e.g., Azure, AWS).
Familiarity with containerization and orchestration tools (e.g., Docker, Kubernetes).
Expertise in monitoring and observability tools (e.g., AppDynamics, App Insights, Dynatrace, Grafana, ELK stack).
Understanding of CI/CD pipelines and automation frameworks.
Problem-solving skills and ability to perform root cause analysis.
Excellent communication and collaboration skills.
Experience with distributed systems and microservices architecture.
Knowledge of database systems (SQL and NoSQL).
Familiarity with incident management frameworks (e.g., ITIL, SRE best practices).
Certifications in cloud technologies or DevOps tools.
Analytical mindset with a focus on reliability and scalability.
Passion for automation and reducing manual work.
Ability to work under pressure and handle critical incidents effectively.
Commitment to continuous learning and staying updated on industry trends.


About Company

Chubb

Chubb Life has been in Egypt since 2003. Chubb Life is the international life insurance division of Chubb - the world's largest publicly traded property and casualty insurance company by market capitalization.
