Site Reliability Engineer

Lagos - Nigeria

Monthly Salary: Not Disclosed

Posted on: 30+ days ago

Vacancies: 1 Vacancy

Job Summary

Observability & Monitoring
- Enhance and expand our existing observability stack (Grafana, Prometheus, Tempo, Loki).
- Implement robust alerting mechanisms for logs, traces, and metrics to improve incident detection and response.
- Establish Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets.
- Automate dashboards and monitoring configurations for new services and infrastructure.
Reliability & Resilience
- Drive reliability improvements across products by identifying weak points and implementing fault-tolerant designs.
- Run resiliency reviews and chaos testing to ensure systems can withstand failures.
- Partner with engineering teams to design for scalability, high availability, and disaster recovery.
Incident Response & Postmortems
- Establish and refine incident management processes (on-call rotations, escalation policies, playbooks).
- Lead blameless postmortems, turning incidents into learning opportunities and systemic improvements.
Automation & Tooling
- Develop automation around monitoring, logging, and alerting configurations.
- Implement self-service tools for developers to easily onboard new services into observability pipelines.
- Optimize costs of observability tools while maintaining coverage and depth.
Collaboration & Enablement
- Work closely with software engineers, QA, DevOps, and product teams to embed observability into the development lifecycle.
- Mentor and guide teams on best practices for monitoring, instrumentation, and performance analysis.
- Foster a culture of proactive monitoring and continuous improvement.

Proven experience as an SRE, DevOps Engineer, or in a similar role, with a strong focus on observability and reliability.
Hands-on experience with Grafana, Prometheus, Tempo, Loki, and OpenTelemetry (or similar observability stacks).
Strong background in Linux systems, networking, and cloud platforms (AWS preferred).
Proficiency in infrastructure-as-code tools (Terraform, CloudFormation, or similar).
Solid programming/scripting skills (e.g., Python, Go, Bash).
Experience setting up alerting and incident response workflows.
Knowledge of CI/CD pipelines and modern software delivery practices.
Strong analytical, troubleshooting, and problem-solving skills.
Excellent communication and collaboration skills.
Experience with chaos engineering and resilience testing.
Knowledge of distributed systems design and scaling.
Familiarity with cost optimization strategies for observability and monitoring tools.
Exposure to security monitoring and compliance requirements.

Required Experience:

Key Skills

Kubernetes
FMEA
Continuous Improvement
Elasticsearch
Go
Root Cause Analysis
Maximo
CMMS
Maintenance
Mechanical Engineering
Manufacturing
Troubleshooting

About Company

SeamlessHR

SeamlessHR.com Limited is an equal opportunity employer and we offer employment based on merit. We do not discriminate on the grounds of age, gender, race, disability, sexual orientation, and religion/belief. Our work environment is fun, fast-paced, dynamic and collaborative with a te ...
