Site Reliability Engineer

SeamlessHR


Job Location:

Lagos - Nigeria

Monthly Salary: Not Disclosed
Posted on: 6 days ago
Vacancies: 1 Vacancy

Job Summary

  • Observability & Monitoring
    • Enhance and expand our existing observability stack (Grafana, Prometheus, Tempo, Loki).
    • Implement robust alerting mechanisms for logs, traces, and metrics to improve incident detection and response.
    • Establish Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets.
    • Automate dashboards and monitoring configurations for new services and infrastructure.
  • Reliability & Resilience
    • Drive reliability improvements across products by identifying weak points and implementing fault-tolerant designs.
    • Run resiliency reviews and chaos testing to ensure systems can withstand failures.
    • Partner with engineering teams to design for scalability, high availability, and disaster recovery.
  • Incident Response & Postmortems
    • Establish and refine incident management processes (on-call rotations, escalation policies, playbooks).
    • Lead blameless postmortems, turning incidents into learning opportunities and systemic improvements.
  • Automation & Tooling
    • Develop automation around monitoring, logging, and alerting configurations.
    • Implement self-service tools for developers to easily onboard new services into observability pipelines.
    • Optimize costs of observability tools while maintaining coverage and depth.
  • Collaboration & Enablement
    • Work closely with software engineers, QA, DevOps, and product teams to embed observability into the development lifecycle.
    • Mentor and guide teams on best practices for monitoring, instrumentation, and performance analysis.
    • Foster a culture of proactive monitoring and continuous improvement.


  • Proven experience as an SRE, DevOps Engineer, or in a similar role, with a strong focus on observability and reliability.
  • Hands-on experience with Grafana, Prometheus, Tempo, Loki, and OpenTelemetry (or similar observability stacks).
  • Strong background in Linux systems, networking, and cloud platforms (AWS preferred).
  • Proficiency in infrastructure-as-code tools (Terraform, CloudFormation, or similar).
  • Solid programming/scripting skills (e.g. Python, Go, Bash).
  • Experience setting up alerting and incident response workflows.
  • Knowledge of CI/CD pipelines and modern software delivery practices.
  • Strong analytical, troubleshooting, and problem-solving skills.
  • Excellent communication and collaboration skills.
  • Experience with chaos engineering and resilience testing.
  • Knowledge of distributed systems design and scaling.
  • Familiarity with cost optimization strategies for observability and monitoring tools.
  • Exposure to security monitoring and compliance requirements.



Required Experience:

IC


Key Skills

  • Kubernetes
  • FMEA
  • Continuous Improvement
  • Elasticsearch
  • Go
  • Root Cause Analysis
  • Maximo
  • CMMS
  • Maintenance
  • Mechanical Engineering
  • Manufacturing
  • Troubleshooting

About Company


SeamlessHR.com Limited is an equal opportunity employer and we offer employment based on merit. We do not discriminate on the grounds of age, gender, race, disability, sexual orientation, and religion/belief. Our work environment is fun, fast-paced, dynamic and collaborative with a te ...
