- Observability & Monitoring
- Enhance and expand our existing observability stack (Grafana, Prometheus, Tempo, Loki).
- Implement robust alerting mechanisms for logs, traces, and metrics to improve incident detection and response.
- Establish Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets.
- Automate dashboards and monitoring configurations for new services and infrastructure.
- Reliability & Resilience
- Drive reliability improvements across products by identifying weak points and implementing fault-tolerant designs.
- Run resiliency reviews and chaos testing to ensure systems can withstand failures.
- Partner with engineering teams to design for scalability, high availability, and disaster recovery.
- Incident Response & Postmortems
- Establish and refine incident management processes (on-call rotations, escalation policies, playbooks).
- Lead blameless postmortems, turning incidents into learning opportunities and systemic improvements.
- Automation & Tooling
- Develop automation around monitoring, logging, and alerting configurations.
- Implement self-service tools for developers to easily onboard new services into observability pipelines.
- Optimize costs of observability tools while maintaining coverage and depth.
- Collaboration & Enablement
- Work closely with software engineers, QA, DevOps, and product teams to embed observability into the development lifecycle.
- Mentor and guide teams on best practices for monitoring, instrumentation, and performance analysis.
- Foster a culture of proactive monitoring and continuous improvement.
- Proven experience as an SRE, DevOps Engineer, or in a similar role with a strong focus on observability and reliability.
- Hands-on experience with Grafana, Prometheus, Tempo, Loki, and OpenTelemetry (or similar observability stacks).
- Strong background in Linux systems, networking, and cloud platforms (AWS preferred).
- Proficiency in infrastructure-as-code tools (Terraform, CloudFormation, or similar).
- Solid programming/scripting skills (e.g., Python, Go, Bash).
- Experience setting up alerting and incident response workflows.
- Knowledge of CI/CD pipelines and modern software delivery practices.
- Strong analytical, troubleshooting, and problem-solving skills.
- Excellent communication and collaboration skills.
- Experience with chaos engineering and resilience testing.
- Knowledge of distributed systems design and scaling.
- Familiarity with cost optimization strategies for observability and monitoring tools.
- Exposure to security monitoring and compliance requirements.
Required Experience:
IC