We are seeking an experienced Observability & Monitoring Engineer (SRE/DevOps) to lead reliability, observability, and performance efforts for our most critical applications. This role bridges development, operations, and product, ensuring our systems are robust and scalable and drive superior business outcomes. The Senior Observability & Monitoring Engineer will design and optimize monitoring strategies, automate operational tasks, and serve as a technical mentor for reliability within the R&D organization.
Key Responsibilities:
- Architect, implement, and maintain advanced monitoring, logging, and alerting solutions using Datadog (mandatory), covering infrastructure, application, and business-level metrics.
- Lead and optimize reliability, performance, and scalability efforts for PostgreSQL, Redis, SQS, K8s, and cloud-native environments.
- Design, build, and maintain automations for operational tasks, deployments, and remediations (Infrastructure-as-Code, CI/CD, self-healing workflows).
- Mentor engineers on reliability engineering best practices, monitoring usage, and troubleshooting methodologies.
- Lead knowledge sharing by producing high-quality documentation, technical presentations, and internal training.
- Perform capacity planning and performance tuning, and proactively address potential bottlenecks or scaling issues.
- Stay current with SRE, DevOps, and cloud trends; evaluate and recommend new tools and approaches for continuous improvement.
Qualifications:
- 7 years of experience in SRE, DevOps, or production engineering roles supporting large-scale distributed systems.
- Expertise architecting and operating monitoring, tracing, and alerting with Datadog (including custom metrics, dashboards, and advanced alerting techniques).
- Experience with additional monitoring/observability platforms (e.g., Prometheus, Grafana, ELK stack).
- Hands-on knowledge of PostgreSQL, Redis, SQS, and Kubernetes (deployment, troubleshooting, scaling, and performance optimization).
- Advanced scripting/programming skills with Python, Bash, or another relevant language.
- Track record of designing and implementing automated solutions (Infrastructure-as-Code, CI/CD pipelines, auto-remediation).
- Strong communication skills, including technical writing, documentation, and presentation to diverse technical audiences.
- Experience working closely with development, product, and architecture teams to embed reliability from the design phase.
- Fluent technical English.
Preferred Qualifications:
- Strong familiarity with SaaS, microservices architectures, and security best practices.
- Cloud certifications (e.g., AWS Certified Solutions Architect, GCP Professional Cloud Engineer) are a plus.
- Deep experience with chaos engineering, performance/load testing, and continuous improvement frameworks.
- Demonstrated ability to mentor engineers, promote reliability culture, and foster knowledge sharing.
Remote Work:
No
Employment Type:
Full-time