Senior Site Reliability Engineer
Job Summary
- Define and maintain SLIs/SLOs; monitor alignment and error budget usage
- Lead incident response and postmortems; implement corrective measures
- Automate operations tasks via tooling (e.g., auto-remediation, scaling rules)
- Build, improve, and maintain CI/CD pipelines, canary deployments, and blue/green strategies
- Lead technical discussions with customers to align on reliability, scalability, and performance requirements
- Drive continuous platform improvements across the service lifecycle, including architecture, monitoring, and operational processes
- Implement and extend observability systems (metrics, tracing, log aggregation)
- Optimize performance and cost by tuning cloud services, autoscaling, and resource rightsizing
- Design, deploy, and operate containerized workloads using Docker and Kubernetes in production environments
- Collaborate with dev teams to integrate resilience patterns (circuit breakers, bulkheading)
- Participate in architecture discussions around high availability and disaster recovery
- Mentor mid-level and junior SREs; conduct reliability design reviews
- 5-8 years of experience in a reliability or operations role
- Cloud-agnostic certification: Terraform Associate, Certified Kubernetes Administrator (CKA), or SRE Foundation
- Cloud provider certification: Professional-level certification in AWS (Solutions Architect), Azure (Solutions Architect Expert), GCP (Professional Cloud Architect), or Oracle Cloud (Architect Professional)
- Solid coding skills (Python, Go, or equivalent)
- Experience with IaC and CI/CD pipelines
- Comfortable with monitoring/observability stacks (Prometheus, Grafana, OpenTelemetry, ELK, Jaeger)
- Experience working with distributed systems and production-scale services
Nice-to-have Skills
- Exposure to multi-cloud data replication or cross-cloud networks
- Experience with chaos engineering or fault injection
Required Experience:
Senior IC
About Company
Datavail is a leading provider of data management, application development, analytics, and cloud services, with more than 1,000 professionals helping clients build and manage applications and data via a world-class tech-enabled delivery platform and software solutions across all leadi ...