Job Title: SRE Lead
Location: Atlanta, GA (Day 1 hybrid)
Hybrid: Work from office Thursday to Wednesday on alternate weeks
Onsite SRE Lead (10 yrs)
Core Skillset
Client Consulting:
- Work with the team to define the SRE maturity model, observability strategy, and AWS reliability roadmap, and to identify gaps.
- Translate business SLAs into SLIs/SLOs/Error Budgets.
Architecture & Design:
- Lead and implement AWS serverless reliability architecture (multi-region failover, self-healing).
- Define observability blueprints (logs, metrics, traces, UX telemetry).
- Define cost-optimized Data Observability and Resiliency solutions.
Reliability & Resilience:
- Design and implement fault-tolerant, highly available AWS architectures.
- Experience with DynamoDB global tables, RDS failovers, and capacity planning.
- Apply SRE principles: SLIs, SLOs, SLAs, error budgets, and toil reduction.
- Drive chaos engineering, disaster recovery, and capacity planning exercises.
Observability & Monitoring:
- Experience in implementing end-to-end observability (logs, metrics, traces, events).
- Build cost-optimized unified dashboards and custom metrics using Dynatrace and CloudWatch.
- Experience in implementing Data Observability and Resiliency solutions.
- Automate alerts, anomaly detection, and incident response workflows.
Automation & Infrastructure:
- Develop automation and custom tooling using Python and .
- Build infrastructure as code using AWS CDK and CloudFormation.
- Implement self-healing and auto-remediation solutions with AWS serverless services.
Operations & Incident Management:
- Implement AI/ML-driven automation.
- Collaborate with developers for shift-left observability and performance optimization.
- Guide and lead adoption of automation, proactive observability, and self-healing systems.