Job description
Site Reliability Engineer
New York, NY or Los Angeles, CA (Hybrid)
10 years of experience
We are looking for an experienced Site Reliability Engineer to join our team and ensure our systems remain reliable, scalable, and performant, especially during high-visibility, high-traffic events. This role focuses on proactively preparing infrastructure and services for major events, designing scalable solutions, automating workflows, and acting as a first responder during live incidents.
You will collaborate with engineering, product, and event operations teams to make sure our customers experience smooth, uninterrupted service, even at massive scale.
What You'll Do
Serve as an on-call point of contact during live events
Monitor system health in real time and proactively mitigate performance issues
Rapidly diagnose and mitigate production issues under pressure
Lead post-event reviews, analyzing performance data and incident timelines
Document learnings and recommendations to improve reliability at scale
What We're Looking For
Strong experience with cloud infrastructure (GCP, AWS, or Azure)
Proficiency in infrastructure-as-code (Terraform, CloudFormation) and CI/CD pipelines
Deep knowledge of container orchestration (Kubernetes, ECS) and distributed systems
Hands-on experience with observability tools (Prometheus, Grafana, New Relic)
Comfortable leading incident response calls and collaborating under pressure
Team player who collaborates across many cross-functional teams
A growth mindset with a passion for automation and continuous improvement
Skills
Mandatory Skills: Cloud infrastructure (GCP, AWS, or Azure), Terraform, CloudFormation