Site Reliability Engineer

New York City, NY - USA

Monthly Salary: Not Disclosed

Posted on: 30+ days ago

Vacancies: 1 Vacancy

This job posting is outdated and the position may be filled

Job Summary

Job description

Site Reliability Engineer

New York, NY or Los Angeles, CA (Hybrid)

10 years

We are looking for an experienced Site Reliability Engineer to join our team and ensure our systems remain reliable, scalable, and performant, especially during high-visibility, high-traffic events. This role focuses on proactively preparing infrastructure and services for major events, designing scalable solutions, automating workflows, and acting as a first responder during live incidents.

You will collaborate with engineering, product, and event operations teams to make sure our customers experience smooth, uninterrupted service, even at massive scale.

What You'll Do

Serve as an on-call point of contact during live events

Monitor system health in real time and proactively mitigate performance issues

Rapidly diagnose and mitigate production issues under pressure

Lead post-event reviews, analyzing performance data and incident timelines

Document learnings and recommendations to improve reliability at scale

What We're Looking For

Strong experience with cloud infrastructure (GCP, AWS, or Azure)

Proficiency in infrastructure-as-code (Terraform, CloudFormation) and CI/CD pipelines

Deep knowledge of container orchestration (Kubernetes, ECS) and distributed systems

Hands-on experience with observability tools (Prometheus, Grafana, New Relic)

Comfortable leading incident response calls and collaborating under pressure

Team player who collaborates across many cross-functional teams

A growth mindset with a passion for automation and continuous improvement

Skills

Mandatory Skills: Cloud infrastructure (GCP, AWS, or Azure), Terraform, CloudFormation
