Sr. Production Support Engineer
New York City, NY - USA
Job Summary
This is a remote position.
Reporting to the Manager of Production Support & Service Reliability, this role handles incident triage, issue reproduction, environment support, release support, integration failures, defect coordination, and pod escalation support. This role is a key onshore partner to Product and Engineering during live incidents, high-priority releases, and ambiguous production issues.
Key Responsibilities
Incident Response & Triage
Handle incident triage, issue reproduction, support diagnostics, and escalation management across production systems and workflows.
Investigate application, integration, configuration, and environment issues, with a focus on restoring service and clarifying root cause.
Support high-priority incidents that require close coordination with Engineering, Product, or business stakeholders.
Pod & Release Support
Act as a close day-to-day support partner to pods during live releases, stabilization periods, and production issue follow-up.
Support release-watch activities and confirm production-readiness checks are executed consistently.
Help identify support risks before they become material incidents.
Problem Management
Document incidents clearly, coordinate handoffs to Engineering, and help ensure issues are tracked through resolution.
Improve runbooks, issue-pattern documentation, and support evidence to increase the repeatability and speed of response.
Partner with QA and Engineering to reduce escaped defects and recurring support pain points.
Business-Aware Support
Translate technical issues into clear business impact statements for finance-sensitive workflows.
Help distinguish true defects from data issues, process exceptions, or user enablement gaps.
Escalate material issues quickly and clearly.
Requirements
Required Qualifications
5 years of production support, application support, systems support, or software operations experience.
Experience with modern application environments, APIs, integrations, and workflow-driven platforms.
Strong troubleshooting skills and comfort reading logs, tracing workflows, and reproducing issues.
Ability to work closely with engineers, product managers, QA, and business stakeholders.
Comfort handling ambiguous or high-pressure incidents in a disciplined way.
Bachelor's degree preferred.
You Are
Methodical, calm, and accountable.
Strong at translating technical findings into business-relevant language.
Comfortable handling ambiguity and escalation-heavy support work.
A dependable partner during live incidents and releases.
Benefits
Salary plus performance-based bonus.
Actual compensation packages are determined by evaluating a wide array of factors unique to each candidate, including but not limited to skill set, years and depth of experience, education, certifications, cost of labor, and internal equity.
Required Skills
5 years of experience in Production Support, Application Support, or Site Reliability Engineering (SRE).
Strong experience supporting systems in AWS and/or Azure environments.
Experience troubleshooting data pipelines, ETL/ELT processes, and data-related issues.
Strong SQL skills for data investigation and validation.
Experience with monitoring and observability tools (e.g., Datadog, Splunk, New Relic, CloudWatch, Azure Monitor).
Experience with API troubleshooting and microservices-based architectures.
Familiarity with incident management and ticketing systems (e.g., ServiceNow, Jira).
Basic scripting or programming experience (e.g., Python, Bash, or PowerShell).