Production Support & Service Manager
New York County, NY - USA
Job Summary
This is a remote position.
Reporting to the Head of Engineering, this role owns the production support operating model, incident management, service levels, release-watch support, escalation governance, and overall live-service quality across pods. It ensures Production Support is proactive, disciplined, and tightly connected to Engineering, Product, QA, and business stakeholders.
Key Responsibilities
Service Reliability Leadership
Define the production support operating model, incident lifecycle, severity framework, and support expectations across pods.
Establish service levels, escalation paths, release-watch routines, and communication standards for production issues.
Create visibility into live-service health, incident trends, and recurring support risks.
Team & Process Leadership
Lead production support engineers, AI operations analysts, and the finance-domain support lead.
Build repeatable support routines, runbook discipline, and issue triage practices that scale across products and workflows.
Partner with Product, Engineering, and QA to improve supportability and reduce recurring production issues.
Incident & Stakeholder Management
Own incident governance for material issues, including severity calls, stakeholder updates, escalation management, and stabilization plans.
Ensure business and technical stakeholders have clear visibility into impact, next steps, and resolution progress.
Drive post-incident review practices that improve resilience and reduce repeat failures.
Ensure finance-sensitive workflows receive the right level of production support, issue classification, and escalation handling.
Oversee support patterns for AI-enabled workflows, including degraded outputs, fallback scenarios, trust issues, and human-review triggers.
Work with business and technical teams to distinguish software defects from data, process, training, or model-behavior issues.
Requirements
Required Qualifications
8 years of production support, application support, service reliability, or engineering operations experience, including team leadership.
Strong knowledge of incident management, service operations, support processes, release support, and escalation discipline.
Experience working closely with engineering and product teams in modern software delivery environments.
Ability to communicate clearly with both technical and business stakeholders during high-pressure situations.
Comfort operating in finance-sensitive, workflow-heavy, or business-critical application environments.
Bachelor's degree preferred.
You Are
Structured, pragmatic, and highly credible.
Calm under pressure and comfortable making judgment calls with incomplete information.
A builder of reliable operating processes, not just a responder to tickets.
Focused on trust, transparency, and service quality.
Benefits
Salary plus performance-based bonus.
Actual compensation packages are determined by evaluating a wide array of factors unique to each candidate, including but not limited to skill set, years and depth of experience, education, certifications, cost of labor, and internal equity.
Required Skills:
7 years of experience in Production Support, Application Support, or IT Service Management.
Experience supporting systems in cloud environments (AWS and/or Azure).
Strong understanding of incident, problem, and change management processes.
Experience supporting data platforms, ETL pipelines, and reporting tools.
Familiarity with monitoring and observability tools (e.g., Datadog, Splunk, New Relic, or similar).
Strong experience with SQL and data troubleshooting.
Experience managing or leading support teams or service operations.
Strong communication skills, with the ability to interact with both technical and business stakeholders.