Recovery & Resiliency Manager (Infrastructure & Production)

Fort Mill, SC - USA

Monthly Salary: Not Disclosed

Posted on: 7 hours ago

Vacancies: 1

Job Summary

Job Title: Recovery & Resiliency Manager (Infrastructure & Production)
Work Mode: Onsite, Fort Mill, SC
Duration: 12 Months

Position Summary:
The Recovery Manager is responsible for ensuring the availability, resilience, and rapid recovery of critical infrastructure and production systems. This role bridges infrastructure engineering and production support to drive always-on capabilities. The manager will define, test, and maintain disaster recovery (DR) plans; implement observability to proactively detect potential outages; lead major incident resolution; and conduct root cause analysis (RCA) to continuously improve service reliability.

Key Responsibilities:

1. Resiliency Planning & Disaster Recovery (DR)

Develop, maintain, and test comprehensive DR plans, runbooks, and Business Impact Analyses (BIA) for hybrid/cloud infrastructure.
Define RTO (Recovery Time Objective) and RPO (Recovery Point Objective) targets, ensuring that infrastructure design meets these requirements.
Lead regular disaster recovery tests, simulation exercises, and tabletop drills, documenting outcomes and tracking remediation actions to closure.
Apply infrastructure-as-code (IaC) principles to automate recovery processes.

2. Production Support & Incident Management

Serve as a primary point of contact (POC) for major infrastructure incidents and high-profile disruptions.
Coordinate technical recovery efforts across cross-functional teams (network, server, storage, database, cloud) during incidents.
Lead Root Cause Analysis (RCA) and post-mortem investigations to identify and deploy countermeasures, ensuring incidents do not recur.
Monitor production system performance and availability, optimizing for high availability (HA).

3. Observability & Monitoring

Develop and promote a company-wide observability platform (e.g., Splunk, Datadog, Prometheus, Grafana) for real-time monitoring of infrastructure health.
Establish and track Service Level Objectives (SLOs) and Service Level Indicators (SLIs).
Implement proactive monitoring, alerting, and automated healing, ensuring fast incident detection and recovery.

4. Leadership & Governance

Provide executive-level reporting on resilience posture, test results, and material risks.
Manage relationships with third-party vendors, partners, and service providers regarding service-level agreements (SLAs).
Ensure adherence to industry frameworks and compliance requirements (e.g., NIST, ISO 22301, ITIL).

Required Skills & Qualifications

Experience: 10 years in IT disaster recovery, business continuity, production support, or infrastructure operations.
Infrastructure: In-depth knowledge of on-premises (VMware, SAN/NAS, Linux/Windows) and cloud (AWS, Azure) environments.
Tools: Proficient in monitoring/observability tools (e.g., Datadog, Splunk, Dynatrace) and backup/replication technologies (e.g., Rubrik, Cohesity, Zerto).
Methodology: Strong understanding of ITIL, DevOps practices, and incident management frameworks.
Soft Skills: Excellent communication skills, crisis management abilities, and the capability to work under pressure.
