Senior Reliability Engineer - Customer Data Platform
CDP MISSION: Our mission is to be the authoritative source of truth for customer data - delivering timely, high-quality data at scale to power the contextual experiences that drive the growth of this company. Every customer profile must be accurate, trusted, and available when it matters, across every touchpoint, for the entire US adult population.
Job Overview
We are seeking a Senior Reliability Engineer to own production excellence for our Customer Data Platform (CDP) - the authoritative source of truth for customer data across the entire US adult population.
An authoritative platform is only authoritative if it is available, secure, and timely. This role ensures exactly that: high availability, operational resilience, and compliance for the critical data systems that power customer experiences across every touchpoint. You will lead 24x7 production support, incident management, platform governance, and security compliance - ensuring CDP remains the trusted foundation the business depends on.
You will act as the bridge between engineering, platform, security, and compliance teams, driving the operational discipline that keeps CDP resilient, secure, and audit-ready at all times.
Job Responsibilities - KTLO Leadership and Production Support
Lead KTLO operations, including 24x7 monitoring, incident management, and on-call processes - understanding that CDP downtime directly impacts customer experiences and business decisions
Oversee production support for data pipelines, APIs, and platform services across Azure and Databricks ecosystems
Manage job orchestration and monitoring (e.g., Control-M), ensuring SLA adherence and timely resolution - because timeliness is a core promise of the authoritative source of truth
Establish and enforce runbooks, SOPs, and escalation procedures tailored to CDP's criticality
Drive root cause analysis (RCA) and implement preventive measures to reduce recurring issues and protect data trust
Job Responsibilities - Reliability Engineering and Operations
Improve system reliability through automation, observability, proactive monitoring, and near-real-time availability targets
Define and track SLAs, SLIs, and SLOs for critical CDP systems - with metrics aligned to data freshness, accuracy, and availability commitments
Partner with engineering teams to implement resiliency patterns, failover strategies, and capacity planning for population-scale data processing
Identify and eliminate operational bottlenecks and manual processes that threaten CDP's reliability and timeliness
Job Responsibilities - Compliance, Security, and Governance
Lead execution of compliance mandates, audits, and regulatory requirements impacting CDP systems - ensuring the platform that holds data for the entire US adult population meets the highest security standards
Manage and remediate security violations, vulnerabilities, and policy breaches with urgency
Oversee access controls, audit readiness, and governance processes in collaboration with security teams - protecting the trust that makes CDP authoritative
Ensure adherence to data protection and privacy standards across all customer data systems
Job Responsibilities - Platform Maintenance and Operational Hygiene
Manage patching, upgrades, and vulnerability remediation across CDP platforms
Lead password and credential rotation processes across systems and integrations
Ensure operational readiness for infrastructure and platform changes, using zero-downtime deployment practices
Coordinate with vendors and platform teams for issue resolution and maintenance activities
Job Responsibilities - Collaboration and Leadership
Lead and coordinate onshore/offshore support teams, ensuring effective coverage and handoffs for 24x7 operations
Partner with Data Engineering, AI/ML, and Platform teams to ensure operability and supportability of all CDP systems
Provide operational readiness reviews for new deployments and features before they enter production
Mentor team members and drive a culture of accountability, ownership, and continuous improvement
Education and Work Experience
Bachelor's degree in Computer Science, Engineering, or a related field
6 years of experience in production support, SRE, or platform operations roles
Proven experience managing 24x7 support models and distributed teams
Experience supporting large-scale data platforms in cloud environments (Azure preferred)
Experience with security, compliance, and audit processes for systems handling sensitive customer data
Experience with job orchestration tools (Control-M or similar)
Solid understanding of data pipelines, ETL/ELT processes, and distributed systems at scale
Experience with monitoring and observability tools (e.g., Azure Monitor, Log Analytics, Splunk, Prometheus)
Familiarity with incident management tools and processes (PagerDuty, ServiceNow, etc.)
Experience with CI/CD pipelines and release management
Knowledge of security practices, access control, encryption, and compliance frameworks relevant to customer data
Scripting experience (Python, Shell) for automation and operational tooling
Knowledge, Skills, and Abilities
Strong operational mindset with unwavering focus on stability, reliability, and uptime for a platform the entire business depends on
Ability to manage high-pressure production incidents and drive resolution with urgency and precision
Deep understanding of why platform reliability and security are foundational to CDP's authority as the source of truth
Strong problem-solving and root cause analysis skills
Excellent coordination and communication across engineering, security, and business teams
Ability to balance short-term fixes with long-term reliability improvements
Leadership skills in managing global support teams and rotations