Senior Reliability Engineer - Customer Data Platform
Job Location:
Atlanta, GA - USA
Monthly Salary:
Not Disclosed
Posted on:
6 days ago
Vacancies:
1 Vacancy
Job Summary
Overview:
TekWissen is a global workforce management provider headquartered in Ann Arbor, Michigan, that offers strategic talent solutions to our clients worldwide. Our client is a provider of digital technology and transformation, information technology, and services.
Position: Senior Reliability Engineer - Customer Data Platform
Location: Atlanta, GA
Duration: 6 Months
Job Type: Temporary Assignment
Work Type: Onsite
JOB SUMMARY
- We are seeking a Senior Reliability Engineer to own production excellence for our Customer Data Platform (CDP), the authoritative source of truth for customer data across the entire US adult population.
- An authoritative platform is only authoritative if it is available, secure, and timely. This role ensures exactly that: high availability, operational resilience, and compliance for the critical data systems that power customer experiences across every touchpoint.
- You will lead 24x7 production support, incident management, platform governance, and security compliance, ensuring CDP remains the trusted foundation the business depends on.
- You will act as the bridge between engineering, platform, security, and compliance teams, driving the operational discipline that keeps CDP resilient, secure, and audit-ready at all times.
Job Responsibilities:
- KTLO Leadership and Production Support
- Lead KTLO operations, including 24x7 monitoring, incident management, and on-call processes, understanding that CDP downtime directly impacts customer experiences and business decisions
- Oversee production support for data pipelines, APIs, and platform services across Azure and Databricks ecosystems
- Manage job orchestration and monitoring (e.g., Control-M), ensuring SLA adherence and timely resolution - because timeliness is a core promise of the authoritative source of truth
- Establish and enforce runbooks, SOPs, and escalation procedures tailored to CDP's criticality
- Drive root cause analysis (RCA) and implement preventive measures to reduce recurring issues and protect data trust
- Reliability Engineering and Operations
- Improve system reliability through automation, observability, proactive monitoring, and near-real-time availability targets
- Define and track SLAs, SLIs, and SLOs for critical CDP systems, with metrics aligned to data freshness, accuracy, and availability commitments
- Partner with engineering teams to implement resiliency patterns, failover strategies, and capacity planning for population-scale data processing
- Identify and eliminate operational bottlenecks and manual processes that threaten CDP's reliability and timeliness
- Compliance, Security, and Governance
- Lead execution of compliance mandates, audits, and regulatory requirements impacting CDP systems - ensuring the platform that holds data for the entire US adult population meets the highest security standards
- Manage and remediate security violations, vulnerabilities, and policy breaches with urgency
- Oversee access controls, audit readiness, and governance processes in collaboration with security teams - protecting the trust that makes CDP authoritative
- Ensure adherence to data protection and privacy standards across all customer data systems
- Platform Maintenance and Operational Hygiene
- Manage patching, upgrades, and vulnerability remediation across CDP platforms
- Lead password and credential rotation processes across systems and integrations
- Ensure operational readiness for infrastructure and platform changes, with zero-downtime deployment practices
- Coordinate with vendors and platform teams for issue resolution and maintenance activities
- Collaboration and Leadership
- Lead and coordinate onshore/offshore support teams, ensuring effective coverage and handoffs for 24x7 operations
- Partner with Data Engineering, AI/ML, and Platform teams to ensure operability and supportability of all CDP systems
- Provide operational readiness reviews for new deployments and features before they enter production
- Mentor team members and drive a culture of accountability, ownership, and continuous improvement
Education and Work Experience:
- Bachelor's degree in Computer Science, Engineering, or related field
- 6 years of experience in production support, SRE, or platform operations roles
- Proven experience managing 24x7 support models and distributed teams
- Experience supporting large-scale data platforms in cloud environments (Azure preferred)
- Experience with security, compliance, and audit processes for systems handling sensitive customer data
Technical Skills:
- Strong experience with Azure ecosystem (ADLS, Databricks, ADF, Event Hub, etc.)
- Experience with job orchestration tools (Control-M or similar)
- Solid understanding of data pipelines, ETL/ELT processes, and distributed systems at scale
- Experience with monitoring and observability tools (e.g., Azure Monitor, Log Analytics, Splunk, Prometheus)
- Familiarity with incident management tools and processes (PagerDuty, ServiceNow, etc.)
- Experience with CI/CD pipelines and release management
- Knowledge of security practices, access control, encryption, and compliance frameworks relevant to customer data
- Scripting experience (Python, Shell) for automation and operational tooling
Knowledge, Skills, and Abilities:
- Strong operational mindset with unwavering focus on stability, reliability, and uptime for a platform the entire business depends on
- Ability to manage high-pressure production incidents and drive resolution with urgency and precision
- Deep understanding of why platform reliability and security are foundational to CDP's authority as the source of truth
- Strong problem-solving and root cause analysis skills
- Excellent coordination and communication across engineering, security, and business teams
- Ability to balance short-term fixes with long-term reliability improvements
- Leadership skills in managing global support teams and rotations
TekWissen Group is an equal opportunity employer supporting workforce diversity.