Manager of Site Reliability Engineering (SRE)

Genuine Parts Company

Job Location:

Birmingham, MI - USA

Monthly Salary: Not Disclosed

Posted on: 30+ days ago

Vacancies: 1 Vacancy

Job Summary

SUMMARY:

The Manager of Site Reliability Engineering leads and develops a team of SRE practitioners focused on delivering highly reliable, scalable, and performant cloud-based infrastructure and services. This role ensures the implementation of SRE principles, drives automation, observability, and incident management practices to enhance system reliability, and collaborates across development and operations teams to support continuous delivery and robust cloud platform operations.

You must be eligible to work in the US without visa sponsorship.

JOB DUTIES

Lead, mentor, and grow a high-performing team of Site Reliability Engineers, fostering a culture of ownership, continuous improvement, and operational excellence.

Implement and champion Site Reliability Engineering principles and DevOps best practices within the team to ensure service reliability, availability, and performance.

Define and track key SRE metrics such as service uptime, incident response, and resolution times.

Drive automation efforts, including CI/CD pipeline enhancements, infrastructure-as-code practices, and self-service infrastructure provisioning, to increase deployment velocity while reducing manual toil.

Own and continuously improve observability practices, including system monitoring, logging, alerting, and diagnostics, to ensure rapid issue detection and resolution.

Participate in incident response processes, including incident management, root cause analysis, post-mortems, and continuous improvement, to enhance system resilience.

Partner closely with software engineering, product management, architecture, and security teams to embed reliability and security early in the software development lifecycle (SDLC).

Oversee the management and scalability of cloud infrastructure environments, primarily on Google Cloud Platform (GCP), with a focus on Kubernetes, container orchestration, and hybrid cloud integrations.

Advocate for and apply best practices in performance tuning, capacity planning, and system design for high availability.

Develop and execute a long-term roadmap for our hybrid cloud platform, aligning with evolving business objectives and technology trends.

Establish and monitor key performance indicators (KPIs), service level indicators (SLIs), and service level objectives (SLOs) to drive system health and stability.

EDUCATION & EXPERIENCE

Typically requires a bachelor's degree and 7 years of experience in a technology and/or software engineering role, or an equivalent combination.

KNOWLEDGE, SKILLS, ABILITIES

Experience & Leadership

Proven experience working in large, complex enterprise environments (Fortune 500 or equivalent).

Site Reliability Engineering & DevOps Practices

Strong understanding and demonstrated implementation of Site Reliability Engineering (SRE) principles at scale.

Hands-on experience with infrastructure-as-code (IaC) tools such as Terraform and ArgoCD.

In-depth knowledge and practical experience with CI/CD pipelines and automation of software delivery.

Championing DevOps practices and embedding reliability early in the SDLC.

Significant hands-on experience in Site Reliability Engineering or related roles focused on cloud infrastructure reliability.

Strong software engineering background with proficiency in infrastructure-as-code tools (e.g., Terraform, ArgoCD) and CI/CD automation.

Deep knowledge of cloud platforms, specifically Google Cloud Platform (GCP), Kubernetes, container orchestration, and cloud-native architecture.

Familiarity with monitoring and observability tools such as Dynatrace, Datadog, or equivalents.

Experience managing high-availability systems in 24/7 operational environments.

Ability to collaborate cross-functionally and drive alignment across engineering, product, and security teams.

Tools & Monitoring

Experience with monitoring, logging, and observability platforms.

Familiarity with incident management and performance monitoring tools, including Dynatrace and Datadog.

Proficient in cloud deployment tooling and automation frameworks.

Experience with Azure DevOps (ADO) or equivalent CI/CD tools.

Core Technical Skills

Strong software engineering and infrastructure background.

Solid understanding of Kubernetes, container orchestration, cluster management, and elastic scalability.

Experience with API-driven, event-driven, and microservices architectures.

Skilled in performance diagnostics, capacity planning, tuning, and system architecture for high-availability systems.

Not the right fit? Let us know you're interested in a future opportunity by joining our Talent Community, or create an account to set up email alerts as new job postings that match your interests become available.

GPC conducts its business without regard to sex, race, creed, color, religion, marital status, national origin, citizenship status, age, pregnancy, sexual orientation, gender identity or expression, genetic information, disability, military status, status as a veteran, or any other protected characteristic. GPC's policy is to recruit, hire, train, promote, assign, transfer, and terminate employees based on their own ability, achievement, experience, and conduct and other legitimate business reasons.

Required Experience:

Manager
