Description
As a Site Reliability Engineer at Ford Motor Company, you will play a pivotal role in elevating the performance and dependability of our Marketing and Sales Tech platform. In this essential position, you will collaborate closely with diverse teams across the organization to fortify our systems, ensuring they are robust, scalable, and equipped to efficiently manage the complexities of a global customer base. Your expertise in site reliability will be crucial in driving ongoing enhancements to our technology landscape. This continuous improvement effort is vital to maintaining Ford's leadership in innovation within the automotive industry, helping us set standards in service reliability and satisfaction. Your contributions will directly impact the smooth operation and evolutionary growth of our MS Tech capabilities, aligning with Ford's commitment to excellence and innovation.
Responsibilities
Incident Management & Operational Excellence
- 24/7 Response: Participate in a 24/7 on-call rotation, providing rapid response to critical incidents and ensuring high availability for the NA eCommerce platform.
- Incident Triage: Act as a decisive member of triage teams to diagnose, troubleshoot, and resolve complex production issues, directly contributing to the reduction of Mean Time to Recovery (MTTR).
- Standardization: Diligently execute and contribute to the continuous improvement of operational runbooks and Standard Operating Procedures (SOPs) to ensure consistent incident response.
Problem Management & Root Cause Analysis
- Blameless Culture: Lead and participate in blameless post-mortems and Root Cause Analysis (RCA) sessions to identify systemic weaknesses and implement preventative measures.
- Strategic Engineering: Partner with cross-functional development and platform teams to architect long-term reliability solutions based on RCA findings.
Service Level Management (SLIs, SLOs & Error Budgets)
- Reliability Framework: Define and track meaningful Service Level Indicators (SLIs) and Objectives (SLOs) to measure service availability and performance.
- Stakeholder Alignment: Collaborate with Product Owners to establish acceptable service levels and manage Error Budgets to balance velocity with reliability.
- Performance Insight: Provide critical analysis during monthly release reviews, evaluating the impact of changes on service health and SLO adherence.
Modern Observability & Monitoring
- Full-Stack Visibility: Leverage and optimize Ford's observability suite (Dynatrace, GCP Logging, etc.) to monitor system health and proactively identify anomalies.
- Gap Remediation: Identify and document observability blind spots, implementing technical solutions to ensure comprehensive system visibility.
- Monitoring-as-Code: Manage and refine metric collection, dashboard creation, and alert definitions, using Terraform to provision monitoring infrastructure in line with Ford standards.
- Alerting Strategy: Design robust notification strategies and thresholds to alert stakeholders of KPI/SLO violations using Error Budget signals.
Automation & Reliability Engineering
- Toil Reduction: Champion the elimination of manual, repetitive tasks by developing automation scripts, tools, and streamlined workflows.
- Resilience Engineering: Design and implement self-healing mechanisms to automatically detect and remediate common system failures, reducing the need for manual intervention.
- Next-Gen Tech: Implement and manage AI-driven observability solutions to enhance proactive system monitoring and predictive maintenance.
Collaboration & Communication
- Cross-Functional Leadership: Coordinate with platform and engineering teams to resolve production bottlenecks and drive continuous process improvements.
- Strategic Reporting: Deliver clear, data-driven status reports on system health, incident trends, and SRE initiatives to leadership and program stakeholders.
Qualifications
- Education: Bachelor's degree in computer science or a related field.
- Experience: Minimum of 5 years of professional experience in Site Reliability Engineering or DevOps.
- Cloud Fluency (GCP): Deep hands-on experience with Google Cloud Platform, specifically Cloud Run, GKE, and OpenShift. You understand the nuances of container orchestration and serverless architectures.
- Infrastructure as Code (IaC): Advanced proficiency in Terraform. You must have experience writing reusable modules, managing state, and automating infrastructure provisioning (not just running existing scripts).
- Observability Engineering: Experience in comprehensive system observability using the primary telemetry types: metrics, events, logs, and traces.
- Hands-on experience with Dynatrace (or similar APM tools like Datadog/New Relic), including distributed tracing, synthetic monitoring, and code-level profiling.
- Coding & Scripting: Proficiency in at least one high-level programming language (Java, Python, or Go). You will need to read application code to assist with instrumentation and debug complex production issues.
- Incident Management: Proven experience managing high-severity incidents. You understand the lifecycle of an incident: Triage -> Mitigation -> Resolution -> Blameless Post-Mortem (RCA).
Grade 7 or 8.
#LI-On-Site
#LI-DS2
Required Experience:
IC