dunnhumby is the global leader in Customer Data Science, empowering businesses everywhere to compete and thrive in the modern data-driven economy. We always put the Customer First.
Our mission: to enable businesses to grow and reimagine themselves by becoming advocates and champions for their Customers. With deep heritage and expertise in retail, one of the world's most competitive markets, with a deluge of multidimensional data, dunnhumby today enables businesses all over the world, across industries, to be Customer First.
dunnhumby employs nearly 2,500 experts in offices throughout Europe, Asia, Africa, and the Americas, working for transformative, iconic brands such as Tesco, Coca-Cola, Meijer, Procter & Gamble, and Metro.
The Cloud Site Reliability Engineer ensures that dunnhumby's cloud-hosted services (both our internally critical and our externally visible systems) have reliability and uptime appropriate to users' needs and a fast rate of improvement, while keeping an ever-watchful eye on capacity and performance.
This is a unique opportunity to help transform how we deliver dunnhumby's cutting-edge customer science and machine learning research and leverage our unique access to big data. At dunnhumby, we are passionate about cloud and open-source technologies and are committed to a long-term and sustained investment in people aligned to our goals.
Key Accountabilities
- Lead and mentor a team of Site Reliability Engineers, fostering a culture of collaboration, learning, and continuous improvement.
- Maintain and support infrastructure services in development, integration, and production environments.
- Design, implement, and manage robust, scalable, and high-performance systems and infrastructure.
- Ensure the reliability, availability, and performance of critical services through proactive monitoring, incident response, and root cause analysis.
- Drive the adoption of automation, CI/CD practices, and infrastructure as code (IaC) to streamline operations and improve operational efficiency.
- Collaborate with development teams to ensure that applications are designed for scalability, reliability, and fault tolerance.
- Define and enforce Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Service Level Agreements (SLAs) to monitor and improve service health.
- Lead incident management, troubleshooting, and postmortems to identify and address operational challenges.
- Manage capacity planning, scaling strategies, and disaster recovery for cloud-based environments (GCP, Azure).
- Drive improvements in operational tooling, monitoring, alerting, and reporting.
- Act as a subject matter expert in reliability engineering best practices and promote these practices across the organization.
- Contribute to creating and improving processes for release management, change management, and configuration management.
- Participate in on-call rotations and respond to production incidents as necessary.
- Review services before they go live in production
- Enforce rigor on incident response and postmortems
- Design proactive monitoring and metrics for supported environments
- Focus on automation to improve scale and reliability
- Identify and propose alternative technologies to create scalable implementations and achieve results
- Coordinate and troubleshoot complex technical issues until resolution
- Identify and prioritize what technical debt will be eliminated
- Identify opportunities to influence the roadmap of infrastructure services
Qualifications
- 8 years of experience in an engineering role with hands-on experience in the public cloud; Google Cloud Platform (GCP) is preferred but not required, along with exposure to another public cloud provider, ideally Azure.
- Strong experience in designing and managing large-scale distributed systems.
- Expertise in cloud technologies (GCP, Azure) and infrastructure automation tools (Terraform, Ansible, Puppet, etc.).
- Proficiency in containerization and orchestration technologies such as Docker, Kubernetes, and Helm.
- Experience with monitoring and observability tools like Prometheus, Grafana, New Relic, or similar.
- Strong knowledge of CI/CD pipelines and related automation tools.
- Proficient in scripting languages like Python, Bash, or Go.
- Strong troubleshooting and problem-solving skills in production environments.
- Experience leading and mentoring engineering teams with a strong focus on collaboration and communication.
- Familiarity with incident management processes and tools (e.g., ServiceNow, xMatters).
- Experience with infrastructure as code (IaC) and version control systems (Git).
- Knowledge of scripting in Python/Bash
- Knowledge of Go programming language
- Knowledge of Ansible & Terraform for writing most of the infrastructure automation
- Experience with Kubernetes
- Understanding of metrics collectors such as Graphite or Prometheus
- Experience with DevOps tools
- Ability to learn and adapt in a fast-paced environment while producing quality code
- Ability to work collaboratively on a cross-functional team with a wide range of experience levels
- Ability to analyse existing services and identify technical debt to work toward increasing sustainability
- Ability to find creative ways to execute even when there is no historical context or known path forward
- Ability to design roadmaps and relevant solutions for end users to access interfaces
- Ability to assess the benefits, risks, and success factors of potential applications
- Strong mentoring and coaching skills that encourage growth for more junior members
What you can expect from us
We won't just meet your expectations. We'll defy them. So you'll enjoy the comprehensive rewards package you'd expect from a leading technology company. But also a degree of personal flexibility you might not expect. Plus, thoughtful perks like flexible working hours and your birthday off.
You'll also benefit from an investment in cutting-edge technology that reflects our global ambition. But with a nimble, small-business feel that gives you the freedom to play, experiment, and learn.
And we don't just talk about diversity and inclusion. We live it every day, with thriving networks including dh Gender Equality Network, dh Proud, dh Family, dh One, and dh Thrive as the living proof. We want everyone to have the opportunity to shine and perform at their best throughout our recruitment process. Please let us know how we can make this process work best for you. For an informal and confidential chat please contact to discuss how we can meet your needs.
Our approach to Flexible Working
At dunnhumby, we value and respect difference and are committed to building an inclusive culture by creating an environment where you can balance a successful career with your commitments and interests outside of work.
We believe that you will do your best at work if you have a work/life balance. Some roles lend themselves to flexible options more than others, so if this is important to you, please raise this with your recruiter, as we are open to discussing agile working opportunities during the hiring process.
For further information about how we collect and use your personal information, please see our Privacy Notice.