Lead Site Reliability Engineer, Site Reliability Engineering (Dublin)
Team Overview
As a Lead Site Reliability Engineer, you will set technical direction and lead reliability strategy for Klaviyo's most critical platforms. You'll ensure our systems are reliable, scalable, and sustainable while enabling rapid product development across the company.
We treat reliability as a core product feature. Our work spans security, infrastructure, and software engineering, requiring deep systems thinking and strong technical leadership. We build foundational services that must be extremely reliable, secure, and performant at global scale.
The SRE team's charter is to design, build, and operate foundational infrastructure and services; define reliability standards; reduce operational toil through automation; and continuously improve systems based on production learnings. As a lead, your work will be highly visible and will directly influence how Klaviyo builds software and how customers experience our platform every day.
How you'll make an impact
As a Lead Site Reliability Engineer, you will provide technical leadership while remaining hands-on with the systems that underpin Klaviyo's reliability and operational excellence. You will:
- Set the technical vision and long-term strategy for reliability, availability, and operational excellence across critical platforms
- Lead the design, implementation, and evolution of foundational, security-critical services with strong guarantees around availability, scalability, latency, and fault tolerance
- Drive adoption of SRE best practices across engineering teams, including SLIs, SLOs, error budgets, and reliability-based decision making
- Identify systemic reliability risks and architectural bottlenecks, and lead cross-team initiatives to address them with durable, preventative solutions
- Apply software engineering principles to automate infrastructure, eliminate operational toil, and improve system reliability at scale
- Own and continuously improve observability, alerting, and incident response practices to reduce mean time to detection and recovery
- Guide on-call strategy and operational processes to ensure sustainability, automation, and healthy operational load
- Perform and lead quantitative analysis around system behavior, capacity planning, scaling limits, and performance characteristics
- Partner closely with product, platform, and security leaders to influence system architecture early and ensure reliability is built in from the start
- Lead incident response for high-severity events, driving effective mitigation, communication, and follow-up
- Mentor senior and mid-level engineers, raising the bar for technical quality, operational maturity, and reliability culture across the organization
- Review and influence technical designs, platform APIs, operational runbooks, and system documentation at an organizational level
- You've already experimented with AI in work or personal projects, and you're excited to dive in and learn fast. You're hungry to responsibly explore new AI tools and workflows, finding ways to make your work smarter and more efficient.
Who you are
You are a senior technical leader who combines deep systems expertise with strong judgment and influence.
You:
- Are a cloud-native, platform-focused SRE who uses software to design and operate highly reliable production systems at scale
- Write and maintain production-quality code (e.g., Python, Go, or similar) to build internal platforms, automate operations, and improve system reliability
- Have led the design and operation of distributed, cloud-native systems and deeply understand failure modes such as partial outages, dependency failures, resource saturation, and cascading impact
- Have extensive experience operating containerized workloads and platforms (e.g., Kubernetes) in production, including deployment strategies, scaling behavior, and service networking
- Are comfortable owning on-call strategy and participating in escalation for complex production incidents
- Have designed and evolved observability platforms and alerting strategies that reflect real customer and service impact
- Apply SRE concepts such as SLIs, SLOs, error budgets, and burn-rate-based alerting to guide engineering priorities and trade-offs at a team or org level
- Have strong hands-on experience with infrastructure as code and declarative configuration (e.g., Terraform, Kubernetes manifests, policy-as-code)
- Have led capacity planning, load testing, and performance analysis efforts for large-scale distributed systems
- Drive high-quality post-incident reviews and ensure concrete, code-focused follow-up actions are delivered and sustained
- Are comfortable leading technical discussions, influencing architecture, and providing clear guidance across multiple teams
Nice to have
- Experience leading or supporting critical platforms or internal tooling
- Familiarity with identity, access management, secrets management, or policy enforcement systems
- Experience operating systems at scale in cloud environments (AWS preferred)
- Background in resilience testing, fault injection, or chaos engineering
- Strong understanding of algorithms and data structures as they apply to large-scale systems
Tech Stack
Klaviyo's platform is primarily built with Python and React and runs on AWS. Engineers join us from a wide range of technical backgrounds and are supported in learning our stack.
Core technologies include:
- Python / Django / FastAPI
- MySQL / Redis / Memcached
- RabbitMQ / Celery / Apache Kafka / Apache Pulsar
- AWS / Terraform / Kubernetes
Location & Work Model
This role is based in Dublin, Ireland, and follows a hybrid working model. Klaviyo supports work authorization and relocation for this position.
At Klaviyo, we enjoy tackling meaningful engineering challenges and value people who take ownership, learn continuously, and collaborate openly. We are committed to building inclusive teams and encourage applications from candidates of all backgrounds.
We use Covey as part of our hiring and/or promotional process. For jobs or candidates in NYC, certain features may qualify it as an AEDT. As part of the evaluation process, we provide Covey with job requirements and candidate-submitted applications. We began using Covey Scout for Inbound on April 3, 2025.
Please see the independent bias audit report covering our use of Covey here
Required Experience:
IC