Software Engineer II, Reliability

Dublin - Ireland

Monthly Salary: Not Disclosed

Posted on: 30+ days ago

Vacancies: 1 Vacancy

Job Summary

At Klaviyo, we value the unique backgrounds, experiences, and perspectives each Klaviyo (we call ourselves Klaviyos) brings to our workplace each and every day. We believe everyone deserves a fair shot at success and appreciate the experiences each person brings beyond the traditional job requirements. If you're a close but not exact match with the description, we hope you'll still consider applying. Want to learn more about life at Klaviyo? See how we empower creators to own their own destiny.

Software Engineer II, Reliability (Dublin)

Team Overview:

As a Software Engineer II, Reliability, you will help ensure Klaviyo's critical platforms are reliable, scalable, and sustainable while enabling rapid product development.

We treat reliability as a core product feature and use software engineering to solve complex systems and operational challenges. Our work spans infrastructure, security, and software engineering and focuses on building and operating systems that are reliable, secure, and performant at scale.

The SRE team's charter is to build and operate foundational services and infrastructure, reduce operational toil through automation, and continuously improve systems based on real production learnings. Your work will directly impact how Klaviyo engineers build software and how customers experience our platform every day.

How You'll Make an Impact:

As a Software Engineer II, Reliability, you will contribute to the reliability and operational excellence of Klaviyo's platforms by working on well-scoped projects and owning services with support from senior engineers. You will:

Build, operate, and improve production systems with a focus on reliability, scalability, and performance
Apply software engineering principles to automate operational tasks and reduce manual toil
Contribute to the design and implementation of systems using established SRE best practices
Help define and measure SLIs and SLOs for services you support
Improve observability through metrics, dashboards, logging, and tracing
Participate in on-call rotations and respond to production incidents with guidance and support
Assist with incident investigation and contribute to post-incident reviews and follow-up actions
Perform basic analysis around system behavior, capacity usage, and scaling characteristics
Identify reliability issues or operational pain points and work with teammates to address them
Collaborate with product, platform, and security engineers to ship reliable systems
Write and maintain clear operational runbooks and system documentation

Who You Are:

You are an early-to-mid career SRE who is comfortable operating production systems and eager to deepen your expertise in reliability. You:

Have experience operating cloud-native production systems and services
Write production-quality code (e.g. Python, Go, or similar) to automate operations and improve reliability
Understand common failure modes in distributed systems, such as dependency failures, resource exhaustion, and partial outages
Have experience working with containerized workloads and platforms (e.g. Kubernetes) in production environments
Are comfortable participating in on-call rotations and diagnosing straightforward production issues
Have experience using observability tools and responding to alerts
Are familiar with SRE concepts such as SLIs, SLOs, and error budgets, and are learning how to apply them in practice
Have hands-on experience with infrastructure as code or declarative configuration (e.g. Terraform, Kubernetes manifests)
Can follow incident response processes and contribute meaningfully during outages
Are comfortable receiving feedback, learning from incidents, and improving your systems over time
Have already experimented with AI in work or personal projects and are excited to dive in and learn fast. You're hungry to responsibly explore new AI tools and workflows, finding ways to make your work smarter and more efficient.

Nice to Have:

Experience supporting security-sensitive systems or internal platforms
Familiarity with AWS or other cloud providers
Exposure to messaging or asynchronous systems (e.g. Kafka, RabbitMQ, Celery)
Interest in performance testing, capacity planning, or resilience work
Practical experience with algorithms and data structures

Tech Stack:

Klaviyo's platform is primarily built with Python and React and runs on AWS. Engineers join us from a wide range of technical backgrounds and are supported in learning our stack.

Core technologies include:

Python / Django / FastAPI
MySQL / Redis / Memcached
RabbitMQ / Celery / Apache Kafka / Apache Pulsar
AWS / Terraform / Kubernetes

Location & Work Model:

This role is based in Dublin, Ireland, and follows a hybrid working model. Klaviyo supports work authorization and relocation for this position.

At Klaviyo, we value people who take ownership, learn continuously, and collaborate openly. We are committed to building inclusive teams and encourage applications from candidates of all backgrounds.

We use Covey as part of our hiring and/or promotional process. For jobs or candidates in NYC, certain features may qualify it as an AEDT. As part of the evaluation process, we provide Covey with job requirements and candidate-submitted applications. We began using Covey Scout for Inbound on April 3, 2025.

Please see the independent bias audit report covering our use of Covey here

About Company

Klaviyo

Klaviyo unifies AI-powered email marketing and SMS to drive growth, retention, and measurable results. Build personalized, omnichannel experiences across WhatsApp, ecommerce, and more with K:AI Agents.