Senior Manager of Engineering, Production Infrastructure

Boston, MA - USA

Monthly Salary: Not Disclosed

Posted on: 30+ days ago

Vacancies: 1 Vacancy

Job Summary

Klaviyo powers growth for thousands of businesses, and our R&D teams build on shared platform primitives. As the Senior Manager, Production Infrastructure, you'll lead the teams behind our paved roads (compute runtimes, service networking/ingress, and observability) so product engineers can move fast on a stable, cost-disciplined foundation. You'll publish opinionated defaults (golden paths), instill SLO discipline, and make reliability and developer experience measurable across the company.

This is a hands-on leadership role: you'll stay close to architecture and operations, review designs and PRs, jump into incidents when needed, and prototype reference solutions that set the standard.

How You'll Make a Difference

Own and evolve platform primitives in scope (compute runtimes, service networking/ingress, observability) with clear APIs, SLOs, runbooks, and support tiers.
Lead by example technically: drive design reviews, review PRs, and author reference implementations, starter repos, and Terraform/Helm modules that demonstrate the golden path.
Deliver golden paths and self-service scaffolding; reduce time-to-first-service and lead time for changes.
Raise the bar on reliability: incident response (blameless), alert hygiene, capacity planning, and on-call health.
Be production-close: participate in critical incident response and postmortems; trace issues across Kubernetes, service mesh, and data paths; convert learnings into durable fixes, guardrails, and policy-as-code.
Standardize observability end-to-end: expand OpenTelemetry adoption, define log/trace schemas, and make SLOs and error budgets first-class in dashboards and alerts.
Evolve our Kubernetes and networking layers: plan cluster upgrades, right-size node/Pod configs, harden ingress/gateway policies, and advance mTLS/service identity and traffic shaping.
Advance CI/CD and GitOps: ensure fast, safe deploys with progressive delivery, automatic rollbacks, and pre-prod environments that mirror prod; enforce guardrails via policy-as-code.
Stand up a concise scorecard (SLO coverage, incident frequency/severity, lead time, MTTR, developer platform NPS, cost-to-serve) and drive consistent trend improvements.
Partner with Security, Data Platform, and Product to clarify ownership boundaries and enable safe, fast delivery.
Improve cost-to-serve via quotas, right-sizing, and showback in partnership with Finance.
Transform workflows by putting AI at the center, building smarter systems and ways of working from the ground up; pilot AI-assisted runbooks and incident summarization to shorten resolution time.

Who You Are

7-10 years in infra/SRE/platform, with 3-5 years leading teams (including managers or staff/lead ICs).
Demonstrated SRE practices (SLI/SLO design, incident management, capacity planning) and experience with Kubernetes/container orchestration, service networking, IaC, and modern observability.
Technically credible and hands-on: comfortable reading and discussing code (e.g., Go, Python, or Java), reviewing PRs, and writing small prototypes/tooling when it accelerates the team.
Fluent with Kubernetes internals (scheduling, autoscaling, resource management) and service networking (e.g., Envoy/Istio/Linkerd, API gateways).
Operate the full observability stack (metrics, logs, traces, profiling) and instrument SLIs/SLOs using OpenTelemetry-friendly patterns.
Automate by default: Terraform (or Pulumi), Helm/Kustomize, GitOps, CI/CD; you prefer guardrails and policy-as-code over manual gates.
You write crisp docs/diagrams and define platform contracts that hold up under scale.
You drive measurable developer velocity and reliability improvements and communicate progress with clarity.
You build inclusive, high-trust teams and partner tightly across Security/Product/Finance.
You've already experimented with AI in work or personal projects and are eager to deepen your fluency responsibly.

Nice to Haves

Platforms as a product (DX metrics, roadmaps), event-driven architectures, and cost-to-serve optimization in high-growth SaaS.
Experience contributing to platform code or tooling (e.g., base images, CLI/scaffolding, controllers/operators, admission/policy), multi-cluster or multi-region operations, and progressive delivery.

We use Covey as part of our hiring and/or promotional process. For jobs or candidates in NYC, certain features may qualify it as an AEDT. As part of the evaluation process, we provide Covey with job requirements and candidate-submitted applications. We began using Covey Scout for Inbound on April 3, 2025.

Please see the independent bias audit report covering our use of Covey here

Required Experience:

Senior Manager

About Company

Klaviyo

Klaviyo unifies AI-powered email marketing and SMS to drive growth, retention, and measurable results. Build personalized, omnichannel experiences across WhatsApp, ecommerce, and more with K:AI Agents.