Senior Manager of Engineering, Production Infrastructure

Klaviyo

Job Location:

Boston, MA - USA

Monthly Salary: Not Disclosed
Posted on: 7 hours ago
Vacancies: 1 Vacancy

Job Summary

Klaviyo powers growth for thousands of businesses, and our R&D teams build on shared platform primitives. As the Senior Manager, Production Infrastructure, you'll lead the teams behind our paved roads (compute runtimes, service networking/ingress, and observability) so product engineers can move fast on a stable, cost-disciplined foundation. You'll publish opinionated defaults (golden paths), instill SLO discipline, and make reliability and developer experience measurable across the company.

This is a hands-on leadership role: you'll stay close to architecture and operations, review designs and PRs, jump into incidents when needed, and prototype reference solutions that set the standard.

How You'll Make a Difference

  • Own and evolve platform primitives in scope (compute runtimes, service networking/ingress, observability) with clear APIs, SLOs, runbooks, and support tiers.
  • Lead by example technically: drive design reviews, review PRs, and author reference implementations, starter repos, and Terraform/Helm modules that demonstrate the golden path.
  • Deliver golden paths and self-service scaffolding; reduce time-to-first-service and lead time for changes.
  • Raise the bar on reliability: blameless incident response, alert hygiene, capacity planning, and on-call health.
  • Be production-close: participate in critical incident response and postmortems; trace issues across Kubernetes, service mesh, and data paths; convert learnings into durable fixes, guardrails, and policy-as-code.
  • Standardize observability end-to-end: expand OpenTelemetry adoption, define log/trace schemas, and make SLOs and error budgets first-class in dashboards and alerts.
  • Evolve our Kubernetes and networking layers: plan cluster upgrades, right-size node/Pod configs, harden ingress/gateway policies, and advance mTLS/service identity and traffic shaping.
  • Advance CI/CD and GitOps: ensure fast, safe deploys with progressive delivery, automatic rollbacks, and pre-prod environments that mirror prod; enforce guardrails via policy-as-code.
  • Stand up a concise scorecard (SLO coverage, incident frequency/severity, lead time, MTTR, developer platform NPS, cost-to-serve) and drive consistent trend improvements.
  • Partner with Security, Data Platform, and Product to clarify ownership boundaries and enable safe, fast delivery.
  • Improve cost-to-serve via quotas, right-sizing, and showback in partnership with Finance.
  • Transform workflows by putting AI at the center, building smarter systems and ways of working from the ground up; pilot AI-assisted runbooks and incident summarization to shorten resolution time.

Who You Are

  • 7-10 years in infra/SRE/platform, with 3-5 years leading teams (including managers or staff/lead ICs).
  • Demonstrated SRE practices (SLI/SLO design, incident management, capacity planning) and experience with Kubernetes/container orchestration, service networking, IaC, and modern observability.
  • Technically credible and hands-on: comfortable reading and discussing code (e.g., Go, Python, or Java), reviewing PRs, and writing small prototypes/tooling when it accelerates the team.
  • Fluent with Kubernetes internals (scheduling, autoscaling, resource management) and service networking (e.g., Envoy/Istio/Linkerd, API gateways).
  • Operate the full observability stack (metrics, logs, traces, profiling) and instrument SLIs/SLOs using OpenTelemetry-friendly patterns.
  • Automate by default: Terraform (or Pulumi), Helm/Kustomize, GitOps, CI/CD; you prefer guardrails and policy-as-code over manual gates.
  • You write crisp docs/diagrams and define platform contracts that hold up under scale.
  • You drive measurable developer velocity and reliability improvements and communicate progress with clarity.
  • You build inclusive, high-trust teams and partner tightly across Security/Product/Finance.
  • You've already experimented with AI in work or personal projects and are eager to deepen your fluency responsibly.

Nice to Haves

  • Platforms as a product (DX metrics, roadmaps), event-driven architectures, and cost-to-serve optimization in high-growth SaaS.
  • Experience contributing to platform code or tooling (e.g., base images, CLI/scaffolding, controllers/operators, admission/policy), multi-cluster or multi-region operations, and progressive delivery.

We use Covey as part of our hiring and/or promotional process. For jobs or candidates in NYC, certain features may qualify it as an AEDT. As part of the evaluation process, we provide Covey with job requirements and candidate-submitted applications. We began using Covey Scout for Inbound on April 3, 2025.

Please see the independent bias audit report covering our use of Covey here


Required Experience:

Senior Manager

Key Skills

  • Account Management
  • Insurance Management
  • Import & Export
  • Catering Operations
  • Building Electrician
  • Financial Planning & Analysis

About Company

Klaviyo unifies AI-powered email marketing and SMS to drive growth, retention, and measurable results. Build personalized, omnichannel experiences across WhatsApp, ecommerce, and more with K:AI Agents.
