As a hands-on principal for compute, networking, storage, runtimes (e.g., Kubernetes), CI/CD, and observability, you'll architect the service platform that lets teams ship fast and safely. This is an IC role with no direct reports: you lead via design, code, and incident excellence, setting technical standards and SLOs for platform services.
What You'll Do
- Architect and evolve the Kubernetes platform, service mesh, networking, storage, and CI/CD pipelines; ship golden paths and IaC modules.
- Define platform SLOs; use error budgets to guide reliability vs. velocity tradeoffs; drive incident learning and readiness reviews.
- Improve developer velocity (build/deploy times, flaky tests, local dev ergonomics) with measurable results.
- Lead capacity planning and commitments; build guardrails for cost, security, and compliance with Security/FinOps partners.
- Write high-impact code, automation, and tooling; mentor across teams and raise the bar on operational excellence.
- Embed AI in the developer experience, from code generation to observability and incident response, so teams ship faster and safer by default.
Who You Are
- Experience: 10 years building and operating cloud platforms (compute, networking, storage, runtimes like Kubernetes) with a track record of multi-region HA and SLO rigor.
- Technical expertise: Deep in Kubernetes, service mesh, Terraform/IaC, CI/CD, and production observability; you ship golden paths and guardrails that lift the whole org.
- Experience with databases and storage systems, including SQL and NoSQL databases and object, block, or file storage platforms.
- AI tools & automation: You've brought AI into platform engineering, from copilot-assisted workflows and intelligent test generation to AIOps for incident triage, anomaly detection, and runbook automation, with clear security and cost boundaries.
- Ops leadership: You lead via design reviews, incident excellence, and SLO/error-budget tradeoffs communicated in business terms.
- AI fluency: You're hands-on with AI tools and help teams adopt them responsibly.
Nice to Haves
- Core SLOs & velocity: 99.95% SLOs for core services; 25–50% faster build/deploy times; developer-reported friction trending down.
- AI-enabled platform: Approved AI tooling is integrated into IDE/CI/CD with repo policies and auditability; 70% MAU among eligible engineers; MTTR down 20–30% via AI-assisted triage; flaky-test rate decreases through targeted AI-suggested fixes.
- Guardrails in place: Cost, security, and compliance controls are codified as IaC modules and enforced in paved roads.
- Experience with enterprise governance including compliance and audit requirements.
- Familiarity with GDPR and data privacy considerations in large-scale production environments.
Success in 6–12 Months
- 99.95% SLOs for core services; 25–50% faster build/deploy times; reduced developer-reported friction; incident recurrence trending down.
We use Covey as part of our hiring and/or promotional process. For jobs or candidates in NYC, certain features may qualify it as an AEDT. As part of the evaluation process, we provide Covey with job requirements and candidate-submitted applications. We began using Covey Scout for Inbound on April 3, 2025.
Please see the independent bias audit report covering our use of Covey here
Required Experience:
Staff IC