Senior Platform & Reliability Engineer (SRE)

San Francisco, CA - USA

Monthly Salary: Not Disclosed

Posted on: 30+ days ago

Vacancies: 1

Job Summary

About Vizcom

Vizcom is a visual creation platform that combines modern web tooling with AI-powered workflows. Our stack includes a React/TypeScript frontend, Node/Koa and PostGraphile API services, PostgreSQL, Redis, BullMQ queues, and Kubernetes-based production infrastructure.

We're hiring a senior owner of stability and infrastructure to ensure the platform is reliable, fast, and resilient as we scale.

Role Mission

Own service reliability end-to-end: prevent incidents, reduce blast radius when failures happen, and lead fast, high-quality recovery when production degrades.

This is a hands-on technical leadership role with authority to set reliability standards and enforce production guardrails.

What You'll Own

Reliability bar: Set and enforce SLIs/SLOs/error budgets for critical user flows.

Production architecture resilience: Drive failure isolation across API, workers, queues, and dependencies so one subsystem cannot take down core access.

Kubernetes runtime reliability: Define probe contracts, rollout/rollback standards, graceful shutdown behavior, scaling/resource policies, and startup safety.

Queue and job safety (BullMQ/Redis): Own poison-pill containment and workload isolation.

Incident command quality: Lead Sev1/Sev2 response end-to-end (containment, communications, technical direction, RCA, and corrective-action execution).

Reliability operating system: Own observability quality (signals over noise), on-call effectiveness, runbooks, and postmortem discipline.

Release safety authority: Gate risky deploys and enforce reliability guardrails when production health is at risk.

Traits We're Looking For

Calm, structured incident commander under pressure.
Thinks in failure modes and blast radius by default.
Pragmatic: can stabilize quickly, then implement durable fixes.
High ownership and strong written communication.

First 90 Days

Establish baseline reliability metrics and identify top platform risks.

Tighten incident response mechanics (roles, comms cadence, runbooks, status updates).

Deliver high-impact hardening fixes across probes/startup paths/queue safety.

Publish a prioritized 6-12 month reliability roadmap with clear ownership and milestones.

If possible, please include one incident you personally led and send to :

1) what failed

2) how you contained it

3) what permanent fixes you shipped and measured.

Required Experience:

Senior IC
