Senior Platform & Reliability Engineer (SRE)

Vizcom

Job Location:

San Francisco, CA - USA

Monthly Salary: Not Disclosed
Posted on: Yesterday
Vacancies: 1 Vacancy

Job Summary

About Vizcom

Vizcom is a visual creation platform that combines modern web tooling with AI-powered workflows. Our stack includes a React/TypeScript frontend, Node/Koa PostGraphile API services, PostgreSQL, Redis, BullMQ queues, and Kubernetes-based production infrastructure.

We're hiring a senior owner of stability and infrastructure to ensure the platform is reliable, fast, and resilient as we scale.

Role Mission

Own service reliability end-to-end: prevent incidents, reduce blast radius when failures happen, and lead fast, high-quality recovery when production degrades.

This is a hands-on technical leadership role with authority to set reliability standards and enforce production guardrails.


What You'll Own

  • Reliability bar: Set and enforce SLIs/SLOs/error budgets for critical user flows.

  • Production architecture resilience: Drive failure isolation across API, workers, queues, and dependencies so one subsystem cannot take down core access.

  • Kubernetes runtime reliability: Define probe contracts, rollout/rollback standards, graceful shutdown behavior, scaling/resource policies, and startup safety.

  • Queue job safety (BullMQ/Redis): Own poison pill containment and workload isolation.

  • Incident command quality: Lead Sev1/Sev2 response end-to-end (containment, communications, technical direction, RCA, corrective action execution).

  • Reliability operating system: Own observability quality (signals over noise), on-call effectiveness, runbooks, and postmortem discipline.

  • Release safety authority: Gate risky deploys and enforce reliability guardrails when production health is at risk.

Traits We're Looking For

  • Calm, structured incident commander under pressure.

  • Thinks in failure modes and blast radius by default.

  • Pragmatic: can stabilize quickly, then implement durable fixes.

  • High ownership and strong written communication.

First 90 Days

  • Establish baseline reliability metrics and identify top platform risks.

  • Tighten incident response mechanics (roles, comms cadence, runbooks, status updates).

  • Deliver high-impact hardening fixes across probes/startup paths/queue safety.

  • Publish a prioritized 6-12 month reliability roadmap with clear ownership and milestones.

If possible, please include one incident you personally led and send to :

1) what failed

2) how you contained it

3) what permanent fixes you shipped and measured.


Required Experience:

Senior IC

Key Skills

  • Kubernetes
  • FMEA
  • Continuous Improvement
  • Elasticsearch
  • Go
  • Root Cause Analysis
  • Maximo
  • CMMS
  • Maintenance
  • Mechanical Engineering
  • Manufacturing
  • Troubleshooting