Ready to shape the future of AI infrastructure and build systems that power the most advanced unstructured data pipelines in the world?
At Unstructured, we're building the backbone of generative AI, enabling companies to transform PDFs, HTML, Word docs, images, and more into high-performance data pipelines that scale. Our tools are already used by half of the Fortune 500, and our open-source package has been downloaded 26 million times. Now we're entering our next chapter, and we're hiring a Site Reliability Engineer to help scale our systems and safeguard our infrastructure.
If you're energized by reliability, love solving infrastructure challenges at scale, and want to help define how modern AI systems run in production, this is your moment. You'll work closely with Engineering, Product, and Customer teams to build scalable systems, streamline CI/CD, and make reliability a first-class citizen across everything we deploy.
This role is hybrid in San Francisco. Join us in-office 3x a week for deep collaboration, whiteboard sessions, and hands-on impact.
What You'll Own & Drive
Scale & Stability at the Core
Design and implement highly available, observable, and scalable infrastructure across cloud environments
Build resilient systems that meet the demands of enterprise-grade production AI workloads
Automate Everything
Develop Infrastructure-as-Code using Terraform, Pulumi, and others
Own CI/CD automation and build reusable pipelines with GitHub Actions and modern DevOps tooling
Own Kubernetes & Orchestration
Manage and optimize our Kubernetes clusters and containerized environments
Tune Helm charts, service mesh configs, and orchestration systems for performance and security
Obsess Over Observability
Implement and maintain monitoring, logging, and alerting with tools like Prometheus, Grafana, Datadog, and Elastic
Ensure we can see, understand, and respond to system behavior in real time
Drive Production Readiness
Partner with engineering to prepare features and systems for production rollouts
Contribute to capacity planning, deployment strategies, and fault-tolerant system design
Lead Incident Response
Support and lead incident response processes, postmortems, and root cause analysis
Champion a culture of blameless retrospectives and continuous improvement
Accelerate Engineering Velocity
Improve developer experience through tooling, automation, and streamlined feedback loops
Help teams move faster without sacrificing quality or uptime
What You Bring
-4 years in SRE, DevOps, or Infrastructure Engineering roles supporting high-scale production environments
-Deep experience with cloud platforms like AWS, GCP, or Azure
-Expertise in Kubernetes, Docker, and container orchestration at scale
-Strong Linux systems and networking fundamentals
-Scripting and automation skills (Python, Bash, or Go preferred)
-Proficiency with Infrastructure-as-Code (Terraform, Pulumi, Ansible, or similar)
-Solid understanding of monitoring and observability best practices
-A calm, systems-thinking approach to incident response and reliability
Bonus Points
-Experience supporting ML infrastructure or real-time data pipelines
-Exposure to serverless or event-driven architectures
-Contributions to open-source DevOps projects or communities
-Familiarity with security and compliance in cloud-native environments
Why You'll Love It Here
Impact That Matters: Own the core infrastructure behind AI systems used by the Fortune 500
Big Technical Challenges: Solve hard, meaningful problems at the cutting edge of cloud and data
Elite Team: Join a sharp, humble group of engineers who value execution and impact
SF Office Vibes: Collaborate live with real whiteboards and real humans (not just Slack threads)
Flexible Culture: Hybrid structure with async-friendly, low-ego collaboration