About SandboxAQ
SandboxAQ is a high-growth company delivering AI solutions that address some of the world's greatest challenges. The company's Large Quantitative Models (LQMs) power advances in life sciences, financial services, navigation, cybersecurity, and other sectors.
We are a global team that is tech-focused and includes experts in AI, chemistry, cybersecurity, physics, mathematics, medicine, engineering, and other specialties. The company emerged from Alphabet Inc. as an independent, growth capital-backed company in 2022, funded by leading investors and supported by a braintrust of industry leaders.
At SandboxAQ, we've cultivated an environment that encourages creativity, collaboration, and impact. By investing deeply in our people, we're building a thriving global workforce poised to tackle the world's epic challenges. Join us to advance your career in pursuit of an inspiring mission in a community of like-minded people who value entrepreneurialism, ownership, and transformative impact.
About the Role
As a Senior Staff Site Reliability Engineer at SandboxAQ, you will be responsible for maintaining and improving the reliability, performance, and scalability of our infrastructure and services. You will work closely with engineering teams to ensure that our systems are resilient, highly available, and optimized for performance. Your expertise will guide the development of reliable software, and you will play a key role in shaping the reliability culture within the organization.
What You'll Do
- Incident Management: Lead efforts in incident response, root cause analysis, and postmortem processes while developing strategies to minimize incidents and reduce recovery times.
- Capacity Planning: Analyze system performance and growth trends and create capacity plans to ensure systems scale appropriately as demand increases.
- Monitoring & Observability: Design and maintain comprehensive monitoring, logging, and alerting solutions to ensure quick detection and resolution of system anomalies.
- Collaboration with Engineering Teams: Partner with software engineers, product teams, and DevOps to design systems that are both reliable and performant.
- Cost Optimization: Identify opportunities to optimize infrastructure costs while maintaining system reliability and performance.
- Automation & Tools Development: Build and improve automation tools, monitoring systems, and deployment pipelines to streamline operations and increase efficiency.
- Mentorship & Leadership: Mentor junior and mid-level engineers, providing technical leadership and guidance on SRE best practices, incident management, and system design.
- On-Call Rotation: Participate in an on-call rotation to respond to system outages and provide support for mission-critical systems.
About You
- 10 years of experience in Site Reliability Engineering, DevOps, or similar roles.
- Strong experience with cloud platforms (AWS, GCP, or Azure), containerization (Docker, Kubernetes), and infrastructure-as-code (Terraform, CloudFormation).
- Proven ability to lead post-incident reviews and drive continuous improvement in system reliability.
- Excellent communication and collaboration skills, with the ability to work across cross-functional teams.
- Expertise in systems administration, networking, and security in a cloud-native environment.
- Deep understanding of monitoring, observability, and logging tools (Prometheus, Grafana, ELK, Datadog, etc.).
- Proficiency in scripting languages (e.g., Python, Go, Bash) and configuration management tools (e.g., Ansible, Chef, Puppet).
- Experience designing and implementing scalable and reliable microservices architectures.
- Strong knowledge of CI/CD pipelines and related tools (CircleCI, Jenkins, GitLab, etc.).
Nice to Haves
- Experience with large-scale distributed systems and databases (e.g., Kafka, PostgreSQL, Cassandra, MySQL).
- Experience with service mesh (e.g., Istio, Linkerd) and serverless architectures.
- Strong understanding of compliance and security frameworks.
- Familiarity with chaos engineering practices and tools (e.g., Gremlin, Chaos Monkey).
The US base salary range for this full-time position is expected to be $183k-$280k per year. Our salary ranges are determined by role and level. Within the range, individual pay is determined by factors including job-related skills, experience, and relevant education or training. This role may be eligible for annual discretionary bonuses and equity.
SandboxAQ welcomes all.
We are committed to creating an inclusive culture where we have zero tolerance for discrimination. We invest in our employees' personal and professional growth. Once you work with us, you can't go back to normalcy, because great breakthroughs come from great teams and we are the best in AI and quantum technology.
We offer competitive salaries, stock options (depending on employment type), generous learning opportunities, medical/dental/vision, family planning/fertility, PTO (summer and winter breaks), financial wellness resources, 401(k) plans, and more.
Equal Employment Opportunity: All qualified applicants will receive consideration regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, or Veteran status.
Accommodations: We provide reasonable accommodations for individuals with disabilities in job application procedures for open roles. If you need such an accommodation, please let a member of our Recruiting team know.
Required Experience:
Staff IC