Senior Site Reliability Engineer

Job Location: Toronto, Canada

Monthly Salary: Not Disclosed

Vacancy: 1

Job Description

About Tubi:

Boldly built for every fandom, Tubi is a free streaming service that entertains over 100 million monthly active users. Tubi offers the world's largest collection of Hollywood movies and TV shows, thousands of creator-led stories, and hundreds of Tubi Originals made for the most passionate fans. Headquartered in San Francisco and founded in 2014, Tubi is part of Tubi Media Group, a division of Fox Corporation.

About the Role:

Site Reliability Engineering (SRE) at Tubi is not a traditional operations team. We are a software engineering organization that applies a developer's mindset and toolkit to the challenges of building and running large-scale distributed systems. Our mission is to engineer resilience from the ground up, enabling our product teams to innovate rapidly while ensuring our users have a stellar experience. We own the availability, latency, performance, and capacity of our platform, and we achieve our goals through a culture of data-driven decision-making, blameless learning, and relentless automation.

As a Senior Site Reliability Engineer, you are a hands-on engineer who blends deep software development expertise with a passion for operational excellence. You will be responsible for designing, building, and running the resilient, scalable, and increasingly self-healing systems that power our products. You will apply sound engineering principles to solve our most complex reliability challenges, with a mandate to automate everything, eliminate toil, and write robust, maintainable code. You will be a force multiplier, mentoring other engineers and elevating the site reliability bar for the entire organization.

What You'll Do:

  • System Architecture & Design: Design, build, and maintain scalable, highly available, and fault-tolerant distributed systems. Partner with development teams as a reliability consultant, reviewing designs and influencing architectural decisions to ensure new services are built with reliability, observability, and performance as core principles, not afterthoughts.
  • Automation & Software Development: Write robust, performant, and maintainable code to automate operational tasks and CI/CD pipelines. Build the internal tools, libraries, and frameworks that enable engineering teams to self-service their observability needs, reducing cognitive load and increasing their velocity.
  • Incident Response & Post-Mortem Analysis: Participate in a 24/7 on-call rotation, acting as a key technical leader and incident commander during critical service disruptions. Conduct deep, blameless root cause analyses (RCAs) that go beyond immediate fixes to identify and address systemic issues. Drive the implementation of corrective actions to prevent the recurrence of incidents.
  • Performance & Capacity Planning: Proactively monitor, measure, and optimize system performance to ensure low latency and high efficiency. Gather and analyze metrics from operating systems and applications to assist in performance tuning and fault finding. Analyze usage patterns and historical data to forecast capacity needs, ensuring our platform stays ahead of customer demand.

Your Background:

  • Bachelor's degree in Computer Science, a related technical field, or equivalent practical experience.
  • 5 years of professional experience in a Site Reliability Engineering, DevOps, or Software Engineering role with a focus on infrastructure and operations.
  • Strong programming proficiency in one or more high-level languages such as Rust, Go, Python, or TypeScript. You should be comfortable writing, testing, and deploying production-grade code.
  • Deep knowledge of AWS services (especially networking, IAM, EKS, ALBs/NLBs, Route 53, CloudWatch).
  • Proven experience with Kubernetes in production (EKS preferred), including service exposure, networking, and availability engineering.
  • A solid understanding of Linux/Unix operating systems, networking fundamentals (TCP/IP, DNS, HTTP), and the architecture of modern distributed systems.

Preferred Qualifications (Nice-to-Haves)

  • Experience building and managing large-scale monitoring and observability systems using tools like Datadog, Prometheus, Grafana, etc.
  • Expertise in designing and implementing CI/CD pipelines using tools such as GitHub Actions, ArgoCD, etc.
  • Experience with distributed storage technologies (e.g., Amazon S3) and databases (e.g., PostgreSQL, ScyllaDB, ClickHouse, etc.).
  • Contributions to open-source projects in the SRE, DevOps, or cloud-native ecosystem.

The AI Mandate: Building the Future of Observability with AI

As a Senior SRE, you will be at the forefront of applying AI to solve our most critical reliability challenges. This is a hands-on software development role where the product you build is an intelligent, automated reliability platform. Your responsibilities will include:

  • Building AI-Driven Automation: Building and integrating solutions that leverage our AIOps platform. This involves writing the code that consumes signals from the AI system, correlates disparate data sources, automates responses to AI-detected anomalies, and builds self-healing systems triggered by predictive alerts. You will transform AI insights into concrete reliability improvements.
  • Leveraging AI for Code Development: Utilizing AI-assisted coding tools (e.g., Claude Code, Cursor) as a force multiplier in your daily workflow. You will leverage these assistants to write high-quality automation scripts, Terraform modules, Kubernetes manifests, and observability dashboards faster and more efficiently, while applying your expertise to validate and refine their output.
  • Enriching our AI Knowledge Base: Developing and enriching our observability platform's internal knowledge base. You will be responsible for creating and documenting high-quality runbooks and procedural guides that can be ingested and used by AI assistants to provide context-aware troubleshooting guidance to the on-call engineer during an incident.
  • Applying Data Science to Reliability: Treating reliability as a data science problem. You will analyze vast sets of telemetry data to identify trends, build predictive models for system capacity, and proactively identify performance bottlenecks and potential failure modes before they can impact our users.

#LI-Hybrid

Tubi is a division of Fox Corporation, and the FOX Employee Benefits summarized here cover the majority of all US employee benefits. The following distinctions outline the differences between the Tubi and FOX benefits:

We are an equal opportunity employer, and all qualified applicants will receive consideration for employment without regard to race, color, religion, sex, national origin, gender identity, disability, protected veteran status, or any other characteristic protected by law. We will consider for employment qualified applicants with criminal histories, consistent with applicable law.


Required Experience:

Senior IC

Employment Type

Full Time
