Foundations Site Reliability Engineer

Job Location : Seattle, USA

Monthly Salary : Not Disclosed

Vacancy : 1

Job Description

SRE Foundations Site Reliability Engineer (Contract)

About this team: Site Reliability Engineering

We are looking for a motivated engineer to join the Foundations team, which is responsible for observability and monitoring in Site Reliability Engineering and guides the digital organization in improving its reliability practice. We are a consultative enablement team that provides guidance and support to product engineering teams in developing high-quality, resilient software systems through monitoring tools and practices. SRE partners with many product engineering teams across digital and beyond to infuse the concepts and practices of reliability into engineering processes and deliverables. The Foundations team owns the management of our monitoring tools and the best practices for using those tools to provide total visibility into our systems. This role requires a vision and strategy for monitoring and for managing it across a disparate organization.

As an SRE engineer, you will be responsible for designing, implementing, and maintaining robust monitoring solutions, creating insightful dashboards, identifying relevant metrics, and driving efficient problem management practices. You will help digital teams identify observability maturity opportunities and roadblocks to success, and you will help clear those roadblocks. You will partner closely with Product Owners and Scrum Masters to manage scope and strike a balance between support and investment work. You are expected to clearly communicate risks to deliverables to your partners.

A day in the life

As an Engineer II on the SRE Foundations team, you are a technical contributor and domain leader in observability and reliability. Your day-to-day responsibilities include:

  • Observability & Monitoring

    • Design, implement, and optimize observability solutions across metrics, logging, and tracing.

    • Build and maintain dashboards and alerts (e.g. Datadog) that provide meaningful insight into system health and performance.

    • Define and support adoption of Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets.

  • Incident & Problem Management

    • Participate in and lead incident response efforts during major outages and critical events.

    • Support on-call rotations, particularly during key business events (e.g. product launches, holiday traffic).

    • Conduct and contribute to Root Cause Analyses (RCAs) and post-incident reviews, driving follow-up actions and long-term remediation plans.

    • Collaborate with partner teams to enhance incident playbooks, reduce mean time to detect (MTTD) and mean time to resolve (MTTR), and improve operational readiness.

    • Apply principles of the ITIL framework in areas such as incident, problem, and change management, ensuring alignment with organizational reliability goals.

  • Team Collaboration & Enablement

    • Partner with digital product teams to integrate observability best practices into their development and deployment workflows.

    • Identify tooling and knowledge gaps; champion improvements and automation initiatives that reduce toil and increase visibility.

    • Support product owners and engineering leads with prioritization across support, investment, and innovation work.

    • Mentor junior team members and advocate for team-wide knowledge sharing and continuous improvement.

  • Continuous Improvement & Strategic Contribution

    • Stay up to date with SRE and observability trends, helping to evaluate and adopt new tools and approaches.

    • Contribute to domain-level standards and practices within the broader technology organization.

    • Influence reliability strategy by sharing insights, performance metrics, and feedback on what's working and what's not with senior engineers and technical leadership.

Qualifications
  • Bachelor's degree in Computer Science, Engineering, or equivalent experience.

  • 8 to 12 years of experience in software engineering or SRE, with deep exposure to observability and monitoring.

  • Strong experience with observability tools such as Datadog and Splunk, and with distributed tracing frameworks.

  • Proven track record in incident management, RCA facilitation, and on-call response, especially during critical peak traffic events.

  • Understanding of ITIL concepts, including Incident, Problem, and Change Management.

  • Experience building and maintaining dashboards, alerts, and SLOs/SLIs.

  • Strong debugging and root cause analysis skills across complex distributed systems.

  • Excellent collaboration, documentation, and communication skills.

  • Familiarity with infrastructure-as-code (e.g. Terraform), Kubernetes, and cloud-native systems.

  • Relevant certifications (e.g. Certified Kubernetes Administrator, Terraform Associate) are a plus.

Bonus

  • Deep expertise in observability tooling (Datadog, Splunk).

  • Prior experience in e-commerce or high-availability digital platforms.

  • Background in product ownership or leading reliability-focused initiatives.

Must haves
  • Acknowledges the presence of choice in every moment and takes personal responsibility for their life.

Required Skills : Terraform, Kubernetes, Splunk

Background Check : No

Drug Screen : No

Employment Type : Full-time
