Senior Site Reliability Engineer / DevOps Engineer

Prophet Town


Job Location: Mountain View, CA - USA

Monthly Salary: Not Disclosed
Posted on: 5 hours ago
Vacancies: 1 Vacancy

Job Summary

Senior Site Reliability Engineer (SRE) / DevOps Engineer


Location: Onsite - Mountain View, CA
Experience Required: 5 years
Infrastructure Footprint: Global production infrastructure across AWS, South America, and Europe
Role Type: Hands-on engineering role


Role Overview

Seeking a Senior Site Reliability Engineer / DevOps Engineer to design, scale, and operate highly available global infrastructure supporting production systems across multiple international regions.

This role is for an engineer with 5 years of experience building and running production-grade cloud infrastructure. The right person understands where distributed systems fail and has learned the hard lessons that come from operating Kubernetes and cloud platforms at scale.

The ideal candidate has deep hands-on experience with Kubernetes, ArgoCD, Terraform, CI/CD pipelines, AWS infrastructure, and multi-region platform reliability. They should understand the limitations, sharp edges, and operational failure modes of these tools.

This is an onsite role working closely with platform engineering and leadership to build resilient global infrastructure.

What You'll Do

Global Infrastructure Architecture

  • Design and operate globally distributed production infrastructure across AWS regions and physical data center environments in South America and Europe
  • Build highly available multi-region systems with strong disaster recovery and failover strategies
  • Solve cross-region networking, latency, DNS routing, replication, and reliability challenges

Kubernetes Platform Engineering

  • Build, scale, secure, and troubleshoot production Kubernetes clusters
  • Handle cluster lifecycle management, upgrades, node failures, networking issues, storage problems, and control-plane troubleshooting
  • Tune workloads for resiliency, scheduling efficiency, autoscaling behavior, and resource optimization
  • Debug real-world Kubernetes issues including:
    • etcd instability
    • networking overlays and CNI failures
    • ingress/controller edge cases
    • persistent volume failures
    • node pressure and eviction behavior
    • cluster upgrade regressions

GitOps / ArgoCD Operations

  • Design and maintain GitOps workflows using ArgoCD
  • Manage promotion pipelines across environments and regions
  • Resolve drift detection issues, sync conflicts, reconciliation failures, and deployment ordering challenges
  • Build safe rollback and progressive deployment strategies

Candidates should know why ArgoCD breaks, not just how to click Sync.

Infrastructure as Code

  • Build and maintain reusable Terraform modules for multi-region infrastructure
  • Manage state strategy, workspace isolation, secrets handling, and provider complexity
  • Solve real-world Terraform pain points including:
    • state corruption and locking conflicts
    • module version drift
    • provider upgrade regressions
    • dependency graph surprises
    • cross-account provisioning complexity

CI/CD Engineering

  • Build and optimize production CI/CD pipelines
  • Improve deployment speed, safety, and repeatability
  • Troubleshoot flaky pipelines, artifact inconsistencies, race conditions, environment drift, and rollback failures

Reliability & Observability

  • Establish SLIs/SLOs and production health standards
  • Build alerting, monitoring, tracing, and incident response workflows
  • Lead root cause analysis and postmortem improvements
  • Reduce operational toil through automation

Why This Role

You'll own foundational infrastructure decisions for globally distributed systems and help build resilient platform capabilities at international scale.

This is a hands-on engineering role for someone who wants meaningful ownership and complex technical problems.



Requirements

Required Experience

  • 5 years in Site Reliability Engineering, DevOps, or Platform Engineering
  • Deep production experience with:
    • Kubernetes
    • ArgoCD
    • Terraform
    • AWS
    • CI/CD systems
    • Linux systems administration
    • Infrastructure automation


Preferred Experience

  • Experience operating infrastructure across multiple continents
  • Experience with hybrid cloud or physical data center integration
  • Strong networking knowledge including BGP, VPNs, routing, DNS, and load balancing
  • Experience with security hardening and compliance in production systems
  • Software engineering background with Go, Python, or Bash

What Senior Means Here

You have enough production experience to have strong opinions because you have seen failures firsthand.

You know:

  • why Terraform plans sometimes lie
  • why ArgoCD syncs can fail for non-obvious reasons
  • why Kubernetes upgrades can ruin your week
  • why "works in staging" means very little
  • why multi-region failover diagrams often fail in production
  • why observability usually breaks exactly when needed most

You've solved these problems repeatedly and improved systems because of those lessons.


