Senior Customer Reliability Engineer (CRE)

Employer Active

Job Location: USA

Monthly Salary: Not Disclosed

Vacancies: 1

Job Description

The Opportunity

This is not a traditional operations role. You will inherit a set of critical manual, hands-on operational responsibilities that are essential to our customers' success. We need you to help lead the effort to systematically dismantle this operational burden through automation, tooling, and systems. You will work with a collaborative team of excellent engineers, and a counterpart in the same role, on both the manual toil and the systems we need to engineer.

The short-term needs are manual deployments, reactive troubleshooting, and on-call escalations. But we need you to help us build a system in which programmatic solutions have replaced human intervention. You must have the pragmatism to manage the current reality and the systematic impatience to build its replacement.

Success in this role requires a dual mindset. You must be a skilled incident leader who can stabilize a crisis and a deliberate systems architect who can prevent the next one. You will work closely with our internal tools, platform, and product engineering teams to channel your direct operational knowledge into durable, long-term solutions.

What You'll Do

Your work will follow a deliberate trajectory from reactive execution to proactive design.

Phase 1: Stabilize and Map (First 3-6 Months). You will embed with the team, taking ownership of the existing operational workload alongside product engineers and the other customer SRE, who covers the India time zone. This workload includes customer deployments, upgrades, and incident response. Your initial goal is to achieve stability while mapping the landscape of our operational toil.

Phase 2: Automate and Influence (Months 6-18). Armed with your map of toil, you will begin to automate. You will write code, build tooling, and deploy declarative infrastructure to eliminate the most critical operational burdens. For larger projects, you will act as a primary stakeholder, providing clear requirements to our internal tooling and platform teams and ensuring their solutions meet the operational need. Your success will be measured by a demonstrable reduction in the overall support effort: fewer pages, support escalations, and manual tasks.

Phase 3: Architect and Evangelize (Year 2). With the most acute operational pains addressed, your focus will shift to architectural concerns. You will define and implement Service Level Objectives (SLOs), influence the design of new products for operability, and help instill SRE principles throughout the engineering organization.


Qualifications

  • DevOps and SRE Proficiency
    • You must have a strong background in Site Reliability Engineering or a closely related DevOps function. You must also have a strong command of Linux systems administration and an understanding of networking fundamentals (TCP/IP, DNS, routing).
  • Customer-Facing Experience
    • You must have experience working directly with external customers to solve difficult technical problems. Your communication must be clear, empathetic, and precise.
  • Cloud Infrastructure Expertise
    • You need production experience with a major cloud provider, preferably AWS. You should be proficient in its core concepts and services (VPC, EC2, IAM, S3) and have experience building and managing infrastructure as code with tools like Terraform.
  • Monitoring and Observability
    • You will be responsible for both building and using our observability stack. This requires hands-on experience instrumenting applications and managing the telemetry pipelines for metrics, logs, and traces.
    • A core part of the role is then applying this data to debug complex production incidents, understand system behavior, and define SLOs.
  • Automation and Software Development
    • You must be proficient in writing code to automate operational tasks. Expertise in a high-level language like Python or Go is required, as are strong shell scripting skills (e.g., Bash). We have a diverse tech stack, including Python, Scala, C, Haskell, Rust, PureScript, etc., which requires experience monitoring and debugging a complex system using system tools, command-line utilities, and network debugging tools, as well as filtering complex logs.

Preferred Skills

  • Proficiency with Kafka, Postgres, nginx, systemd, etc. is a plus
    • We use this software extensively in the product in customer environments. Experience here is not required, but it is a plus.
  • Proficiency in Nix and NixOS is a plus
    • We use Nix/NixOS extensively, so knowing them helps, but they will not play a large role in your initial responsibilities. We'll train you on the job if you've never used Nix before.
  • Exposure to or proficiency in functional programming languages and paradigms is a plus
    • We value functional programming-oriented principles (compositionality, immutability, etc.). You are not required to know functional languages, but some exposure is a plus, as is a willingness to learn.

Values

  • We value compassion
    • We believe our mission is one of service to others, whether that is protecting our customers from harm or empowering other developers to do work they are proud of.
  • We value humility
    • Humility matters to everyone on the engineering team in Arista NDR, and we accept the sobering reality that we, as humans, make mistakes, forget things easily, and have over-inflated confidence in our grasp of complexity. We value humility because we think it leads to better solutions (social or technological) and better understanding.
  • We value reliability
    • We believe that software (any form of automation, really) should free people to do their most creative work. Therefore, we value low-maintenance software and the technology that empowers us to write it.

This is a hybrid work environment where office presence may be required 1-2 days a week.


Additional Information

Arista Networks is an equal opportunity employer. Arista makes all hiring and employment-related decisions in a non-discriminatory manner, without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, or any other factor determined to be unlawful under applicable federal, state, or local law. All your information will be kept confidential according to EEO guidelines.


Remote Work: Yes

Employment Type: Full-time, Remote

Department / Functional Area: Software Engineering

Key Skills

  • Kubernetes
  • FMEA
  • Continuous Improvement
  • Elasticsearch
  • Go
  • Root cause Analysis
  • Maximo
  • CMMS
  • Maintenance
  • Mechanical Engineering
  • Manufacturing
  • Troubleshooting
