Who You'll Work With
SREs at Arista combine strong software and systems engineering with a passion for operating production systems at scale. As an SRE, you'll be part of the team responsible for our global service fleet.
What You'll Do:
CloudVision is deployed on Kubernetes across global regions, using Spinnaker for our CI/CD pipeline. Our tech stack runs on GKE, using HBase/Hadoop as the main distributed database and storage layer, Elasticsearch for powering search data, ClickHouse for fast real-time queries of flow data, our own Kafka-based distributed real-time stream processing layer for analytics, and TensorFlow for ML analysis. Our monitoring system is built on top of Prometheus, Grafana, Loki, and other OSS tools.
As a Senior SRE, you'll be responsible for our global CloudVision service fleet. This includes:
- Build, deploy (safely and incrementally), and operate critical production systems with a focus on scalability, reliability, observability, performance, and security.
- Monitor, support, and enhance the product deployment experience across services.
- Build automation to remove toil and efficiently operate production systems.
- Proactively monitor, respond to, and enhance alerts, and set up automated alert handling.
- Create and maintain the incident response runbooks.
- Build and deploy new systems with scalability, reliability, and observability as primary requirements.
- Triage platform and infrastructure issues, and help Arista software engineers with their triage. Engage with third-party vendor support.
- Deploy new systems in a staged manner.
- Write postmortem documents and build solutions to prevent incidents from recurring.
- Plan and communicate maintenance windows on production systems.
- Work with Arista's product development teams to identify infrastructure issues that are causing bottlenecks and limitations in their workflows. Design and implement solutions to resolve them.
- Survey and adopt best practices around infrastructure and platform to maintain secure, scalable, and fault-tolerant systems.
- Implement solutions to scale the systems.
- Implement fault tolerance and performance improvements to increase the availability of the systems.
- Study the design of OSS systems, and enough of their implementation details, to triage and fix issues more effectively.
Qualifications:
- Bachelor's in Computer Science or Engineering with 5 years of experience, MS in Computer Science or Engineering with 5 years of experience, or equivalent work experience.
- Knowledge of one or more of Go, Python, or bash/shell scripting, sufficient to implement medium-complexity automation workflows.
- Knowledge of Linux (or UNIX) from an administration and debugging perspective.
- Hands-on experience in operating software systems (infrastructure, complex applications, etc.) at scale.
- Experience in server provisioning (especially from a storage and networking perspective).
- Strong problem-solving and software troubleshooting skills.
- Experience with infrastructure-as-code.
- Desirable to have one or more of the following skills:
- Experience managing databases, e.g. PostgreSQL or an equivalent RDBMS.
- Experience with Docker and virtualization technologies.
- Experience managing a monitoring stack (Prometheus, Grafana, etc.).
- Experience managing Artifactory, Docker registry, etc.
- Experience managing CI/CD systems like GitLab tools, Spinnaker, etc.
- Experience with infrastructure-as-code frameworks like Terraform.
- Experience with container orchestration via Kubernetes.
Remote Work:
Yes
Employment Type:
Full-time