Our Opportunity:
The Site Reliability Engineering Observability Team is a cross-functional group of systems and software engineers responsible for the operational aspects of Chewy's e-commerce platform. The team designs, builds, and maintains Chewy's observability platform, covering metrics, logging, and tracing, while also supporting the infrastructure behind both internet-facing and internal services.
We're looking for engineers who want to contribute to developing infrastructure software, maintaining it, and scaling Chewy's technology stack. Come help us build a bigger and better Chewy as a Site Reliability Engineer! You will be part of a small team with a huge impact on our incredible growth. Ideal candidates can clearly communicate complex technical concepts to diverse audiences across the organization. They remain calm under pressure and bring structure to high-pressure, fast-paced tasks and projects.
What You'll Do:
- Experience coding in one or more programming languages (e.g., Java or Python) with a solid foundation in software design
- Hands-on experience with OpenTelemetry Collector, Datadog, and Dynatrace integrations, and familiarity with creating metrics using StatsD and Prometheus
- Experience with Fluent Bit/Fluentd log pipelines, ensuring scalable and reliable log processing
- Experience working on Jenkins release processes and running applications on Kubernetes under the guidance of senior engineers
- Strong understanding of monitoring, logging, and tracing data to improve engineering teams' ability to optimize customer-facing services
- Identify requirements for other operational teams (release engineering, automation, etc.) during the application development phase
- Act as a technology and DevOps engineer to improve automation across the observability platforms
- Participate in the on-call rotation for Level 1-2 support of critical issues
What You'll Need:
- Bachelor's degree and 4 years of experience, or master's degree and 2 years of experience
- Hands-on coding experience in Java and infrastructure scripting (e.g., Terraform) for automation and observability enhancements
- Minimum 2 years of experience building and managing applications on public cloud platforms such as AWS (preferred) or GCP
- Experience working with the open-source community (e.g., troubleshooting, patch submission)
- Strong ability to organize, troubleshoot, and continuously learn
Bonus (if applicable):
- Deep expertise in Datadog, Dynatrace, Splunk, or any open-source observability ecosystem