Site Reliability Engineer W2 Role
Palo Alto, CA - USA
Job Summary
Role: Site Reliability Engineer (SRE)
Location: Palo Alto, CA (Onsite from Day 1)
Job Type: Contract (W2)
Skill Matrix:
| Name | Required |
| Programming | Yes |
| SRE | Yes |
| Grafana | Yes |
| Prometheus | Yes |
| AWS | Yes |
| Cloud Infrastructure | Yes |
| Linux | Yes |
| UNIX | Yes |
Top skills required for this role:
Programming: Proficiency in languages like Python, Java, or Go.
System Administration: Strong understanding of Linux/Unix systems.
Cloud Infrastructure: Experience with AWS.
Infrastructure as Code (IaC): Knowledge of tools like Terraform or Ansible.
Monitoring Tools: Proficiency with tools such as Prometheus, Grafana, or Datadog.
Job Description/Responsibilities:
Automation and Tooling: Writing code to automate operational tasks such as provisioning, configuration changes, and system updates, reducing manual work and human error.
System Monitoring and Alerting: Developing and maintaining observability stacks (logs, metrics, tracing) to proactively detect issues before they impact users.
Incident Response and On-Call: Managing a 24/7 on-call rotation to respond to, troubleshoot, and resolve production incidents.
Post-Incident Reviews (Postmortems): Conducting blameless, in-depth reviews of incidents to identify root causes and implement preventive measures.
Capacity Planning: Analyzing system resource utilization to ensure infrastructure can scale to handle future load requirements.
Performance Optimization: Identifying and fixing bottlenecks in software and infrastructure to improve system efficiency and responsiveness.
Error Budget Management: Setting and managing Service Level Objectives (SLOs) and Service Level Indicators (SLIs) to determine whether a service is reliable enough to allow new feature deployments.
Chaos Engineering: Testing system resilience by intentionally introducing failures to ensure systems are fault-tolerant.
Years of Experience: 8 years