Role Definition
This is a developed professional role for an AWS-focused SRE. Individuals are responsible for basic reliability and toil-reduction projects. At this level, SREs can observe the performance of a system and configure proactive alerting to protect service levels. SREs are ready to join the on-call rotation. They can participate in disaster recovery tests in production environments. They may train new team members.
Scope and Key Responsibilities
Creates monitoring queries and establishes service level baselines
Supports senior engineers during incidents
Makes contributions during postmortems and RCAs
Participates in disaster recovery testing
Implements automation and executes code in production environments
Contributes to SRE knowledge documentation
Top 3 must-have skills:
a.) AWS
b.) Windows/Linux
c.) Troubleshooting
Technical Skills
Observability: Level 3
Able to create proactive alert rules that detect conditions that are urgent and actionable so that alerts page support teams before users are impacted.
Can create and configure browser agents to monitor the performance of apps, including user satisfaction, JavaScript errors, session performance, and Core Web Vitals.
Can create complex synthetic transactions that include scripts to simulate user flows and functionality from the browser or API endpoints.
Able to create advanced Application Performance Monitoring (APM) and browser distributed traces that give insight into application performance.
Able to recommend and create Service Level Objectives using the Golden Signals: latency, traffic, errors, and saturation.
Incident Management: Level 3
Able to create and/or present RCAs, including the executive summary, timeline, detailed impact statement, follow-on actions, and residual risks.
Can lead scenario modelling exercises and the creation of workflows that are triggered by an SLO breach.
Able to participate in the on-call rotation and provide on-call support for other SREs.
Can write advanced automation scripts for incident response including failovers and rollbacks.
Design for Reliability: Level 3
Can make theoretical performance (latency, traffic) and capacity recommendations based on customer demand and growth estimates.
Has good knowledge of DevOps practices, including monitoring, virtual networks, cloud storage, containers and orchestration, CI/CD, configuration management, and securing cloud applications.
Disaster Recovery: Level 3
Capable of participating in on-call support to assist in the recovery from Major Incidents (for production environments).
Can test system and component failover within and between geographic regions (for production environments).
Able to automate the recovery of systems and components using Infrastructure-as-Code and Configuration Management scripts.
Platforms and Automation: Level 3
Able to identify opportunities to improve the developer experience by leveraging observability tools, paved-road components, shared services, and self-service portals.
Able to improve software delivery performance by recommending and/or implementing automated build and release processes and removing manual tasks.
Able to maintain and secure cloud environments in a way that does not negatively impact software delivery performance.
Reliability Culture: Level 3
Can contribute to SRE knowledge base articles and training material.
Able to analyze toil by looking at ticket trends and recommend focus areas for the team.
Can independently work on small toil elimination projects.
Behavioral Competencies
Collaboration and Teamwork
Customer & External Focus
Solves Problems and Analyses Issues
Learning Agility
Full Time