If you're passionate about building a better future for individuals, communities, and our country, and you're committed to working hard to play your part in building that future, consider WGU as the next step in your career.
Driven by a mission to expand access to higher education through online, competency-based degree programs, WGU is also committed to being a great place to work for a diverse workforce of student-focused professionals. The university has pioneered a new way to learn in the 21st century, one that has received praise from academic, industry, government, and media leaders. Whatever your role, working for WGU gives you a part to play in helping students graduate, creating a better tomorrow for themselves and their families.
The salary range for this position takes into account the wide range of factors that are considered in making compensation decisions, including, but not limited to, skill sets; experience and training; licensure and certifications; and other business and organizational needs.
At WGU, it is not typical for an individual to be hired at or near the top of the range for their position, and compensation decisions are dependent on the facts and circumstances of each case. A reasonable estimate of the current range is:
Grade: Management Technical 715
Pay Range: $170,400.00 - $281,200.00
Job Description
The Senior Manager of Site Reliability Engineering (SRE) leads the function responsible for ensuring that critical systems and services are reliable, scalable, and resilient. The role combines technical leadership with organizational management, directing SRE teams in designing, implementing, and operating infrastructure that supports business needs. This position defines service reliability standards, drives incident response practices, oversees automation initiatives, and partners with other engineering and product teams to balance reliability with delivery velocity. This position's main objective is to improve reliability, performance, and operational efficiency to ensure our students and faculty are delighted with the fully online educational experience.
Primary Responsibilities
- Leads and mentors SRE teams, creating an environment that encourages ownership, collaboration, and continuous improvement.
- Establishes the SRE vision, goals, and operational strategies in alignment with organizational objectives.
- Defines reliability roadmaps and communicates priorities to engineering and executive stakeholders.
- Develops, drives, and supports Service Level Objectives (SLOs), Indicators (SLIs), and Agreements (SLAs) across systems.
- Directs incident management processes, including response coordination, root cause analysis, and follow-up actions.
- Implements practices that reduce downtime and ensure systems meet availability, scalability, and performance expectations.
- Drives adoption of infrastructure as code, CI/CD pipelines, and automated testing to improve operational efficiency.
- Oversees monitoring, alerting, and observability systems that provide insight into service health.
- Evaluates and implements emerging tools that enhance service reliability and reduce manual toil.
- Proactively collects and evaluates system and application data to improve the performance and reliability of the environment.
- Partners with software engineering, security, and product teams to integrate reliability into all development lifecycle phases.
- Provides senior leadership and other stakeholders with transparent reporting on reliability trends, risks, and improvement initiatives.
- Fosters a culture of blameless postmortems and shared accountability for uptime and performance.
- Promotes best practices for resilience, scalability, and disaster recovery.
- Regularly assesses and improves reliability processes and team workflows.
- Stays informed of evolving technologies and practices in SRE, DevOps, AI, Machine Learning, and cloud infrastructure.
- Performs other related duties as assigned.
This job description includes a general representation of job requirements rather than a comprehensive inventory of all required responsibilities or work activities. The contents of this document or related job requirements may change at any time, with or without notice.
Qualifications
Knowledge, Skills, and Abilities
- Strong understanding of distributed systems, cloud-native architectures, and infrastructure design.
- Deep familiarity with cloud service providers (AWS, GCP, Azure) and their reliability and security best practices.
- Knowledge of software development lifecycles, DevOps principles, and SRE practices such as SLOs, SLIs, and error budgets.
- Understanding of networking, storage, and systems performance concepts.
- Knowledge of compliance, data security, and regulatory requirements relevant to system reliability and operations.
Skills
- Technical proficiency in infrastructure as code, automation frameworks, and modern programming/scripting languages (Python, Go, Bash, etc.).
- Expertise in monitoring, logging, and observability platforms (Prometheus, Grafana, Datadog, Splunk, etc.).
- Skilled in incident management, root cause analysis, and postmortem processes.
- Strong leadership and people management skills, with experience developing and scaling technical teams.
- Effective communication skills, including the ability to explain technical concepts to both engineers and executives.
- Strong problem-solving, prioritization, and decision-making skills under pressure.
Abilities
- Ability to balance short-term operational needs with long-term reliability and scalability goals.
- Ability to foster a culture of reliability, accountability, and continuous improvement within technical teams.
- Ability to collaborate across engineering, product, and business teams to align reliability efforts with strategic goals.
- Ability to anticipate system weaknesses and proactively design resilience into infrastructure and applications.
- Ability to lead through influence, driving adoption of SRE practices across the organization.
- Ability to adapt to evolving technologies, industry practices, and organizational needs.
- Bachelor's or Master's degree in Computer Science, Engineering, or a related technical field, or equivalent professional experience.
Experience
- 8 years of experience in Software Engineering/Development with some knowledge of SRE.
- 3 years of experience managing or leading technical teams, preferably in a reliability- or infrastructure-focused capacity.
- Proven track record of delivering reliable, scalable systems in complex environments.
- Strong expertise with cloud platforms such as AWS, GCP, or Azure.
- Hands-on experience with Kubernetes, container orchestration, and microservices architectures.
- Proficiency with infrastructure as code and automation tools (Terraform, Ansible, Pulumi, etc.).
- Solid programming or scripting ability in Python, Go, Java, JavaScript, and/or Bash.
- Deep understanding of monitoring, logging, and observability systems (e.g., New Relic, Grafana, Datadog, Splunk, Dynatrace).
- Experience implementing and managing SLOs, SLIs, and SLAs to measure and improve service reliability.
Leadership Qualifications
- Demonstrated ability to build, mentor, and lead high-performing engineering teams.
- Strong communication skills, with the ability to engage technical teams and executive leadership.
- Ability to balance immediate operational demands with long-term reliability strategy.
- Experience fostering a blameless culture of incident management and continuous improvement.
- Strategic mindset with the ability to align technical priorities to business goals.
At WGU, it is not typical for an individual to be hired at or near the top of the range for their position, and compensation decisions are dependent on the facts and circumstances of each case. A reasonable estimate of the current range is:
Pay Range: $170,400.00 - $281,200.00
Experience in lieu of education
An equivalent combination of training, experience, credentials, or accomplishments demonstrating the ability to perform the essential functions of this job may substitute for education degree requirements.
Position & Application Details
Full-Time Regular Positions (classified as regular and working 40 standard weekly hours): This is a full-time, regular position (classified for 40 standard weekly hours) that is eligible for bonuses; medical, dental, vision, telehealth, and mental healthcare; health savings account and flexible spending account; basic and voluntary life insurance; disability coverage; accident, critical illness, and hospital indemnity supplemental coverages; legal and identity theft coverage; retirement savings plan; wellbeing program; discounted WGU tuition; and flexible paid time off for rest and relaxation with no need for accrual, flexible paid sick time with no need for accrual, 11 paid holidays, and other paid leaves, including up to 12 weeks of parental leave.
How to Apply: If you are interested, submit an application online. Internal WGU employees need to apply through the internal job board in Workday.
Additional Information
Disclaimer: The job posting highlights the most critical responsibilities and requirements of the job. It's not all-inclusive.
Accommodations: Applicants with disabilities who require assistance or accommodation during the application or interview process should contact our Talent Acquisition team at
Equal Employment Opportunity: All qualified applicants will receive consideration for employment without regard to any protected characteristic as required by law.