The Principal Site Reliability Engineer (Principal SRE) plays a pivotal role in ensuring the seamless and reliable operation of an organization's digital infrastructure. This highly technical position will enhance the performance, scalability, and reliability of the organization's complex systems and applications. It will reduce time to detect and restore systems, increase uptime, and improve incident response by utilizing best practices in automation, monitoring, and incident management. This role requires a deep understanding of Cloud technologies, Distributed Systems, Automation / Scripting, Observability, Software Engineering, and DevOps, and will take a proactive approach to preventing and mitigating potential issues. This role will report to the Director of Site Reliability Engineering and will help foster a culture of innovation, continuous improvement, and collaboration within the team to meet the organization's evolving needs and deliver a superior digital experience to users.
This is a Remote position available in the United States.
Bright Horizons is trusted by families and employers around the world for high-quality childcare and early education, backup care, and workplace education. We partner with some of the world's best companies to provide services that help employees perform their best and support families to thrive both personally and professionally.
Reliability and Scalability: Contribute significantly to the reliability, scalability, and availability of Bright Horizons' digital infrastructure by enforcing best practices of redundancy and resiliency across applications and infrastructure.
Observability: Implement robust infrastructure, application, and digital-experience monitoring in our enterprise-wide APM tool, Dynatrace. Proactively identify potential issues, analyze system performance, and facilitate quick response to incidents. Create dashboards, alerts, and automated workflows that can be utilized by other Operations or Application teams.
Incident Management: Drive troubleshooting of critical incidents through developing a deep and broad understanding of our enterprise architecture across all 7 OSI layers. Utilize monitoring and alerting to ensure timely incident resolution. Track KPIs like MTTD/MTTR and identify short-term and medium-term opportunities to improve. Conduct postmortems to identify root cause and implement preventive measures.
Automation and Efficiency: Drive the development and implementation of automation solutions to streamline processes, reduce manual interventions, and enhance the overall efficiency of the Product Engineering and SRE teams.
Tools Ownership: Besides owning Observability tools, create a roadmap to expand and consolidate the toolset. This should provide a 360-degree view of cross-functional areas like SRE, DevOps, Application Support, Monitoring, Incident Management, Infrastructure, and Enterprise Architecture.
Collaboration: Collaborate with the above cross-functional teams to drive a unified approach to site reliability that optimizes their work and improves time-to-market for all respective objectives. Foster strong relationships with these delivery organizations to implement an SRE culture that delivers organizational goals.
Infrastructure Roadmap and System Capacity Planning: Work closely with Infrastructure and Architecture teams to design and implement roadmaps for scaling server and serverless architecture using Containers as well as IaC tools like Ansible, Terraform, etc. Conduct disaster recovery and controlled failure testing to improve resiliency. Conduct capacity planning to handle current and future demand.
Bachelor's degree in Computer Science, Engineering, or related field (Required)
A minimum of 10 years of experience, including at least 5 years in the SRE field, with a proven track record of progressively increasing responsibilities
Master's degree in Computer Science, Engineering, or related field (Preferred)
Demonstrated ability to work with cross-functional Development, QE, and Operations teams to understand the underlying architecture and help improve its reliability and scalability.
Strong understanding and experience in automation tools and programming/scripting languages (e.g., PowerShell, Python, Bash) to deliver improvements at a small and large scale.
Strong understanding of Observability tools (e.g., Dynatrace, Datadog, New Relic, etc.) and best practices to implement effective monitoring of SLIs/SLOs/SLAs.
Strong experience and understanding of software engineering, Infrastructure as Code (Ansible or Terraform), and build/deployment pipelines.
Strong troubleshooting skills, coupled with making data-driven decisions during incidents to improve time to detect and resolve issues.
Strong understanding of cloud computing platforms (Azure or Google Cloud) and cloud-native setups (AKS, serverless, etc.).
A can-do attitude is necessary, combined with a deep belief that everything can be automated and systems must always be functional.
Preference may be given to candidates with relevant certifications demonstrating cloud and reliability engineering expertise.
At this time Bright Horizons will not sponsor an applicant for employment authorization/visa for this position.
Compensation:
The annual salary for this position is between $150,000 and $160,000. The pay range listed here is what Bright Horizons in good faith anticipates offering for this job opening. Actual compensation offers within this range will depend on a variety of factors, including experience, education and training, certifications, geography, and other relevant business or organizational factors. This role is also eligible for a 5% annual bonus.
Benefits:
Bright Horizons offers the following benefits for this position, subject to applicable eligibility requirements:
Medical, dental, and vision insurance
401(k) retirement plan
Life insurance
Long-term and short-term disability insurance
Also, depending on hire date and subject to applicable eligibility requirements and accrual schedules, new employees in this role receive up to: 9 paid holidays annually; 40 hours of sick time per year based on full-time schedule; and 120 hours of vacation time per year based on full-time schedule (vacation time may be used for sick leave purposes under any applicable state or local sick or safe time law).
Deadline to Apply:
This posting is anticipated to remain open until 5/16/2025.
Our people are the heart of our company. Because we're as committed to our own employees as we are to the children, families, and clients we serve, our collaborative workplaces are designed to grow careers and support personal lives. Come build a brighter future with us.
Bright Horizons provides equal opportunity in all aspects of employment and does not discriminate against any individual on the basis of race, color, religion, sex, age, disability, sexual orientation, veteran status, national origin, genetic information, or any other characteristic protected under federal, state, or local law. Bright Horizons complies with the laws and regulations described in the following federal government resources: Know Your Rights, Family and Medical Leave Act (FMLA), and Employee Polygraph Protection Act (EPPA).
If you require assistance or a reasonable accommodation in completing these application materials or any aspect of the application and hiring process, please contact the recruitment helpdesk. Determinations on requests for reasonable accommodation will be made on a case-by-case basis.
Required Experience:
Staff IC
Full-Time