Epic Principal Site Reliability Engineer

Quest Diagnostics

Job Location:

Secaucus, NJ - USA

Monthly Salary: $150,000 - $170,000

Posted on: 30+ days ago

Vacancies: 1 Vacancy

Job Summary

Description

As a Principal Site Reliability Engineer, you will be responsible for building an SRE practice along with monitoring and performance engineering best practices, which will be aligned to our agile teams to help drive the availability, resiliency, and stability of Quest products, platforms, and services.

You are an engineering technical leader who has a passion for reliability and a wide breadth of experience. Ideally, you will have had experience as a Site Reliability and Observability Engineer, where you made significant improvements to products, services, platforms, and the customer experience. You will also partner with architecture, engineers, security, and operations to design and build reusable patterns for deploying reliable and resilient solutions.

You will also be responsible for attracting, retaining, and growing top SRE engineering talent, providing guidance and mentorship to team members.

You will bring empathy, humility, and a continuous learning mindset to every interaction. You are motivated to innovate and create, to always do the right thing, and to improve both what we build and how we build it.

Pay Range: $00 plus yearly bonus (New Jersey)

Salary offers are based on a wide range of factors, including relevant skills, training, experience, education, and, where applicable, certifications obtained. Market and organizational factors are also considered. Successful candidates may be eligible to receive annual performance bonus compensation.

Remote: This position, which supports Epic, can be 100% remote if the candidate is not located near a hub location, subject to certain criteria.

Benefits Information: We are proud to offer best-in-class benefits and programs to support employees and their families in living healthy, happy lives. Our pay and benefit plans have been designed to promote employee health in all respects: physical, financial, and developmental. Depending on whether it is a part-time or full-time position, some of the benefits offered may include:

Day 1 medical, supplemental health, dental & vision for FT employees who work 30 hours
Best-in-class well-being programs
Annual no-cost health assessment program
Blueprint for Wellness
healthyMINDS mental health program
Vacation and Health/Flex Time
6 Holidays plus 1 MyDay off
FinFit financial coaching and services
401(k) pre-tax and/or Roth IRA with company match up to 5% after 12 months of service
Employee stock purchase plan
Life and disability insurance plus buy-up option
Flexible Spending Accounts
Annual incentive plans
Matching gifts program
Education assistance through MyQuest for Education
Career advancement opportunities
and so much more!

Responsibilities

Experience in transforming an organization by designing and implementing SRE capabilities, including monitoring, performance, and chaos engineering. You will set the strategy for overall Site Reliability Engineering (SRE)/Development alignment.
Lead initiatives to implement service levels (SLIs, SLOs, SLAs) and error budgets. You will initiate, influence, and drive SRE within the organization and work with product and service teams to enable this model.
Provides guidelines/patterns and establishes proper metrics for building highly scalable, reliable, high-performing systems.
Strategizes best-in-class monitoring frameworks to accomplish end-to-end flow monitoring and meaningful alerting.
Coaches and mentors teams of monitoring, performance, and SRE engineers.
Proven ability to implement processes, solutions, and engineering capabilities at scale.
Prior experience in large-scale digital technologies where uptime and continuous availability were core to the business.
Strong acumen in public cloud and/or private cloud implementation and application adoption.
Strong understanding of Cloud, API, Event-Driven, and Microservices technologies for large-scale environments.
Influences other leaders, principals, and engineers, opening the discussion on and adoption of SRE best practices.
Builds relationships with other leaders and groups across the company, providing understanding of SRE concepts and value.
Work with other team leads to identify improvements outside of SRE, e.g., DevOps, Quality, etc.
Partners with the Director of SRE to build platform roadmaps and frameworks and to identify team/process improvements.
Technical owner of SRE tools with expertise and understanding of current and other widely used industry tools.
Evaluates other tools/solutions for SRE to ensure IT is being cost-aware and tool-agnostic.

Qualifications

Required Work Experience:

10 years of experience in developing enterprise software and proficiency in multiple languages, e.g., Java, and web technologies (Python, Go, Perl, Ruby, or shell scripting)
5 years in implementing SRE solutions/practices.
5 years in mentoring and coaching.
Expert knowledge of Dynatrace as product owner, user, and
Expert with a proven track record in delivering technology solutions and leading a high-performing SRE team in automating manual work.
Expert knowledge of reliability and production management domains
Experience in public cloud environments (AWS/Azure/Google Cloud).
Experience in leading operations leveraging key event streaming, messaging, and DB services, e.g., Cassandra, MQ/JMS/Kafka, Aurora, RDS, Cloud SQL, BigTable, DynamoDB, Cloud Spanner, Kinesis, Cloud Pub/Sub, etc.
Experience in either SAFe Agile, Scrum, or Kanban models
Expertise in DevSecOps practices and tools, e.g., CI/CD, GitLab, and any security scanning tools.
Experience with cloud-based technologies and tools, especially in deployment, monitoring, and operations
Strong experience and technical skills in developing/managing APIs and Microservices
Expert practitioner in multiple technology domains; may be a cross-domain expert able to solve complex and mission-critical problems within a business or across the firm

Preferred Work Experience:

Experience with containerization (Docker, Kubernetes)
Experience with Terraform and Ansible
Experience with SIEM
Experience with other APM tools
Healthcare industry experience

Physical and Mental Requirements:

Ability to sit for long periods of time

Knowledge:

Compliance requirements, e.g., NIST, CFR21, ISO, GDPR, HIPAA, SOX
HL7 specifications
Integration Platform technologies (MuleSoft, Informatica, SnapLogic, Jitterbit, etc.)

Skills:

Self-driven
Problem solving
Adaptable
Negotiation
Prioritization

Education

Bachelor's Degree: Bachelor's in computer engineering or a similar field, or equivalent work experience (Required)
Master's Degree: Master's in computer engineering (Preferred)

Languages

English (Preferred)

Licenses and Certifications

AWS (Preferred)
Azure (Preferred)
GCP (Preferred)

Work Requirements

Travel Required up to 30%

Required Experience:

Staff IC
