Lead Site Reliability Engineer (ServiceNow Platform)
What you get to do in this role:
As the Lead Site Reliability Engineer (SRE), you will spearhead the design and implementation of observability and reliability strategies across our ServiceNow platform and integrated third-party systems. You'll lead the charge in establishing and maturing telemetry frameworks, ensuring the visibility of golden signals (latency, traffic, errors, and saturation) to drive proactive performance and availability management.
This role is both strategic and hands-on. You will mentor other engineers, collaborate with cross-functional teams, and influence platform-wide improvements. Your work will directly enhance system resilience, user experience, and operational excellence.
Key Responsibilities:
- Architect and implement telemetry and observability frameworks across ServiceNow and its ecosystem.
- Define and monitor golden signals to drive proactive SRE practices.
- Lead incident and problem management reviews, ensuring data-driven root cause analysis and continuous improvement.
- Collaborate with development, support, and infrastructure teams to implement self-healing, auto-remediation, and resiliency patterns.
- Develop and mature dashboards and real-time alerts using tools like the ServiceNow Platform along with Datadog, Splunk, or Grafana.
- Drive automation for reliability checks, capacity planning, and environment health.
- Establish and promote SRE best practices, playbooks, and operational readiness standards across product teams.
- Represent SRE in architectural reviews and platform governance meetings.
- Mentor junior engineers, foster a learning culture, and ensure adoption of reliability-first principles.
Qualifications:
- Bachelor's or Master's degree in Computer Science, Engineering, or a related technical field.
- 10 years of IT experience, with 5 years in SRE or production engineering and 2 years in a lead or principal role.
- Proven experience managing observability, telemetry, and incident response frameworks at scale.
- Deep understanding of ITIL-aligned processes (Incident, Problem, Change).
- Strong leadership and collaboration skills, with the ability to influence across engineering and business teams.
- Excellent verbal and written communication, especially in articulating technical decisions to business stakeholders.
Technical Requirements:
- Strong experience with monitoring tools such as Datadog, Splunk, Prometheus, Grafana, or equivalents.
- Proficient in ServiceNow platform administration, performance tuning, and API integrations.
- Solid command of Unix/Linux internals, system performance tuning, and network troubleshooting.
- Proficient in one or more scripting languages: Python, Shell, JavaScript.
- Hands-on experience with Kubernetes, containers, and CI/CD pipelines.
- Deep understanding of HTTP/S, DNS, SSL/TLS, and other web protocols.
- Familiarity with cloud platforms (AWS, Azure, or GCP); certifications preferred.
Preferred (Nice to Have):
- Experience with ServiceNow ITOM modules like Event Management, AIOps, and Discovery.
- Knowledge of AI/ML-based anomaly detection and alerting strategies.
- Experience with infrastructure-as-code using tools like Ansible or Terraform.
- Familiarity with performance profiling and diagnostics of complex applications.
- Previous success in establishing SRE teams or practices from the ground up.
Datadog, Splunk, ServiceNow Platform Administration, Kubernetes