Senior Lead Site Reliability Engineer

JPMorganChase

Job Location:

Glasgow - UK

Monthly Salary: Not Disclosed

Posted on: 13 hours ago

Vacancies: 1 Vacancy

Job Summary

Description

Be an integral part of an agile team that's constantly pushing the envelope to enhance, build, and deliver top-notch reliability and observability for our most critical platforms.

As a Senior Lead Site Reliability Engineer at JPMorgan Chase within the Commercial & Investment Bank, you are an integral part of an agile team that works to enhance, build, and deliver trusted, market-leading technology products in a secure, stable, and scalable way. Drive significant business impact through your capabilities and contributions, and apply deep technical expertise and problem-solving methodologies to tackle a diverse array of reliability, observability, and performance challenges that span multiple technologies and applications.

Job responsibilities

Regularly provides technical guidance and direction on site reliability practices to support the business and its technical teams, contractors, and vendors
Develops secure and high-quality production code for reliability tooling and telemetry pipelines, and reviews and debugs code written by others
Drives decisions that influence reliability design, observability architecture, application functionality, and technical operations and processes
Serves as a function-wide subject matter expert in one or more areas of site reliability, observability, or telemetry engineering
Leads resiliency design reviews and breaks up complex reliability problems into digestible work for other engineers, acting as a technical lead for large-sized products
Acts as the main point of contact during major incidents, demonstrating the skills to identify and solve issues quickly to avoid financial losses, and champions a blameless postmortem culture
Collaborates with team members and stakeholders to define comprehensive service level indicators, service level objectives, and error budgets
Designs, implements, and maintains operational reliability for large-scale OpenTelemetry pipelines on hybrid on-prem/cloud environments, supporting telemetry ingestion, processing, and export to backends such as InfluxDB, Prometheus, Elasticsearch, and OpenSearch
Drives the assessment, refactoring, and incremental migration of custom legacy telemetry collection code to standardized OpenTelemetry instrumentation, reducing technical debt while maintaining system stability
Actively contributes to the engineering community as an advocate of firmwide frameworks, tools, and practices, and influences peers and project decision-makers to consider the use and application of leading-edge observability and reliability technologies
Adds to the team culture of diversity, opportunity, inclusion, and respect

Required qualifications, capabilities, and skills

Formal training or certification on software engineering concepts and advanced applied experience delivering system design, application development, testing, and operational stability
Advanced knowledge of reliability, scalability, performance, security, enterprise system architecture, toil reduction, and other site reliability best practices, with considerable in-depth knowledge in one or more technical disciplines (e.g. cloud, observability, distributed systems, etc.)
Advanced proficiency in one or more programming languages (e.g. Java, Python, Go, etc.)
Advanced proficiency and experience in observability, such as white and black box monitoring, SLO alerting, and telemetry collection, using tools such as Grafana, Dynatrace, Prometheus, Datadog, Splunk, Elasticsearch, etc.
Proficiency in continuous integration and continuous delivery tools (e.g. Jenkins, GitLab, Terraform, etc.)
Experience with container and container orchestration (e.g. ECS, Kubernetes, Docker, etc.)
Hands-on experience with the design, deployment, and operation of OpenTelemetry collectors in production environments, focusing on technical aspects such as configuring, optimizing, and troubleshooting OTLP endpoints and receivers
Ability to tackle reliability design and functionality problems independently with little to no oversight
Practical cloud native experience
Ability to expand and collaborate across different levels and stakeholder groups

Preferred qualifications, capabilities, and skills

Knowledge of distributed tracing, metrics, and logging best practices
Certification in AWS, Kubernetes, or relevant technologies
Proven track record in system health monitoring, capacity management, and blameless postmortems for high-availability services
Deep understanding of distributed system design principles, networking (TCP/IP, DNS, load balancing), and Linux internals
Contributions to open-source observability or telemetry projects
Experience working with agent control planes and management protocols; hands-on knowledge of OpAMP is highly desirable

Required Experience:

Senior IC

About Company

JPMorganChase

JPMorganChase, one of the oldest financial institutions, offers innovative financial solutions to millions of consumers, small businesses and many of the world’s most prominent corporate, institutional and government clients under the J.P. Morgan and Chase brands. Our history spans ov ...
