Staff Engineer, Site Reliability

Job Location:

Mountain View, CA - USA

Monthly Salary: Not Disclosed

Posted on: 30+ days ago

Vacancies: 1 Vacancy

Department:

Engineering

Job Summary

Site Health Platform sits at the core of LinkedIn's Reliability Infrastructure organization, with a primary focus on the end-to-end incident management ecosystem. Our mission is for every member and customer to experience LinkedIn as always on, every engineer to benefit from a more insightful and proactive site-wide reliability ecosystem, and every business and product owner to be well-informed about service disruptions as they occur.

We own the full incident lifecycle across thousands of services and multiple regions, from incident response and mitigation through problem management and post-incident learning. The platforms we build are the backbone of how LinkedIn detects issues, coordinates incident response, captures context, and turns outages and near misses into structured, actionable insights.

By transforming incidents into data and learnings, we enable teams to systematically improve reliability over time. Our work informs engineering priorities, infrastructure investments, capacity planning, and executive decision-making, ensuring the network is dependable when it matters most.

You will be exposed to many different technologies, architectures, and systems hosted in state-of-the-art data centers across the globe.

At LinkedIn, our approach to flexible work is centered on trust and optimized for culture, connection, clarity, and the evolving needs of our business. The work location of this role is hybrid, meaning it will be performed both from home and from a LinkedIn office on select days, as determined by the business needs of the team.

Qualifications :

Responsibilities:

Designing and evolving the core incident management platforms that power LinkedIn's full incident lifecycle, from detection and response to problem management and prevention, across thousands of services and teams.
Serving in a critical on-call rotation, providing expert incident triage and coordination during high-severity outages. Partnering closely with service owners and product teams to diagnose issues quickly, mitigate member impact, and drive timely resolution under pressure.
Transforming raw, unstructured incident data into clear, actionable intelligence using AI and LLM-based systems, including automated summarization, classification, root cause signals, and mitigation recommendations.
Building analytics and insights that surface systemic reliability risks, recurring failure patterns, and cross-service dependencies, enabling org-level prioritization rather than isolated service-by-service fixes.
Building platforms and tools that enable realistic, fleet-wide stress testing of data center and regional capacity, validating incident readiness across dependencies, traffic patterns, and growth scenarios before they cause a significant production outage.
Driving consistency, clarity, and quality in how incidents are declared, managed, reviewed, and learned from, raising the reliability bar across a large, fast-moving engineering organization.
Influencing service architecture, SLOs, and reliability standards through platforms, data, and technical leadership, ensuring improvements are durable, measurable, and adopted at scale.

Basic Qualifications:

Bachelor's degree in Computer Science, Engineering, or related technical field, or equivalent practical experience. Many postings also prefer or require an advanced degree (MS/PhD) for Staff-level roles.
6 years of professional experience in software development, distributed systems, or reliability engineering. Some Principal/Staff roles list around 10 years of experience.
Several years of experience leading technical projects or providing architectural leadership (often 3-4 years).
Software engineering fundamentals with deep experience in building products and operating large-scale distributed systems.
Expertise in two or more backend languages, such as Go, Python, or Java, with a track record of owning complex production systems.
Full-stack engineering experience, including building user-facing web applications and operational dashboards using modern frontend frameworks, along with backend APIs and data pipelines.
Understanding of web development fundamentals, including API design, performance, accessibility, and building intuitive interfaces for engineers and operational users.
Understanding of reliability engineering principles, incident management, observability, and operating systems under failure conditions.
Demonstrated ability to lead technical design across teams, influence architecture beyond direct ownership, and drive adoption through well-designed platforms.
Strong debugging and root cause analysis skills, with the ability to communicate complex technical findings clearly to engineers, partners, and leadership.

Preferred Qualifications:

Bachelor's degree in Computer Science, Engineering, or related technical field, or equivalent practical experience. Many postings also prefer or require an advanced degree (MS/PhD) for Staff-level roles.
8 years of professional experience in software development, distributed systems, or reliability engineering. Some Principal/Staff roles list around 10 years of experience.
Several years of experience leading technical projects or providing architectural leadership (often 3-4 years).
Experience applying AI or LLM-based techniques to operational or incident data, including automated summarization, classification, root cause hypothesis generation, or reliability recommendations.
Familiarity with vector databases and retrieval-based systems used to power context-aware analytics, search, or agentic workflows.
Frontend craftsmanship beyond basic UI, including building data-dense, high-signal interfaces for engineers using modern state management and visualization libraries.
Experience designing end-to-end full-stack systems where frontend, backend, data, and reliability concerns are considered holistically.
Background in building internal developer platforms, observability tools, or incident response systems used at scale.
A demonstrated ability to simplify complex workflows, reduce operational toil, and replace manual processes with well-designed automation.

Additional Information :

You Will Benefit from Our Culture

We strongly believe in the well-being of our employees and their families. That is why we offer generous health and wellness programs and time away for employees of all levels. LinkedIn is committed to fair and equitable compensation practices.

The pay range for this role is $152,000 to $248,000. Actual compensation packages are based on several factors that are unique to each candidate, including but not limited to skill set, depth of experience, certifications, and specific work location. This may be different in other locations due to differences in the cost of labor.

Equal Opportunity Statement

We seek candidates with a wide range of perspectives and backgrounds, and we are proud to be an equal opportunity employer. LinkedIn considers qualified applicants without regard to race, color, religion, creed, gender, national origin, age, disability, veteran status, marital status, pregnancy, sex, gender expression or identity, sexual orientation, citizenship, or any other legally protected class.

LinkedIn is committed to offering an inclusive and accessible experience for all job seekers, including individuals with disabilities. Our goal is to foster an inclusive and accessible workplace where everyone has the opportunity to be successful.

If you need a reasonable accommodation to search for a job opening, apply for a position, or participate in the interview process, connect with us and describe the specific accommodation requested for a disability-related limitation.

Reasonable accommodations are modifications or adjustments to the application or hiring process that would enable you to fully participate in that process. Examples of reasonable accommodations include, but are not limited to:

Documents in alternate formats or read aloud to you
Having interviews in an accessible location
Being accompanied by a service dog
Having a sign language interpreter present for the interview

A request for an accommodation will be responded to within three business days. However, non-disability related requests, such as following up on an application, will not receive a response.

LinkedIn will not discharge or in any other manner discriminate against employees or applicants because they have inquired about, discussed, or disclosed their own pay or the pay of another employee or applicant. However, employees who have access to the compensation information of other employees or applicants as a part of their essential job functions cannot disclose the pay of other employees or applicants to individuals who do not otherwise have access to compensation information, unless the disclosure is (a) in response to a formal complaint or charge, (b) in furtherance of an investigation, proceeding, hearing, or action, including an investigation conducted by LinkedIn, or (c) consistent with LinkedIn's legal duty to furnish information.

San Francisco Fair Chance Ordinance

Pursuant to the San Francisco Fair Chance Ordinance, LinkedIn will consider for employment qualified applicants with arrest and conviction records.

Pay Transparency Policy Statement

As a federal contractor, LinkedIn follows the Pay Transparency and non-discrimination provisions described at this link:

Data Privacy Notice for Job Candidates

Please follow this link to access the document that provides transparency around the way in which LinkedIn handles personal data of employees and job applicants:

Employment Type :

Full-time

Key Skills

Kubernetes
FMEA
Continuous Improvement
Elasticsearch
Go
Root Cause Analysis
Maximo
CMMS
Maintenance
Mechanical Engineering
Manufacturing
Troubleshooting

About Company

LinkedIn is the world’s largest professional network, built to create economic opportunity for every member of the global workforce. Our products help people make powerful connections, discover exciting opportunities, build necessary skills, and gain valuable insights every day. We’re ...