Site Reliability Engineer (SRE) Artificial Intelligence (AI) Engineer

Honolulu, HI - USA

Monthly Salary: $131,300 - $237,350

Posted on: 18 hours ago

Vacancies: 1 Vacancy

Job Summary

The U.S. Navy's Service Management, Integration, and Transport (SMIT) program has an opening for a Site Reliability Automation and Orchestration Engineer on a high-visibility DoD program that provides engineering support to the Navy Marine Corps Intranet (NMCI), the largest information technology (IT) network in the world. This position will provide many opportunities to challenge and grow your skills.

The AI Reliability Engineer (AI-SRE) is responsible for integrating artificial intelligence and machine learning capabilities into Site Reliability Engineering (SRE) operations to improve system reliability, availability, performance, and operational efficiency. This role serves as a horizontal enabler across SRE pods, leveraging AI-driven insights to reduce operational toil, accelerate incident response, enhance observability, and enable predictive reliability engineering. The AI-SRE partners closely with infrastructure, network, application, cyber, and platform SRE teams to transform operational data into actionable intelligence while ensuring AI solutions are safe, explainable, auditable, and aligned with SRE principles.

Key Responsibilities

AIOps & Observability Intelligence

Design, develop, and maintain AI/ML models for anomaly detection, trend analysis, and signal correlation across metrics, logs, traces, and events.
Reduce alert noise through intelligent alert grouping, suppression, and prioritization.
Enhance observability platforms with AI-generated insights supporting SLO and error-budget management.

AI-Assisted Incident Management

Implement AI-driven incident classification, enrichment, and summarization.
Provide probable root-cause analysis recommendations based on historical and real-time telemetry.
Support on-call and incident response teams with AI-guided remediation suggestions.
Contribute AI insights to post-incident reviews and reliability improvement plans.

Automation & Ops-as-Code Enablement

Apply AI techniques to identify repetitive operational tasks and automation opportunities.
Assist in generating, validating, and optimizing automation playbooks and workflows.
Analyze automation execution data to improve success rates, resiliency, and reuse.

Knowledge Management & Runbook Intelligence

Build and maintain AI-searchable knowledge repositories containing runbooks, SOPs, lessons learned, and historical incident data.
Enable natural-language access to operational knowledge for SREs and operations staff.
Reduce dependency on tribal knowledge through intelligent documentation and discovery.

Predictive Reliability Engineering

Develop predictive models for capacity planning, failure forecasting, configuration risk, and reliability debt identification.
Support proactive remediation strategies to prevent incidents before customer impact.
Assist SRE leadership in data-driven prioritization of reliability investments.

Governance, Security & Trust

Ensure AI solutions adhere to organizational security, compliance, and data-handling policies.
Establish guardrails for AI recommendations, human-in-the-loop decision making, and automation execution.
Promote transparency, explainability, and auditability of AI-driven operational decisions.

Required Qualifications

Education and Requirements:

Bachelor's degree in Computer Science, Engineering, Information Systems, Data Science, or related discipline
5 years in Site Reliability Engineering, DevOps, IT Operations, or Systems Engineering
2 years applying AI/ML techniques in operational analytics or automation contexts
Demonstrated experience supporting production systems in high-availability environments
Must have an active Secret Clearance in order to be considered for the position

Technical Skills

Proficiency in data analysis tooling
Experience with machine learning fundamentals (anomaly detection, clustering, time-series analysis, NLP)
Familiarity with observability platforms (metrics, logs, traces, events)
Experience with automation frameworks and infrastructure-as-code concepts
Strong understanding of distributed systems and operational telemetry

If you're looking for comfort, keep scrolling. At Leidos, we outthink, outbuild, and outpace the status quo because the mission demands it. We're not hiring followers. We're recruiting the ones who disrupt, provoke, and refuse to fail. Step 10 is ancient history. We're already at step 30 and moving faster than anyone else dares.

Original Posting:

February 18, 2026

For U.S. Positions: While subject to change based on business needs, Leidos reasonably anticipates that this job requisition will remain open for at least 3 days, with an anticipated close date of no earlier than 3 days after the original posting date as listed above.

Pay Range:

$131,300.00 - $237,350.00

The Leidos pay range for this job level is a general guideline only and not a guarantee of compensation or salary. Additional factors considered in extending an offer include (but are not limited to) responsibilities of the job, education, experience, knowledge, skills, and abilities, as well as internal equity, alignment with market data, applicable bargaining agreement (if any), or other law.

Key Skills

Kubernetes
FMEA
Continuous Improvement
Elasticsearch
Go
Root Cause Analysis
Maximo
CMMS
Maintenance
Mechanical Engineering
Manufacturing
Troubleshooting

About Company

Leidos

Leidos is an innovation company rapidly addressing the world's most vexing challenges in national security and health. Our 47,000 employees collaborate to create smarter technology solutions for customers in these critical markets.