Site Reliability Engineer

ATI Holdings

Posted on : 24-07-2025

1 Vacancy

Job Location

Downers Grove, IL - USA

Salary

$110,000 - $150,000

Job Description

Overview

The Site Reliability Engineer is responsible for ensuring the reliability, availability, and performance of the company's information technology systems and infrastructure.

This highly skilled role bridges the gap between development and operations to optimize system performance. It requires a proactive mindset to drive innovation and to collaborate on strategy for system-wide improvements.

In this role, you will work with the development and operations teams to design, build, and maintain scalable and robust infrastructure, automate processes, and troubleshoot and resolve incidents while providing long-term solutions.

This role is a strategic partner capable of recognizing and analyzing trends, identifying opportunities, and aligning initiatives with organizational goals. The work directly impacts the stability and efficiency of the company's production environment, driving continuous improvement and resilience across the organization.

Responsibilities

System Reliability: Ensure high availability, performance, and scalability of production systems and infrastructure.
Monitoring & Alerting: Design, implement, and maintain monitoring tools, alerts, and dashboards to increase visibility of system performance and to proactively detect and resolve issues before they impact users.
Strategic Partner: Champion forward-looking strategies that anticipate industry trends and position the company for long-term success. Translate high-level vision into actionable roadmaps and measurable outcomes. Collaborate with cross-functional teams to define and establish service level objectives (SLOs) and service level agreements (SLAs) for critical systems.
Performance Optimization: Identify bottlenecks and optimize systems and services to improve latency, throughput, and resource usage. Perform capacity planning and resource allocation to ensure optimal system performance and scalability. Collaborate with development teams to implement and deploy new features and enhancements, ensuring they meet reliability and performance standards.
Automation & Tooling: Develop automation for routine tasks, deployments, and infrastructure management to reduce manual work and improve reliability.
Troubleshooting & Diagnostics: Analyze and resolve critical incidents and problems, including system failures, performance issues, and security breaches.
Incident Management: Respond to Level 3 system outages and performance issues; lead post-incident reviews and implement preventative measures.
Root Cause Analysis: Perform in-depth analysis of recurring issues and provide permanent, preventative solutions to reduce future incidents.
Documentation: Create and maintain technical documentation, including troubleshooting guides, procedures, and knowledge base articles, and champion the transition to Operations teams.
Continuous Learning: Stay up to date with industry best practices, new technologies, and emerging trends in site reliability engineering.

Qualifications

Minimum Education Required:

Bachelor's degree in IT, Computer Science, or a related field, or comparable work experience.

Minimum Experience

Required:

10 years of progressive experience in IT support roles, including hands-on experience in Level 2 support or system/network administration, with significant involvement in incident response, root cause analysis, and driving improvements to system resilience and observability.

3 years of experience in Site Reliability Engineering, DevOps, or a related role, with a strong focus on supporting production systems and experience with monitoring and logging tools such as Azure Monitor, Datadog, and PRTG, or equivalent.

Significant experience supporting infrastructure technologies such as Windows Server, Active Directory, Microsoft 365, networking, virtualization (e.g., VMware/Hyper-V), and/or Azure or AWS cloud platforms.

Experience leading and coordinating tasks with multiple teams/departments and with multiple users.

Experience in trend analysis, identifying process improvement opportunities, and providing recommendations in alignment with business goals.

Experience implementing security measures in a production environment.

Preferred:

Proven experience as a Site Reliability Engineer or equivalent role

Experience with agile and iterative development processes.

HIT Experience

Knowledge, Skills, and Abilities

Strong problem-solving and troubleshooting skills, with the ability to analyze and resolve complex technical issues and a focus on continuous improvement and automation.

Excellent communication and collaboration skills to work effectively with cross-functional teams

Proficiency in scripting languages such as PowerShell

Solid understanding of software development methodologies and DevOps principles.

Advanced understanding of networking principles and protocols.

Expertise in monitoring and logging tools such as Datadog and PRTG.

Knowledge of containerization technologies and orchestration tools.

Knowledge of security best practices

Ability to use and educate on a variety of processes, including documentation, automation, change management, and standardization.

Ability to use and educate on a variety of technologies, including cloud services (Azure and/or AWS), Active Directory, Office 365, MS Teams, Azure SSO, file servers, clustering, network administration, and other current technologies.

Familiar with ITSM processes and methodologies

Working knowledge of multi-tier architectures: load balancers, caching, web servers, application servers, and databases.

Ability to effectively prioritize and execute tasks with strong attention to detail in a high-pressure environment.

Skilled at handling multiple projects simultaneously, at times working independently and at times within a team-oriented, collaborative environment.

Ability to translate requirements to technical needs.

Licenses and Certifications

Required:

N/A

Preferred:

ITIL 4 certification

Certification in relevant technologies or frameworks is a plus (e.g., AWS Certified DevOps Engineer).

Virtual Employee

Yes

Salary Range

$110,000 - $150,000

Location/Org Data : Dept Number

CORPIL

Employment Type

Full-Time
