Infrastructure Reliability Engineer
THE COMPANY:
STACK INFRASTRUCTURE (STACK) provides digital infrastructure to scale the world's most innovative companies. We are an award-winning industry leader in building, owning, and operating highly efficient, cost-effective wholesale colocation and cloud data centers. Each of our national facilities meets or exceeds the highest industry standards in all operational categories of availability, security, connectivity, and physical resilience.
STACK offers the scale and geographic reach that rapidly growing hyperscale and enterprise companies need. The world runs on data. Data runs on STACK.
THE POSITION:
STACK is looking for an Infrastructure Reliability Engineer who will act as a key member of STACK's Critical Operations team. This position will play a vital role in ensuring the ongoing performance, resiliency, and evolution of infrastructure systems across STACK's portfolio. This role requires deep technical fluency in data center power and cooling systems, a forensic mindset for failure analysis, and a proactive approach to risk reduction.
RESPONSIBILITIES:
- Lead deep-dive root cause analyses (RCAs) for critical incidents, connecting technical failures to design, process, and operational contributors.
- Inform and influence the design review and turnover process by identifying gaps in infrastructure handoffs, system limitations, or commissioning practices.
- Develop system-level failure mode mitigation strategies that improve uptime performance and reduce repeat incidents.
- Partner with Operations, Engineering, and Construction to identify design improvements needed to enhance operational reliability.
- Engage Original Equipment Manufacturers (OEMs) and vendors to challenge technical assumptions and advocate for long-term improvements.
- Support the evolution of maintenance standards and asset strategy for high-risk or complex systems (e.g., power distribution, cooling).
- Collaborate with Learning and Development to enhance technical training for site teams based on lessons from event investigations.
- Contribute to availability reporting, event response improvement, and risk trend monitoring to ensure service level agreement (SLA) commitments are met.
THE DETAILS:
- Location: Chicago (CHI) or Dallas-Fort Worth (DFW)
- Compensation: $170K - $190K plus 10% bonus potential
- Travel: 25% domestically
- Must be eligible to work in the United States
- Must pass a comprehensive background screening
MUST-HAVE QUALIFICATIONS:
- Bachelor's degree in engineering, or equivalent experience with high technical competency.
- 5-8 years of experience in critical infrastructure environments (e.g., data centers, substations, power generation, or utility systems).
- Strong technical fluency in electrical and/or mechanical systems: power distribution, uninterruptible power supply (UPS), generators, control systems, and heating, ventilation, and air conditioning (HVAC).
- Hands-on experience with root cause analysis and reliability methodologies (e.g., failure mode and effects analysis (FMEA), reliability-centered maintenance (RCM)).
- Demonstrated ability to work across disciplines to resolve complex technical issues.
- Expertise with commissioning (Cx) and infrastructure design review processes.
- Ability to analyze performance data and translate findings into practical improvements.
THIS MIGHT BE RIGHT FOR YOU IF:
- You're the person people call when something goes wrong, and you love figuring out why.
- You bring rigor and precision to every failure analysis and don't settle for surface-level fixes.
- You want to engineer reliability, not just react to issues.
- You enjoy working cross-functionally and are a collaborator who builds trust and consensus.
- You're driven by impact, not ego, and you measure success by improved system resilience.
- You thrive in the space between design intent and operational reality.
PREFERRED QUALIFICATIONS:
- Experience reviewing or developing engineering specifications.
- Background in vendor/OEM engagement and technical contract negotiation.
- Familiarity with computerized maintenance management systems (CMMS), data center infrastructure management (DCIM), or reliability-centered asset programs.
- Understanding of availability metrics and SLA management frameworks.
- Technical training or mentoring experience in field operations environments.
WHY STACK:
- We offer a competitive compensation package with strong benefits, including medical, dental, and vision insurance, a 401(k) program, flexible spending accounts, and even a cell phone subsidy.
- We foster a culture of appreciation, including peer-to-peer recognition and rewards programs.
- Fun is part of our DNA, with events, game nights, happy hours, and barbecues.
- We're growing, and this is a great time to join and make an impact!
STACK is an Equal Opportunity Employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity and expression, age, national origin, mental or physical disability, genetic information, veteran status, or any other status protected by federal, state, or local law.
Note to external agencies: We are not accepting any blind submissions or resumes/CVs from recruitment agencies. Any candidates sent to STACK Infrastructure will not be accepted or considered as a submission without a signed agreement in place.