Data Center Engineer
Job Summary
We are seeking a highly skilled, hands-on Data Center Engineer to support a GPU-accelerated data center environment featuring direct-to-chip liquid cooling infrastructure. This role is critical to maintaining uptime, responding to infrastructure incidents, and ensuring operational excellence across power, cooling, server, and network environments.
The successful candidate will serve as the onsite technical expert during incidents and maintenance activities, providing real-time troubleshooting, physical investigations, vendor coordination, and accurate communication with internal teams, customers, and service providers.
This position is ideal for professionals who thrive in mission-critical environments and can remain calm, methodical, and decisive in high-pressure situations.
Tasks
Power Incident Response
- Respond immediately to power-related incidents affecting data center operations.
- Investigate facility-level and rack-level power issues, including UPS systems, PDUs, breakers, and server power supplies.
- Execute Emergency Operating Procedures (EOPs) and Maintenance Operating Procedures (MOPs).
- Safely isolate faulty equipment and perform approved recovery actions.
- Coordinate with facilities teams and remote engineering teams during incident resolution.
- Monitor infrastructure recovery and verify restoration of servers, switches, CDUs, and supporting systems.
Physical Network Troubleshooting
- Troubleshoot physical-layer network issues involving:
  - Fiber optic cabling
  - DAC cables
  - Copper cabling
  - Patch panels
  - Network switches
  - Transceivers and optics
- Verify connectivity, link status, and hardware alarms.
- Replace faulty cables, optics, transceivers, and patch cords when required.
- Perform fiber cleaning and inspection using approved procedures.
- Support remote network engineering teams during outage investigations.
Incident Management & Bridge Call Communication
- Participate in incident bridge calls with facilities teams, network engineers, vendors, management, and customers.
- Provide accurate, factual, and verified onsite observations.
- Maintain detailed timelines of actions taken and observations made during incidents.
- Update incident tickets and documentation in real time.
- Ensure proper operational records are maintained for audit and customer reporting purposes.
Vendor Coordination & Warranty Support
- Escort and supervise vendors, contractors, and third-party engineers onsite.
- Coordinate warranty repairs and hardware replacement activities.
- Validate the quality and completion of vendor-performed work.
- Ensure deployments and repairs meet operational and quality standards.
Rack & Stack / Deployment Activities
- Install and deploy servers, switches, and supporting infrastructure.
- Route and organize fiber, DAC, and copper cabling according to standards.
- Maintain proper cable management, labeling, and rack organization.
- Verify deployment quality and installation accuracy.
Inventory & Asset Management
- Maintain accurate records of:
  - Servers
  - Network equipment
  - Optics and transceivers
  - Cables and spare parts
- Track installations, replacements, decommissions, and inventory movements.
- Conduct physical audits and inventory verification.
- Maintain cleanliness and organization within data center and storage areas.
Preventive Maintenance & Daily Operations
- Perform daily walkthroughs and infrastructure inspections.
- Monitor CDU performance, including:
  - Flow rates
  - Temperatures
  - Pressure levels
- Verify PDU loads, cooling systems, cable conditions, and equipment health.
- Identify and report potential risks before they impact operations.
- Support scheduled maintenance activities across power, cooling, and network infrastructure.
Requirements
Required Qualifications
- 3-5 years of experience in Data Center Operations, Critical Facilities, Infrastructure Support, or a similar environment.
- Strong understanding of data center power systems, including:
  - UPS systems
  - PDUs
  - Circuit breakers
  - Redundant power architectures
- Hands-on experience with:
  - Fiber optics
  - DAC cables
  - Structured cabling
  - Transceivers and optics
  - Physical network troubleshooting
- Experience working with servers and network switches.
- Ability to follow runbooks, SOPs, MOPs, and escalation procedures.
- Strong troubleshooting and problem-solving skills.
- Excellent verbal and written communication skills.
- Comfortable working independently during critical incidents.
Preferred Qualifications
- Experience supporting GPU clusters, AI infrastructure, HPC environments, or liquid-cooled data centers.
- Familiarity with CDUs, facility chillers, and advanced cooling systems.
- Understanding of mission-critical facility operations.
- Experience working with enterprise hardware vendors and warranty processes.
Physical Requirements
- Ability to lift and move equipment up to 50 lbs (23 kg).
- Comfortable working in hot aisle environments.
- Ability to wear required PPE.
- Ability to kneel behind racks, work in confined spaces, and access under-floor infrastructure when required.
About Company
Founded in 2013, HTS have grown organically year on year and today provide services in over 50 countries. Through our industry knowledge and quality of service, our clients have come to rely upon us to not just keep the cogs turning but also to keep them at the forefront of technologic ...