Data Center Operations Engineer

Santa Fe, NM - USA

Monthly Salary: Not Disclosed

Posted on: 30+ days ago

Vacancies: 1 Vacancy

Job Summary

At Cadence, we hire and develop leaders and innovators who want to make an impact on the world of technology.

The Data Center Operations Engineer is responsible for supporting, maintaining, and deploying critical data center infrastructure, with a strong focus on Linux-based systems, GPU server deployments, and InfiniBand networking. This role requires hands-on expertise in data center operations, cluster bring-up, hardware installation, and troubleshooting across compute, network, and GPU environments. The engineer will collaborate closely with global infrastructure, development, and operations teams to ensure reliable, secure, and scalable service delivery.

Key Responsibilities

Provide hands-on operational support for all data center projects, deployments, and repair activities.
Participate in an on-call rotation and provide on-site or remote support during maintenance windows and incidents.
Troubleshoot and resolve operational issues related to Linux servers, GPU platforms, networking, and storage infrastructure.
Support customer and internal deployments, ensuring timely and successful bring-up of GPU servers and clusters.
Perform InfiniBand fabric bring-up, switch configuration, subnet management, and troubleshooting.
Conduct daily health checks of Linux systems and infrastructure components, proactively identifying and mitigating risks.
Install, configure, test, and maintain server hardware (rack and stack, labeling, HDDs, memory, CPUs, RAID batteries, NICs, etc.).
Install, configure, and troubleshoot networking equipment, including routers, switches, and terminal servers for out-of-band management.
Review and validate equipment deployments against approved design documentation and standards.
Support data center builds, refreshes, migrations, and expansions while adhering to quality and safety standards.
Coordinate with vendors and onsite staff for hardware delivery, diagnostics, replacement, and warranty services.
Use monitoring and alerting frameworks to identify issues, escalate appropriately, and ensure timely service restoration.
Maintain accurate documentation of operational procedures, system configurations, and runbooks.
Follow established incident management and escalation procedures, and meet service-level agreements (SLAs).
Collaborate with global teams across time zones to support operational initiatives and continuous improvement efforts.
Contribute to process improvement initiatives and ensure adherence to documented policies, processes, and procedures.

Required Qualifications

Bachelor's degree in Computer Science, Engineering, Information Technology, or equivalent practical experience.
Strong hands-on experience in Linux environments, including system administration, troubleshooting, and performance validation.
Proficiency with Linux command-line tools and shell scripting (Bash or equivalent).
Experience with cluster bring-up, driver installation, and system-level configuration.
Hands-on experience setting up and validating GPU servers in clustered environments.
Experience with end-to-end GPU testing in InfiniBand-based clusters.
Working knowledge of InfiniBand networking, including switch configuration and subnet management.
Solid understanding of networking fundamentals, including the OSI model and the TCP/IP protocol suite (IP, ARP, ICMP, TCP, UDP, SMTP, FTP, TFTP).
Experience installing, configuring, and troubleshooting routers, switches, and terminal servers.
Familiarity with fiber and copper cabling, including IP and SAN deployments.
Experience managing incident tickets, maintaining acceptable ticket loads, and meeting SLAs.
Strong organizational skills with meticulous attention to detail in data center environments.
Ability to follow and enforce documented escalation procedures and operational policies.
Strong verbal and written communication skills, with the ability to collaborate effectively with cross-functional and global teams.

Preferred Qualifications

Experience supporting HPC, AI, or large-scale GPU environments.
Exposure to data center monitoring
Experience documenting operational processes and maintaining technical runbooks.
Familiarity with large-scale data center buildouts or refresh programs.

Physical Requirements

Ability to perform the essential functions of the role, including lifting, moving, and installing equipment weighing 50 pounds or more, with or without reasonable accommodation.
Ability to work in data center environments, including raised floors, equipment racks, and confined spaces.
Willingness to work flexible hours, including nights, weekends, and on-call rotations, as required.

Work Environment

On-site data center environment with occasional remote coordination.
Interaction with hardware vendors, service providers, and internal engineering teams.
Fast-paced operational setting requiring attention to detail, adherence to safety standards, and rapid problem resolution.

We're doing work that matters. Help us solve what others can't.

About Company

Cadence Systems

Do you want to shape the future of technology? Cadence is leading the charge to solve some of technology’s toughest challenges. We work with the world’s most innovative companies, across a growing range of industries. Major trends that you hear about every day – like artificial intell ...