Job Description:
Note: Role will require 100% fully onsite 5 days per week for the first 6 months. Candidate will be able to be remote after first 6 months.
Must have skills
PowerEdge Rack/Tower Experience RHEL and Ubuntu OS Experience working in a larger server environment.
PowerEdge XE server experience - PowerEdge XE-series GPU servers and accelerated/HPC environment
Strong experience with HPC operations in production environments (scheduling monitoring troubleshooting capacity management).
Hands on expertise with NVIDIA GPU infrastructure (preferably GB200/GB300 class racks H100/B100 or similar generations) including firmware/driver stack monitoring and lifecycle management.
Familiarity with liquid cooled data center infrastructure: working safely around cold plate / rear door heat exchanger systems understanding facility interfaces (CDU manifolds leak detection etc.).
Solid understanding of Linux administration (RHEL/CentOS/Ubuntu) in HPC/AI clusters: OS provisioning patching performance tuning and troubleshooting.
Experience with cluster management tools (e.g. Bright OpenHPC Slurm PBS/LSF Kubernetes based AI stacks or equivalent schedulers and orchestration frameworks).
Strong Day 2 operations mindset: incident response root-cause analysis change management and operational runbook creation.
Excellent customer facing communication skills and the ability to work on site embedded with the customers team and their operations staff.
Nice to have skills (if any)
Prior exposure to liquid-cooled or rack-scale GPU platforms (Nvidia HGX NVLink etc.) is highly desirable
Prior experience operating or supporting Dell XE/XE HPC & AI solutions and related management tooling.
Background in HP/HPE HPC environments (familiar with how HP is typically deployed/managed in HPC shops to ease comparison and migration).
Exposure to AI/ML and GenAI workloads running at scale on NVIDIA GPU infrastructure (MLOps pipelines model training/inference operations).
Experience with data center facilities coordination (power and cooling planning rack integration cabling standards change windows).
Scripting skills in Python/Bash/PowerShell for automation reporting and integration with monitoring/ITSM systems.
Familiarity with ITIL aligned processes (incident problem and change management) and documenting runbooks/SOPs.
Strong scripting/automation skills especially in Python and Bash for automating operational workflows health checks and reporting across large GPU clusters.
Experience building or maintaining infrastructure automation (e.g. Ansible Terraform or similar) for repeatable deployment configuration and lifecycle management of HPC/AI nodes.
Some scripting skills in Python/Bash/PowerShell for automation reporting and integration with monitoring/ITSM systems.
Experience with DevOps / Git based workflows (GitLab/GitHub CI/CD) to help with scripts automation playbooks and configuration as code.
The resident will be a full time on site operational resource focused on Day 2 operations for liquid cooled Dell XE racks with NVIDIA GB300 based configurations. Scope includes but is not limited to:
Day 2 operations & stability
Own day to day operational support for the initial 48 fully loaded GB300 racks scaling to grow toward 144 racks by year end.
Monitor system health performance and capacity; proactively identify and remediate issues impacting uptime or SLAs.
Perform incident triage troubleshooting and coordination with Dell and NVIDIA support as needed.
Support day to day data center activities
o Proactively walk the data center through the day watching/alerting customer of amber lights hot doors etc.
o Escort dispatched field engineers (when applicable)
o LOIS parts management (will be trained once onsite)
Maintain documentation of the environment: serial numbers elevations runbooks etc.
Coordinate with customer for any maintenance activities: power maintenance cooling maintenance containment wall maintenance etc.
Performs BIOS lifecycle management adding users patching
Triage with Dell support for scheduling FSE break/fix activities
Works at the direction of the customer team to update versions as needed
Resource must be onsite 5 days there are no exceptions
Training important the candidate selected will need to complete XE Server training. This will need to be completed prior to the starting the residency
Infrastructure management
Support rack level and node level lifecycle tasks: firmware/driver updates BIOS tuning OS patching and configuration consistency.
Assist with liquid cooling operations: safe work practices coordination with facilities for maintenance/change activity and monitoring of cooling performance and alarms.
Validate and maintain power cooling and space documentation as the environment scales from 48 to 144 racks.
HPC/AI platform operations
Support cluster scheduler / workload manager operations (job queues resource pools and performance tuning) in the context of GPU heavy workloads.
Work with to ensure that AI/HPC workloads are efficiently utilizing the GB300 infrastructure (GPU utilization job distribution bottleneck analysis).
Help integrate the Dell/NVIDIA environment into existing operational tools (monitoring logging and ticketing systems) where appropriate.
Operational process & enablement
Develop and maintain runbooks SOPs and checklists for common tasks (e.g. node bring up node retirement firmware updates GPU diagnostics cooling system checks).
Provide knowledge transfer and hands on coaching to operations staff so they can confidently manage the environment over time.
Support change management and planning for capacity expansions as racks are added through the year.
Continuous improvement & stakeholder alignment
Act as a trusted advisor on best practices for running large scale liquid cooled GPU environments.
Provide regular operational updates risk/issue reports and recommendations back to Dell account and services teams.
Identify opportunities for additional Dell services or enhancements that reduce risk and improve performance
Job Description: Note: Role will require 100% fully onsite 5 days per week for the first 6 months. Candidate will be able to be remote after first 6 months. Must have skills PowerEdge Rack/Tower Experience RHEL and Ubuntu OS Experience working in a larger server environment. PowerEdge XE ...
Job Description:
Note: Role will require 100% fully onsite 5 days per week for the first 6 months. Candidate will be able to be remote after first 6 months.
Must have skills
PowerEdge Rack/Tower Experience RHEL and Ubuntu OS Experience working in a larger server environment.
PowerEdge XE server experience - PowerEdge XE-series GPU servers and accelerated/HPC environment
Strong experience with HPC operations in production environments (scheduling monitoring troubleshooting capacity management).
Hands on expertise with NVIDIA GPU infrastructure (preferably GB200/GB300 class racks H100/B100 or similar generations) including firmware/driver stack monitoring and lifecycle management.
Familiarity with liquid cooled data center infrastructure: working safely around cold plate / rear door heat exchanger systems understanding facility interfaces (CDU manifolds leak detection etc.).
Solid understanding of Linux administration (RHEL/CentOS/Ubuntu) in HPC/AI clusters: OS provisioning patching performance tuning and troubleshooting.
Experience with cluster management tools (e.g. Bright OpenHPC Slurm PBS/LSF Kubernetes based AI stacks or equivalent schedulers and orchestration frameworks).
Strong Day 2 operations mindset: incident response root-cause analysis change management and operational runbook creation.
Excellent customer facing communication skills and the ability to work on site embedded with the customers team and their operations staff.
Nice to have skills (if any)
Prior exposure to liquid-cooled or rack-scale GPU platforms (Nvidia HGX NVLink etc.) is highly desirable
Prior experience operating or supporting Dell XE/XE HPC & AI solutions and related management tooling.
Background in HP/HPE HPC environments (familiar with how HP is typically deployed/managed in HPC shops to ease comparison and migration).
Exposure to AI/ML and GenAI workloads running at scale on NVIDIA GPU infrastructure (MLOps pipelines model training/inference operations).
Experience with data center facilities coordination (power and cooling planning rack integration cabling standards change windows).
Scripting skills in Python/Bash/PowerShell for automation reporting and integration with monitoring/ITSM systems.
Familiarity with ITIL aligned processes (incident problem and change management) and documenting runbooks/SOPs.
Strong scripting/automation skills especially in Python and Bash for automating operational workflows health checks and reporting across large GPU clusters.
Experience building or maintaining infrastructure automation (e.g. Ansible Terraform or similar) for repeatable deployment configuration and lifecycle management of HPC/AI nodes.
Some scripting skills in Python/Bash/PowerShell for automation reporting and integration with monitoring/ITSM systems.
Experience with DevOps / Git based workflows (GitLab/GitHub CI/CD) to help with scripts automation playbooks and configuration as code.
The resident will be a full time on site operational resource focused on Day 2 operations for liquid cooled Dell XE racks with NVIDIA GB300 based configurations. Scope includes but is not limited to:
Day 2 operations & stability
Own day to day operational support for the initial 48 fully loaded GB300 racks scaling to grow toward 144 racks by year end.
Monitor system health performance and capacity; proactively identify and remediate issues impacting uptime or SLAs.
Perform incident triage troubleshooting and coordination with Dell and NVIDIA support as needed.
Support day to day data center activities
o Proactively walk the data center through the day watching/alerting customer of amber lights hot doors etc.
o Escort dispatched field engineers (when applicable)
o LOIS parts management (will be trained once onsite)
Maintain documentation of the environment: serial numbers elevations runbooks etc.
Coordinate with customer for any maintenance activities: power maintenance cooling maintenance containment wall maintenance etc.
Performs BIOS lifecycle management adding users patching
Triage with Dell support for scheduling FSE break/fix activities
Works at the direction of the customer team to update versions as needed
Resource must be onsite 5 days there are no exceptions
Training important the candidate selected will need to complete XE Server training. This will need to be completed prior to the starting the residency
Infrastructure management
Support rack level and node level lifecycle tasks: firmware/driver updates BIOS tuning OS patching and configuration consistency.
Assist with liquid cooling operations: safe work practices coordination with facilities for maintenance/change activity and monitoring of cooling performance and alarms.
Validate and maintain power cooling and space documentation as the environment scales from 48 to 144 racks.
HPC/AI platform operations
Support cluster scheduler / workload manager operations (job queues resource pools and performance tuning) in the context of GPU heavy workloads.
Work with to ensure that AI/HPC workloads are efficiently utilizing the GB300 infrastructure (GPU utilization job distribution bottleneck analysis).
Help integrate the Dell/NVIDIA environment into existing operational tools (monitoring logging and ticketing systems) where appropriate.
Operational process & enablement
Develop and maintain runbooks SOPs and checklists for common tasks (e.g. node bring up node retirement firmware updates GPU diagnostics cooling system checks).
Provide knowledge transfer and hands on coaching to operations staff so they can confidently manage the environment over time.
Support change management and planning for capacity expansions as racks are added through the year.
Continuous improvement & stakeholder alignment
Act as a trusted advisor on best practices for running large scale liquid cooled GPU environments.
Provide regular operational updates risk/issue reports and recommendations back to Dell account and services teams.
Identify opportunities for additional Dell services or enhancements that reduce risk and improve performance
View more
View less