Compute Platform Engineer
Location: Dallas, TX (Hybrid)
Type: Direct Hire
Competitive base salary plus performance bonus
100% company-paid benefits
Overview
We are seeking a Compute Platform Engineer to support the reliability, performance, and operational health of large-scale, high-performance compute infrastructure that runs critical research and production workloads.
This role is responsible for maintaining and troubleshooting CPU- and GPU-based compute platforms, ensuring consistent performance at scale, and driving operational excellence across the environment. The position works closely with platform engineering, infrastructure operations teams, and hardware vendors to support a stable and highly available compute ecosystem.
The ideal candidate brings strong hands-on experience with HPC or AI infrastructure, deep knowledge of server hardware, and a proactive approach to troubleshooting, automation, and continuous improvement.
Key Responsibilities
Compute Infrastructure Engineering
Design, configure, and manage high-performance compute infrastructure composed of CPU and GPU nodes
Support large-scale HPC and AI platforms, ensuring systems are stable, performant, and production-ready
Perform diagnostics, tuning, and capacity planning to support efficient scale-out of compute environments
Hardware Reliability & Lifecycle Management
Manage the full firmware and BIOS lifecycle across compute infrastructure, including baselines, validation, rollout, and compliance
Troubleshoot complex hardware issues across CPU, GPU, DPU, NVSwitch, NIC, memory, PSU, and BMC components
Drive root cause analysis and implement solutions to improve system reliability and reduce recovery time
Analyze hardware lifecycle processes and recommend improvements for optimization and efficiency
Automation & Platform Operations
Automate health checks, onboarding workflows, and operational processes to improve deployment efficiency
Leverage Infrastructure-as-Code (IaC) methodologies to enable scalable and repeatable infrastructure management
Recommend and implement tooling and process improvements to enhance platform operations
Vendor & Cross-Functional Collaboration
Collaborate with hardware vendors to resolve firmware and system issues, providing detailed diagnostics, logs, and impact analysis
Work closely with infrastructure, platform, and operations teams to align on system performance and reliability goals
Support integration of hardware improvements across the broader environment
Monitoring, Performance & Security
Monitor hardware performance and identify opportunities for optimization
Implement best practices for platform security and system hardening
Ensure adherence to operational standards and data center processes
Technical Leadership
Act as a subject matter expert for compute infrastructure and hardware-related issues
Mentor junior engineers and contribute to a culture of continuous improvement and technical excellence
Required Experience
3 years of hands-on experience supporting large-scale compute platforms, HPC, or AI infrastructure
Strong experience with HPE server platforms such as ProLiant and Apollo
Experience working with NVIDIA GPUs, including A100, H100/H200, or similar
Solid understanding of server architecture, including UEFI/BIOS, PCIe devices, and out-of-band management systems (iLO, BMC)
Proven ability to troubleshoot complex hardware issues and coordinate with vendors for resolution
Experience with Linux in high-performance or latency-sensitive environments
Familiarity with core networking concepts, including DNS, DHCP, VLANs, switching, and routing
Experience working within data center environments and operational processes
Technical Skills
Experience with automation tools such as Ansible, Terraform, and CI/CD pipelines
Exposure to Infrastructure-as-Code (IaC) practices
Working knowledge of Kubernetes and/or OpenStack (preferred)
Strong problem-solving and analytical skills with the ability to operate in complex environments
Preferred Experience
Experience supporting AI platforms or next-generation GPU architectures
Exposure to large-scale distributed compute environments
Experience working in mission-critical or high-availability infrastructure environments