Compute Platform Engineer

Job Location:

Marshall County, WV - USA

Monthly Salary: Not Disclosed
Posted on: 8 hours ago
Vacancies: 1 Vacancy

Job Summary

Compute Platform Engineer
Location: Dallas, TX (Hybrid)
Type: Direct Hire

Competitive base salary plus performance bonus
100% company-paid benefits

Overview

We are seeking a Compute Platform Engineer to support the reliability, performance, and operational health of large-scale, high-performance compute infrastructure supporting critical research and production workloads.

This role is responsible for maintaining and troubleshooting CPU- and GPU-based compute platforms, ensuring consistent performance at scale, and driving operational excellence across the environment. The position works closely with platform engineering, infrastructure operations teams, and hardware vendors to support a stable and highly available compute ecosystem.

The ideal candidate brings strong hands-on experience with HPC or AI infrastructure, deep knowledge of server hardware, and a proactive approach to troubleshooting, automation, and continuous improvement.

Key Responsibilities

Compute Infrastructure Engineering

Design, configure, and manage high-performance compute infrastructure composed of CPU and GPU nodes
Support large-scale HPC and AI platforms, ensuring systems are stable, performant, and production-ready
Perform diagnostics, tuning, and capacity planning to support efficient scale-out of compute environments

Hardware Reliability & Lifecycle Management

Manage the full firmware and BIOS lifecycle across compute infrastructure, including baselines, validation, rollout, and compliance
Troubleshoot complex hardware issues across CPU, GPU, DPU, NVSwitch, NICs, memory, PSU, and BMC components
Drive root cause analysis and implement solutions to improve system reliability and reduce recovery time
Analyze hardware lifecycle processes and recommend improvements for optimization and efficiency

Automation & Platform Operations

Automate health checks, onboarding workflows, and operational processes to improve deployment efficiency
Leverage Infrastructure-as-Code (IaC) methodologies to enable scalable and repeatable infrastructure management
Recommend and implement tooling and process improvements to enhance platform operations

Vendor & Cross-Functional Collaboration

Collaborate with hardware vendors to resolve firmware and system issues, providing detailed diagnostics, logs, and impact analysis
Work closely with infrastructure, platform, and operations teams to align on system performance and reliability goals
Support integration of hardware improvements across the broader environment

Monitoring, Performance & Security

Monitor hardware performance and identify opportunities for optimization
Implement best practices for platform security and system hardening
Ensure adherence to operational standards and data center processes

Technical Leadership

Act as a subject matter expert for compute infrastructure and hardware-related issues
Mentor junior engineers and contribute to a culture of continuous improvement and technical excellence

Required Experience

3 years of hands-on experience supporting large-scale compute platforms, HPC, or AI infrastructure
Strong experience with HPE server platforms such as ProLiant and Apollo
Experience working with NVIDIA GPUs, including A100, H100/H200, or similar
Solid understanding of server architecture, including UEFI/BIOS, PCIe devices, and out-of-band management systems (iLO, BMC)
Proven ability to troubleshoot complex hardware issues and coordinate with vendors for resolution
Experience with Linux in high-performance or latency-sensitive environments
Familiarity with core networking concepts, including DNS, DHCP, VLANs, switching, and routing
Experience working within data center environments and operational processes

Technical Skills

Experience with automation tools such as Ansible, Terraform, and CI/CD pipelines
Exposure to Infrastructure-as-Code (IaC) practices
Working knowledge of Kubernetes and/or OpenStack (preferred)
Strong problem-solving and analytical skills, with the ability to operate in complex environments

Preferred Experience

Experience supporting AI platforms or next-generation GPU architectures
Exposure to large-scale distributed compute environments
Experience working in mission-critical or high-availability infrastructure environments
