Role: Senior GPU Platform Engineer - AI Infrastructure Operations
Location: Redmond, WA (4 days a week onsite is a must)
Job Type: W2 Contract
Contract length: 1 year
Note: Only visa-independent candidates will be considered (no C2C or third-party candidates)
On-site position requiring regular hands-on access to hardware in lab or data center environments.
MUST HAVE SKILLS:
- Configuration Management
- GPGPU/GPU
- Hardware Troubleshooting
- Infrastructure & Operations
- Infrastructure Automation and Orchestration
- Linux Administration
Description:
- Join our team to operate and support cutting-edge GPU infrastructure powering AI and high-performance computing workloads for a leading global hyperscale cloud provider.
- In this hands-on role, you'll manage the full lifecycle of NVIDIA GPU platforms, from bring-up to break/fix, while ensuring optimal performance for advanced AI applications.
Responsibilities:
- Operate and maintain production GPU and bare-metal compute platforms with hands-on hardware management
- Perform physical infrastructure tasks, including rack/stack, cabling, power validation, and system bring-up
- Diagnose hardware faults, replace failed components, and coordinate vendor support for complex issues
- Install and configure Linux operating systems with GPU-specific drivers and software stacks
- Execute platform validation using diagnostic tools to ensure GPU health, stability, and performance
- Provision bare-metal systems through automated workflows while troubleshooting configuration issues
- Apply firmware, BIOS, and platform configuration changes following standardized change processes
Requirements:
- 5 years of professional experience supporting production server infrastructure in data center environments
- Strong Linux administration skills, with the ability to independently troubleshoot system-level issues
- Hands-on experience with physical server hardware, including diagnostics and component replacement
- Familiarity with GPU platforms (preferably NVIDIA) and their associated drivers and software stacks
- Experience working in structured, change-controlled production environments
- Knowledge of infrastructure monitoring tools and alert response procedures