L2 Datacenter Support Engineer
Job Summary
We are looking for an experienced L2 Engineer to operate and support high-performance AI infrastructure platforms, including NVIDIA GPU clusters, InfiniBand fabrics, and Kubernetes-based IaaS environments.
This role focuses on deep infrastructure expertise, ensuring the performance, scalability, and reliability of the platform layer that powers AI workloads, without being responsible for the workloads themselves.
You will play a key role in bare metal lifecycle management, advanced InfiniBand troubleshooting, and platform stability, working closely with engineering teams to operate cutting-edge infrastructure at scale.
Key responsibilities:
- Troubleshoot and maintain InfiniBand fabrics, including performance tuning, link issues, and topology validation.
- Act as the escalation point for L1 for complex infrastructure and hardware issues.
- Own and maintain accurate infrastructure modeling, IPAM, and source-of-truth data in NetBox.
- Own InfiniBand fabric management and advanced troubleshooting, using Verity for configuration, monitoring, and optimization of high-performance interconnects.
- Diagnose and resolve issues across GPU servers, networking, storage, and Kubernetes platforms.
- Perform deep hardware and system-level diagnostics (GPUs, PCIe, NICs, firmware, etc.).
- Support Kubernetes platform stability (node health, networking, scheduling issues).
- Contribute to automation of provisioning and operational workflows.
- Lead incident response, root cause analysis (RCA), and post-incident improvements.
- Collaborate with vendors and internal engineering teams on complex issues.
- Support infrastructure upgrades, firmware management, and capacity expansion.
Qualifications:
Required Skills & Experience:
- 3-6 years of experience in infrastructure operations, datacenter engineering, or cloud platforms.
- Strong Linux systems expertise.
- Hands-on experience with bare metal provisioning systems and lifecycle management.
- Strong experience with InfiniBand networking (troubleshooting, performance, fabric management using UFM).
- Experience with IPAM/DCIM tools such as NetBox, and with Ethernet network configuration and validation using Verity.
- Solid understanding of datacenter networking, storage, and hardware architecture.
- Working knowledge of Kubernetes in production environments.
- Strong troubleshooting skills across hardware and distributed systems.
Preferred qualifications:
- Experience with NVIDIA GPU platforms and accelerated computing infrastructure.
- Familiarity with automation tools (Terraform, Ansible, etc.).
- Exposure to OpenStack (optional).
- Experience with observability stacks (Prometheus, Grafana, ELK).
Success in this role:
- Rapid resolution of complex infrastructure and networking issues.
- High reliability and performance of InfiniBand and GPU infrastructure.
- Scalable and efficient bare metal provisioning processes.
- Strong contribution to automation and operational excellence.
- Trusted escalation point and technical leader within the team.
Additional Information:
We offer:
- Work with an established Silicon Valley leader in the cloud infrastructure industry;
- Work with exceptionally passionate, talented, and engaging colleagues, helping Fortune 500 and Global 2000 customers implement next-generation cloud technologies;
- Be a part of cutting-edge open-source innovation;
- Thrive in the high-energy environment of a young company where openness, collaboration, risk-taking, and continuous growth are valued;
- Professional development and training;
- Attend conferences and working groups;
- Company outings, happy hours, hackathons, and tech talks;
- Receive a competitive compensation package with a strong benefits plan.
We are a Leader for Container Management in G2 (#2 after AWS)!
Remote Work:
No
Employment Type:
Full-time
About Company
Mirantis is an open cloud company that helps organizations achieve digital self-determination by giving them complete control over their strategic infrastructure. The company combines intelligent automation and cloud-native expertise for managing and operating virtual machines, contai ...