NetOps GPU Cluster Upgrades [Network Operations]

Mexico City - Mexico

Monthly Salary: Not Disclosed

Posted on: 30+ days ago

Vacancies: 1 Vacancy

Job Summary

Remote

Expected Outputs:

A functional MAAS Web UI, accessible to Google Admins, that allows for the automated discovery, power cycling, disk wiping, and OS provisioning of all compute nodes via IPMI/BMC.

A master playbook that can take a fresh Ubuntu install and fully configure it to standard without manual intervention.

A functional installation of slurmctld (on the controller) and slurmd (on the compute nodes).

Standard NVIDIA dashboards imported into Grafana to display real-time GPU temperature, power draw, and per-user usage metrics.

Required Skillset:

The executing vendor must have proven expertise in the following areas:

Bare Metal and OS Management: Canonical MAAS (Metal-as-a-Service) installation and configuration, IPMI/BMC, and Ubuntu 22.04 LTS operating system provisioning.

Configuration Management: Advanced proficiency in developing and deploying Ansible Playbooks.

NVIDIA Software Stack: Installation and version-locking of NVIDIA Drivers (Headless), CUDA Toolkit (v12.x), and the NVIDIA DCGM Exporter. Ability to interpret NVIDIA Xid errors and diagnose GPUs falling off the PCIe bus.

Networking: Experience with InfiniBand/RDMA networking configuration.

Workload Scheduling: Expertise in installing and configuring Slurm Workload Manager, including Fair Share scheduling, preemption rules, and user management integration (local or LDAP).

Monitoring and Visualization: Deployment and configuration of Prometheus and Grafana.
