NetOps - GPU Cluster Upgrades Network Operations
Remote
Expected Outputs:
A functional MAAS Web UI accessible to Google Admins that allows for the automated discovery power cycling disk wiping and OS provisioning of all compute nodes via IPMI/BMC.
A master playbook that can take a fresh Ubuntu install and fully configure it to standard without manual intervention.
A functional slurmctld (controller) and slurmd (on compute nodes) installation. (4) Standard NVIDIA dashboards imported into Grafana to display Real-time GPU Temperature Power Draw and Per-User Usage metrics.
Required Skillset:
The executing vendor would require proven expertise in the following areas:
Bare Metal and OS Management: Canonical MAAS (Metal-as-a-Service) installation and configuration IPMI/BMC and Ubuntu 22.04 LTS operating system provisioning.
Configuration Management: Advanced proficiency in developing and deploying Ansible Playbooks.
NVIDIA Software Stack: Installation and version-locking of NVIDIA Drivers (Headless) CUDA Toolkit (v12.x) and the NVIDIA DCGM Exporter. Ability to interpret NVIDIA XID errors and PCIe bus falling off issues.
Networking: Experience with Infiniband/RDMA networking configuration.
Workload Scheduling: Expertise in installing and configuring Slurm Workload Manager including Fair Share scheduling preemption rules and user management integration (Local or LDAP).
Monitoring and Visualization: Deployment and configuration of Prometheus and Grafana.
NetOps - GPU Cluster Upgrades Network Operations Remote Expected Outputs: A functional MAAS Web UI accessible to Google Admins that allows for the automated discovery power cycling disk wiping and OS provisioning of all compute nodes via IPMI/BMC. A master playbook that can take a fresh Ubuntu in...
NetOps - GPU Cluster Upgrades Network Operations
Remote
Expected Outputs:
A functional MAAS Web UI accessible to Google Admins that allows for the automated discovery power cycling disk wiping and OS provisioning of all compute nodes via IPMI/BMC.
A master playbook that can take a fresh Ubuntu install and fully configure it to standard without manual intervention.
A functional slurmctld (controller) and slurmd (on compute nodes) installation. (4) Standard NVIDIA dashboards imported into Grafana to display Real-time GPU Temperature Power Draw and Per-User Usage metrics.
Required Skillset:
The executing vendor would require proven expertise in the following areas:
Bare Metal and OS Management: Canonical MAAS (Metal-as-a-Service) installation and configuration IPMI/BMC and Ubuntu 22.04 LTS operating system provisioning.
Configuration Management: Advanced proficiency in developing and deploying Ansible Playbooks.
NVIDIA Software Stack: Installation and version-locking of NVIDIA Drivers (Headless) CUDA Toolkit (v12.x) and the NVIDIA DCGM Exporter. Ability to interpret NVIDIA XID errors and PCIe bus falling off issues.
Networking: Experience with Infiniband/RDMA networking configuration.
Workload Scheduling: Expertise in installing and configuring Slurm Workload Manager including Fair Share scheduling preemption rules and user management integration (Local or LDAP).
Monitoring and Visualization: Deployment and configuration of Prometheus and Grafana.
View more
View less