Role- HPC consultant
Location- Fremont CA/ Tualatin OR
HPC Cluster & Scheduler Management
- Design configure tune and optimize SLURM partitions queues QoS and scheduling policies to maximize cluster utilization and workload efficiency.
- Perform in-depth analysis of job scheduling behavior bottlenecks and resource contention.
- Troubleshoot job failures performance degradation and scheduler-related issues in production HPC environments.
- Implement fair-share backfill reservations and policy-driven scheduling as required.
Storage Benchmarking & Procurement Support
- Lead HPC storage performance benchmarking using industry-standard tools (e.g. IOR FIO MDTest IOzone).
- Analyze I/O patterns of HPC workloads and map them to appropriate storage architectures (parallel file systems NVMe Lustre Spectrum Scale etc.).
- Provide technical input for storage selection and procurement including performance expectations sizing and cost-performance tradeoffs.
- Collaborate with vendors and internal teams during POCs and performance validation exercises.
HPC Application Build & Optimization
- Build install configure and maintain HPC applications compilers libraries and scientific software stacks.
- Optimize application performance using MPI OpenMP GPU acceleration (where applicable) and tuned math libraries.
- Support multiple compiler toolchains (GCC Intel LLVM NVIDIA HPC SDK etc.).
- Implement and manage environment modules (Lmod) or similar software management frameworks.
System Performance & Operations
- Conduct system-level performance tuning across compute memory network and storage layers.
- Diagnose node-level issues involving CPU GPU interconnects (InfiniBand/Ethernet) and OS configurations.
- Create operational runbooks performance baselines and troubleshooting documentation.
- Support cluster upgrades expansions and hardware refresh activities.
Collaboration & Delivery
- Work closely with application owners researchers and infrastructure teams to meet aggressive delivery timelines.
- Translate workload requirements into practical HPC configurations and optimizations.
- Provide clear technical guidance and recommendations to leadership and stakeholders.
Required Skills & Experience
Core HPC Skills
- 8 12 years of hands-on HPC engineering experience in production environments.
- Strong expertise with SLURM (configuration tuning troubleshooting).
- Solid understanding of Linux systems (RHEL/CentOS/Rocky/Alma preferred).
- Deep knowledge of HPC storage systems and I/O performance analysis.
- Proven experience building and optimizing HPC applications and libraries.
Technical Proficiency
- MPI implementations (Open MPI MPICH) OpenMP
- Compilers and toolchains (GCC Intel NVIDIA HPC SDK)
- Performance tools (perf vtune nvprof/nsys IB diagnostics)
- Environment modules (Lmod) package managers (Spack preferred)
- Bash/Python scripting for automation and diagnostics
Nice to Have
- Experience with GPU-based HPC workloads (NVIDIA CUDA ROCm).
- Exposure to cloud-based HPC (Azure AWS GCP).
- Familiarity with parallel file systems such as Lustre or IBM Spectrum Scale.
- Vendor engagement experience for HPC hardware/storage evaluations.
Role- HPC consultant Location- Fremont CA/ Tualatin OR HPC Cluster & Scheduler Management Design configure tune and optimize SLURM partitions queues QoS and scheduling policies to maximize cluster utilization and workload efficiency. Perform in-depth analysis of job scheduling behavior bottlenec...
Role- HPC consultant
Location- Fremont CA/ Tualatin OR
HPC Cluster & Scheduler Management
- Design configure tune and optimize SLURM partitions queues QoS and scheduling policies to maximize cluster utilization and workload efficiency.
- Perform in-depth analysis of job scheduling behavior bottlenecks and resource contention.
- Troubleshoot job failures performance degradation and scheduler-related issues in production HPC environments.
- Implement fair-share backfill reservations and policy-driven scheduling as required.
Storage Benchmarking & Procurement Support
- Lead HPC storage performance benchmarking using industry-standard tools (e.g. IOR FIO MDTest IOzone).
- Analyze I/O patterns of HPC workloads and map them to appropriate storage architectures (parallel file systems NVMe Lustre Spectrum Scale etc.).
- Provide technical input for storage selection and procurement including performance expectations sizing and cost-performance tradeoffs.
- Collaborate with vendors and internal teams during POCs and performance validation exercises.
HPC Application Build & Optimization
- Build install configure and maintain HPC applications compilers libraries and scientific software stacks.
- Optimize application performance using MPI OpenMP GPU acceleration (where applicable) and tuned math libraries.
- Support multiple compiler toolchains (GCC Intel LLVM NVIDIA HPC SDK etc.).
- Implement and manage environment modules (Lmod) or similar software management frameworks.
System Performance & Operations
- Conduct system-level performance tuning across compute memory network and storage layers.
- Diagnose node-level issues involving CPU GPU interconnects (InfiniBand/Ethernet) and OS configurations.
- Create operational runbooks performance baselines and troubleshooting documentation.
- Support cluster upgrades expansions and hardware refresh activities.
Collaboration & Delivery
- Work closely with application owners researchers and infrastructure teams to meet aggressive delivery timelines.
- Translate workload requirements into practical HPC configurations and optimizations.
- Provide clear technical guidance and recommendations to leadership and stakeholders.
Required Skills & Experience
Core HPC Skills
- 8 12 years of hands-on HPC engineering experience in production environments.
- Strong expertise with SLURM (configuration tuning troubleshooting).
- Solid understanding of Linux systems (RHEL/CentOS/Rocky/Alma preferred).
- Deep knowledge of HPC storage systems and I/O performance analysis.
- Proven experience building and optimizing HPC applications and libraries.
Technical Proficiency
- MPI implementations (Open MPI MPICH) OpenMP
- Compilers and toolchains (GCC Intel NVIDIA HPC SDK)
- Performance tools (perf vtune nvprof/nsys IB diagnostics)
- Environment modules (Lmod) package managers (Spack preferred)
- Bash/Python scripting for automation and diagnostics
Nice to Have
- Experience with GPU-based HPC workloads (NVIDIA CUDA ROCm).
- Exposure to cloud-based HPC (Azure AWS GCP).
- Familiarity with parallel file systems such as Lustre or IBM Spectrum Scale.
- Vendor engagement experience for HPC hardware/storage evaluations.
View more
View less