Other Service Line Program Manager PRGMGR

ReqRoute, Inc

Job Location:

Raleigh, WV - USA

Monthly Salary: Not Disclosed

Posted on: 30+ days ago

Vacancies: 1 Vacancy

Job Summary

Onsite - 333 Lakeside Dr Foster City CA 94404 United States
Max vendor rate is $70/hr
8 hours per day

Senior Cloud Engineer

Top Skills Required for this role:

1. HPC (high-performance computing)
2. Cloud services
3. CI/CD

Scope of Work:

HPC Cluster Deployment:

Automate HPC cluster deployment with CI/CD pipelines built on GitHub pipelines and AWS Systems Manager.
Implement CI/CD pipelines to manage and deploy updates to the HPC cluster efficiently.
Set up and configure HPC clusters to meet specific requirements and workloads.
Manage and maintain HPC hardware components such as CPUs and GPUs, along with the necessary software.
Conduct regression testing to verify the functionality and performance of non-GxP HPC clusters.

Workload Scheduler Management:

Install and configure workload managers and schedulers such as LSF, Slurm, and PBS Pro.
Manage the addition and removal of compute nodes and adjust the priority of master and slave nodes.
Develop and manage resource policies and rules to optimize cluster performance.
Configure and allocate resources such as CPU and memory, and profile applications for optimal performance.
Address and resolve issues related to schedulers, daemons, and license servers.

Network and High-Performance Connectivity Management:

Install and configure HPC interconnect networks.
Design and configure the network topology for HPC clusters.
Maintain and monitor InfiniBand connectivity.
Resolve connectivity issues related to InfiniBand, RoCE, and Ethernet.

Monitoring and Reports:

Produce daily health check reports for the HPC cluster.
Automate monitoring scripts to streamline the monitoring process.
Conduct periodic reviews of reports and audit trails.

OS Administration and Management:

Install and configure operating systems for HPC clusters.
Address OS-related issues such as CPU, memory, and swap utilization, and perform application file system cleanup.
Ensure application service continuity by performing pre- and post-checks from both OS and application perspectives during planned and unplanned outages.

Applications and Tools:

Install HPC libraries and tools such as MPI and compilers.
Install and configure HPC applications, both commercial off-the-shelf (COTS) and open source, and manage packages using Spack.
Apply patches and upgrades to HPC applications.
Resolve issues related to HPC applications.

HPC Storage Management:

Administer and configure HPC storage systems.
Oversee the administration of HPC file systems.
Monitor and troubleshoot HPC storage systems.
Manage backup and tape library systems.

Key Responsibilities

Cluster Management: Install, configure, and maintain compute nodes, GPUs (NVIDIA), high-speed storage (Lustre, GPFS), and interconnects (InfiniBand, RoCE).

Performance Tuning: Optimize scientific applications, kernels, and workflows for maximum throughput, scalability, and minimal queue times.
User Support: Act as a technical expert for researchers by debugging jobs, resolving complex issues, and providing training on tools and best practices.
Software Management: Manage workload managers (Slurm, LSF), schedulers, software licensing (FlexLM), OpenPBS, containers (Singularity), and compilers.

Infrastructure: Administer high-speed interconnects (InfiniBand), storage (Lustre, Ceph), and potentially cloud/hybrid solutions.
Implement and manage monitoring (Grafana, Prometheus) and orchestration tools (Slurm, Kubernetes).
Automation: Develop scripts (Python, Ansible) for provisioning, monitoring, and automating routine tasks.
Security & Policy: Implement and enforce security policies, manage user access, and oversee lifecycle management.

Essential Skills & Qualifications

Technical Expertise: Strong Linux, Python scripting (Ansible, Terraform), HPC schedulers (Slurm), networking (InfiniBand), and GPU computing.

The team will have knowledge of Gilead systems and AWS CI/CD pipelines.

HPC Domain Knowledge: Experience with parallel file systems, workload management, and performance analysis tools.
Problem Solving: Excellent analytical and debugging skills for complex distributed systems.
Communication: Ability to explain complex technical issues to scientists and non-technical stakeholders.

Experience: Hands-on experience in data centers managing large clusters and supporting diverse scientific/AI workloads.

About Company

ReqRoute, Inc

Our market niche is social media recruiting, and we effectively use social media platforms to reach out to a pool of active/passive candida ...
