Job Responsibilities:
- Maintain HPC system availability for the customer.
- Lead the technical output of onsite client HW technicians, system admins, and system analysts.
- Serve as the primary customer focal point for system support and onsite activities.
- Full-time, 100% presence on the customer site during standard business hours.
- Routine face-to-face and group interaction with the site team to organize tasks, follow up, and assist with challenges they encounter.
- Track system health and Cases; review regularly (weekly) with customers and HPC leadership.
- Maintain availability reports for tracking SLAs.
- Pre-plan system upgrades: review plans with the team and customers, arrange for staffing and equipment, and prearrange open lines of communication in case of issues.
- Escalate Cases, assist team members in escalating Cases to next-tier support, and follow up to drive closure via escalation processes.
- Manage onsite parts inventory using business tools.
- Manage site tools and equipment.
- Maintain the on-call schedule to support our 24x7x365 contracts.
- Assist with hardware and system installation activities for new systems.
Team Support:
- Build strong working relationships with teammates, leadership, and customers.
- Maintain awareness of upcoming training and prompt team members to complete trainings.
- Maintain a team calendar of planned leave, including the on-call schedule for operational issues.
- Provide performance review input to the District Service Manager (DSM) and suggestions for team member performance and development.
- Escalate to the DSM any personnel issues, risk of missing an SLA, or customer satisfaction concerns.
- Maintain a clean and safe working environment.
- Support the DSM in onboarding new team members by providing site-specific details (e.g., customer network accounts, badge, parking, etc.).
Required Qualifications & Experience:
- 8 years of professional experience and a Bachelor of Arts/Science or equivalent degree in computer science or a related area of study; without a degree, three additional years of relevant professional experience (11 years in total).
- In-depth knowledge of high-performance computing (HPC) systems.
- Proficiency in managing and optimizing HPC environments, including system configuration, performance tuning, and troubleshooting.
- Strong understanding of parallel computing, cluster management, and distributed computing technologies.
- Experience with HPC workload managers and schedulers such as SLURM, PBS, or similar.
- Advanced knowledge of Linux operating systems.
- Familiarity with software development tools and environments commonly used in HPC, including compilers, debuggers, and performance analysis tools.
- Experience with various scripting languages such as Python or Bash.
- Proven experience in system administration, including hardware and software installation, maintenance, and upgrades.
- Knowledge of network architecture, storage solutions, and data management within HPC environments.
- Ability to implement and manage security protocols and best practices in a high-performance computing context to maintain customer security posture.
- Strong project management skills, including planning, execution, and monitoring of HPC projects.
- Ability to lead and coordinate a team of technical professionals, ensuring timely and successful project delivery.
- Experience in resource allocation, budgeting, and performance metrics tracking for HPC projects.
- Excellent problem-solving abilities, with a focus on identifying root causes and implementing effective solutions.
- Strong analytical skills to assess system performance and make data-driven decisions for optimization.
- Ability to troubleshoot complex technical issues in a high-stakes HPC environment.
- Exceptional communication skills, both written and verbal, to effectively interact with team members, stakeholders, and clients.
- Ability to convey complex technical information in a clear and concise manner to non-technical audiences.
- Strong collaboration skills to work effectively within a multidisciplinary team and across organizational boundaries.
- Extensive experience in HPC system management and administration, with a track record of successful project and team leadership.
- Willingness to participate in ongoing professional development and training opportunities, which may require travel.
Preferred Qualifications:
- CompTIA A+ or Server+ Certification
- Security Certification
- Linux Certification
- PMP or Project+
- Vendor Certifications
- Experience with ticket-tracking software (Salesforce, SmartSheets; any ticket-tracking tool is acceptable)