HPC System Administrator


Job Location:

Santa Clara County, CA - USA

Monthly Salary: Not Disclosed
Posted on: 19 hours ago
Vacancies: 1 Vacancy

Job Summary

Position Title:

HPC System Administrator

Position Type:

Regular

Hiring Range:

$129,000 - $161,265 annually. Compensation will be based on education, experience, skills relevant to the role, and internal equity.

Pay Frequency:

Annual

A. POSITION PURPOSE

The High-Performance Computing (HPC) System Administrator is an expert, hands-on role responsible for the design, configuration, optimization, and operation of the organization's high-performance computing infrastructure. This individual will focus on advanced system optimization, complex troubleshooting, and strategic planning for future infrastructure enhancements across compute, storage, and high-speed interconnects (InfiniBand). A key responsibility is to mentor and cross-train existing system administrators, building the team's collective HPC expertise, strengthening shared support capabilities, and ensuring long-term operational resilience and efficiency.

The HPC Systems Administrator is a member of the Enterprise Systems team within the Cyberinfrastructure Technologies department. The incumbent works with the other Cyberinfrastructure teams - Network and Telecommunications, Enterprise Applications, and the Information Security Office - and other campus divisions in coordinating services, providing support, and providing appropriate guidance. The incumbent will also work with University vendors and partners.

The HPC Systems Administrator will have a passion for providing excellent customer service and a focus on continual improvement across all units; a commitment to supporting innovative infrastructure technologies; and a desire to identify and deliver the best possible technology resources and services to meet the needs of the campus community.

B. ESSENTIAL DUTIES AND RESPONSIBILITIES

1. HPC Infrastructure Management and Optimization

  • Compute: Manages the entire lifecycle of all compute nodes, including procuring, installing, configuring, and maintaining hardware, operating systems, and core system software, to ensure optimal performance, stability, and resource utilization for scientific workloads.

  • Storage: Directs the management of the high-performance parallel file systems (e.g., Lustre, GPFS), NAS, and backup solutions, executing capacity planning, performance tuning, and integrity checks to guarantee secure, high-speed, and reliable data access for all users.

  • InfiniBand: Designs, deploys, and provides expert-level troubleshooting and maintenance for the InfiniBand high-speed interconnect fabric, ensuring low-latency, high-bandwidth inter-node communication essential for scalable HPC application performance.

2. Workload Management and System Deployment

  • Slurm: Administers, configures, and tunes the Slurm Workload Manager, actively managing job queues, partitions, and resource allocation policies to enforce fair-share scheduling, maximize cluster utilization, and meet diverse research computational needs.

  • System Imaging: Develops, maintains, and updates standardized, optimized system images for all compute nodes, utilizing automation tools to facilitate rapid, consistent deployment, efficient patching, and streamlined upgrades across the cluster environment.

  • Software Licenses: Oversees the administration and compliance of all commercial scientific software licenses, ensuring adherence to vendor agreements and strategically managing license servers and usage policies to optimize utilization and accessibility for the HPC user base.

3. Team Development and Strategic Planning

  • Knowledge Transfer: Develops and implements a formal cross-training program for existing system administrators by creating documentation and delivering hands-on instruction to enhance the team's collective expertise in HPC-specific technologies (Slurm, InfiniBand, parallel file systems).

  • Operational Resilience: Ensures robust shared support capabilities across the IT team by strategically transferring HPC knowledge, actively preventing single points of failure, and improving the overall efficiency and responsiveness of the operational support model.

  • Strategic Enhancement: Contributes to the strategic planning and roadmap development for future HPC infrastructure and software enhancements by researching emerging technologies, evaluating vendor solutions, and providing expert recommendations to ensure the environment remains cutting-edge and meets long-term organizational goals.

4. Coordination and Collaboration

  • Use broad expertise and unique skills to play an active role as a technical expert during the planning and implementation phases of new technologies and participate in architecture brainstorming and design discussions with technical team members.

  • Provide technical guidance on complex infrastructure architecture challenges to IS team members and other solution partners.

  • Act as a role model for developing and trying different problem-solving approaches and supporting team members to do the same.

  • Coaches and develops new team members on how to provide the best customer service.

  • Models and supports other team members to conduct themselves with openness and honesty to enhance positive relationships based on trust, predictability, and communication.

5. Resource Planning

  • Provide input on setting Enterprise Systems and CIT goals, objectives, and strategies based on the University's mission, goals, and strategic plan.

  • Provide input in technology planning processes to develop cost-effective, customer-focused solutions.

  • Uses strong technical and organizational knowledge to plan and lead projects and working groups.

6. Service Delivery

  • Work closely with the ES Manager in the creation, planning, maintenance, and secure expansion of SCU's computing infrastructure. This includes, but is not limited to, local and hosted servers, virtual appliances and devices, and storage.

  • Work closely with ES Manager to ensure that architecture principles and standards are consistently applied across the data center compute and storage services.

  • Collaborate with the Information Security Office (ISO) to ensure a secure and compliant enterprise environment.

  • Work with the ISO to ensure that systems are secure and to plan for future security needs and threats.

  • Ensure the appropriate distribution of infrastructure services to faculty, staff, and students.

  • Create and document standards and practices regarding data center compute and storage services for use across the University.

  • Oversee the creation and performance of infrastructure production and test environments.

  • Create scalable, interoperable, and flexible infrastructure solutions.

  • Support assigned systems with on-call availability and respond within agreed upon timeframes.

  • Analyze and evaluate processes to document and implement standard routines and processes for applying patches/updates to operating systems, applications, hardware, and firmware, ensuring all physical, virtual, and hosted systems are patched with the appropriate level of security and versioning.

  • Participate as necessary in backup operations, ensuring all required file systems and system data are successfully backed up to the appropriate media and are available off site.

  • Participate in disaster recovery and business continuity planning.

  • Perform daily system monitoring, verifying integrity and availability of all hardware, server resources, systems, and key processes. Check for potential problems, resource availability, capacity, performance and load characteristics, network integrity, and security threats. Monitor systems activity and usage to maintain a secure environment. Develop related solutions as warranted.

  • Work with the CISO and system stakeholders to establish upgrade and update schedules and maintenance windows.

  • Keep abreast of software releases and updates, keeping all systems at current release levels as appropriate for the successful operation of the data center in support of the University.

  • Serve as the liaison with hosted platform and third-party providers to monitor service level agreements and ensure that performance expectations and requirements are met.

7. Service Optimization

  • Enhance existing architecture frameworks in order to define, design, and implement simplified, standards-based system architectures.

  • Assist in the design, planning, and implementation of infrastructure systems optimization and process improvement projects.

  • Test and assess existing infrastructure against industry standard internal and external benchmarks to ensure optimal performance and service delivery.

  • Participate in IT and information security audits and prioritize corrective actions and successful remediation of areas supervised to ensure that continuous improvements are made on an ongoing basis.

  • Participate in the change management process to ensure all changes to relevant services are documented, tested, deployed, and prepped for back-out strategies if necessary.

  • Stay aware of industry trends and how to incorporate them into our infrastructure environment to improve services and/or cut costs.

8. Communication

  • Effectively communicate complex data analyses to provide technical and strategic input during the planning phase of potential projects in the form of technical architecture designs and recommendations.

  • Regularly communicate with Cyberinfrastructure Technologies colleagues regarding initiatives.

  • Keep the ES Manager informed of current and potential issues, activities, operational outages, and any other risks that might jeopardize or degrade IT service delivery to the University community.

9. Operations

  • Suggest operation strategies to accommodate major shifts in customer needs.

  • Determine procedures and methods for operational tasks required to maintain data center servers and related systems in reliable, stable operation. This person will use their experience and judgment to plan and accomplish goals and objectives and to identify potential problems and define/implement solutions.

  • Supports academic programs by providing the necessary expertise and technical support to make faculty and student technology adoption successful, e.g., consulting with faculty launching initiatives, identifying their needs, evaluating solutions, and implementing those solutions.

  • Support compute and storage needs of institutional programs by providing the necessary expertise and technical support to build a robust and highly available solution to meet their needs.

  • Empower end users to successfully use the technology.

  • Interface with vendors and external resources.

  • Evaluate new software or systems under consideration for adoption.

  • Ensure asset management procedures are maintained and documented.

  • Work with the Enterprise Systems team to automate and streamline procedures within the department.

  • On occasion, work beyond and in addition to traditional work schedules/hours. Required to carry a cell phone and be on-call.

  • Utilize technologies and tools to support the compute and storage infrastructure: programming, scripting, and diagnostic tools.

10. Other duties as assigned by the Manager of Enterprise Systems and IS leadership.

C. PROVIDES WORK DIRECTION

May supervise student workers.

D. GENERAL GUIDELINES

  • Recommends initiatives and implements changes to improve quality and services.

  • Identifies and determines cause of problems; develops and presents recommendations for improvement of established processes and practices.

  • Maintains contact with customers and solicits feedback for improved services.

  • Maximizes productivity through use of appropriate tools; plans training and performance initiatives.

  • Researches and develops resources that create timely and efficient work flow.

  • Prepares progress reports; informs supervisor of project status and deviation from goals. Ensures completeness, accuracy, and timeliness of all operational functions.

  • Prepares and submits reports as requested and required.

  • Develops and implements guidelines to support the functions of the unit.

E. QUALIFICATIONS

To perform this job successfully, an individual must be able to perform each essential duty satisfactorily. The items below are representative of the knowledge, skills, abilities, education, and experience required or preferred.

This position requires the ability to effectively establish and maintain cooperative working relationships within a diverse, multicultural environment.

1. Knowledge, Skills, and Abilities

General

  • Knowledge of information technology, campus technology, and information security issues and trends in higher education, and ability to continually develop new knowledge regarding the same.

  • Ability to listen and understand customer needs.

  • Ability to plan, implement, and evaluate customer service initiatives.

  • Ability to work in a collaborative environment as either a member or leader of a team to meet deadlines and achieve goals.

  • Ability to manage a diverse workforce to provide excellent customer service.

  • Self-motivated and shows initiative.

  • Ability to successfully manage multiple projects simultaneously.

  • Proven track record in project planning and project management.

  • Ability to exercise independent judgment and engage in critical thinking and problem solving.

  • Ability to work effectively under pressure in a busy (sometimes chaotic) and demanding information services environment.

  • Ability to explain technical issues and policies to non-technical people.

  • Ability to give presentations on technical issues to a broad range of audiences.

  • Ability to foster and maintain good working relationships with faculty, administrators, students, senior management, and other leaders.

  • Ability to handle sensitive matters with diplomacy and the ability to mediate between competing parties.

  • Ability to maintain confidentiality and manage confidential information.

  • Must possess impeccable integrity.

  • Ability to speak truth to power.

  • Appreciation for the University's mission, vision, values, priorities, procedures, and policies.

Position-specific

  • Knowledgeable and experienced in large-scale computer center operations with multiple systems running Linux and Windows Server operating systems

  • Experience with managing and operating SAN storage environments

  • Strong proficiency in the management of multi-platform hardware and software environments, including Microsoft and Linux (Red Hat).

  • Experience with configuration management tools such as Ansible and Warewulf

  • Experience with Slurm and job scheduling

  • Strong proficiency with scripting languages (Python, Bourne Shell, Perl, etc.)

  • Experience with compiling software packages and managing software modules in an HPC environment (EasyBuild, Lmod)

  • Experience with racking servers and adding PCI cards

  • Experience with LDAP and DNS

  • Experience with parallel file systems

  • Experience with InfiniBand networks

  • The University technology environment is very dynamic and challenging. A person with a wide breadth of experience and who can adapt to changes working in a complex technology infrastructure environment is sought.

  • Experience with vSphere ESXi

  • Experience in using and configuring system monitoring tools

  • Experience with enterprise backups

  • Experience with cloud providers

  • Skilled technical troubleshooter. Must be able to analyze and solve complex problems.

  • Knowledgeable in the use of a personal computer and standard productivity tools

  • Experience interacting and working with other people in a successful customer service capacity

  • Knowledge of industry trends in enterprise infrastructure/data center technology, including automation tools, cloud technology, disaster recovery, virtualization, networking, security, and other pertinent areas.

  • Experience with Identity and Access Management (IAM)

  • Excellent interpersonal, written, and verbal communication skills

  • Demonstrated ability to work in a collaborative team environment

  • Strong organizational skills and ability to multitask

  • Must be a self-starter and show initiative to proactively identify and resolve problems

  • Must have the ability to acquire and apply new skills quickly

  • Strong customer service orientation

  • Understands the role of enterprise computing in University business processes

  • Works under limited supervision


2. Education

  • Bachelor's degree in a directly applicable field of study (Computer or Electrical Engineering, Math/Computer Science, Operations and Management Information Science)

  • Advanced Degree preferred in directly applicable field of study or a field of management

3. Experience

  • 8 years of applicable experience in the operation, maintenance, support, and design of enterprise-wide computer center systems with demonstrated increasing responsibilities

  • 2 years of experience supporting an HPC environment required, including experience in Slurm or a similar workload manager; InfiniBand or a similar high-speed interconnect; and Lustre or a similar parallel file system.

  • Experience working for the needs of Higher Education or research organizations is desirable

F. PHYSICAL DEMANDS

The physical demands described below are representative of those that must be met by an employee to successfully perform the essential functions of this job. In accordance with the Americans with Disabilities Act, as amended, the California Fair Employment & Housing Act, and all other applicable laws, SCU provides reasonable accommodations for qualified persons with disabilities. A qualified individual is a person who meets skill, experience, education, or other requirements of the position and who can perform the essential functions of the position with or without reasonable accommodation.

  • Must be able to handle and operate equipment which can be located in racks, on tall shelves, or in cabinets

  • Must be able to climb ladders and work at heights

  • Must be able to work in confined spaces: crawl spaces, under raised flooring, and in or under furniture

  • Considerable time is spent at a desk using a computer terminal

  • Will be required to travel to other buildings on the campus

  • May be required to occasionally travel to remote campuses, outside customers, vendors, or suppliers

  • May be required to attend conferences and training sessions within the Bay Area or at in- or out-of-state locations

G. WORK ENVIRONMENT

The work environment characteristics described below are representative of those an employee encounters while performing the essential functions of this job.

  • Typical office and computer lab environment

  • Mostly indoor office environment with windows

  • Offices with equipment noise

  • Offices with frequent interruptions

  • Data centers, wiring and equipment closets with loud noise, low light, and tight spaces

  • Raised floor, under-floor access, and above-ceiling spaces

  • Roofs, high on walls, ceilings, basements, and other locations where equipment is stored

Telecommute

Santa Clara University is registered to do business in the following states: California, Nevada, Oregon, Washington, Arizona, and Illinois. Employees approved to telecommute are required to perform their work within one of these states.

EEO Statement

Equal Opportunity/Notice of Nondiscrimination

Santa Clara University is an equal opportunity employer. All qualified applicants are encouraged to apply and will receive consideration for employment without regard to race, color, ethnicity, national origin, citizenship, ancestry, religion, age, sex, sexual orientation, gender, gender expression, gender identity, marital status, parental status, veteran or military status, physical or mental disability, medical conditions, pregnancy or related conditions, reproductive health decision making, or any other characteristic protected by federal, state, or local laws. For a complete copy of Santa Clara University's equal opportunity and nondiscrimination policies, please visit the Office of Equal Opportunity and Title IX website.

Notice of Availability

Santa Clara University annually collects information about campus crimes and other reportable incidents in accordance with the federal Jeanne Clery Disclosure of Campus Security Policy and Campus Crime Statistics Act. To view the Santa Clara University report, please visit the Campus Safety Services website. To request a paper copy, please call Campus Safety. The report includes the type of crime, venue, and number of occurrences.

Americans with Disabilities Act

Consistent with its obligations under the law, Santa Clara University will provide reasonable accommodations to applicants and employees with disabilities. Applicants who wish to request a reasonable accommodation for any part of the application or hiring process should contact the Department of Human Resources ADA Team.


Required Experience:

Unclear Seniority


Key Skills

  • Active Directory
  • VMware
  • Computer Networking
  • Microsoft Windows Server
  • Solaris
  • Windows
  • Linux
  • SAN
  • Shell Scripting
  • System Administration
  • DNS
  • CentOS

About Company


Headquartered in the most innovative place on earth, SCU is a private Jesuit university in Santa Clara, California, in Silicon Valley. Consistently recognized as one of the top universities in the nation, Santa Clara offers bachelor's, master's, and doctoral degrees through its six co ...