COMPANY OVERVIEW
XCEL Engineering, Inc. is an award-winning small business that provides trusted information technology, engineering consulting, and project management solutions and services to federal agencies and organizations. Originally founded in 1971 by professional engineers at the University of Tennessee, XCEL was acquired in 2003 by U.S. Army and Navy veterans and in 2023 became a MartinFed company.
XCEL Engineering is a part of IT Lab Partners (ITLP), which was created to support a leading research facility in the East Tennessee region in recruiting the best and the brightest technical talent. Consider joining our impressive team today!
JOB OVERVIEW
Xcel Engineering is seeking a Senior HPC Storage Systems Engineer to design, operate, and maintain storage for clusters, servers, and workstations, supporting services where science happens at ORNL! This position resides in the Emerging Technologies & Computing team in the Research Computing group in the Information Technology Services Directorate at Oak Ridge National Laboratory (ORNL).
The Emerging Technology Computational Group facilitates goals through HPC systems engineering, integration, and support for the research community. By providing design, deployment, optimization, monitoring, and tooling support across multiple clustered storage infrastructures, we facilitate Lab-wide R&D projects. Our HPC clusters range in scope from just a handful of nodes to over fifty thousand cores.
We partner with ORNL research organizations to enable research excellence and delivery. We work with other clustered computing and HPC groups to help research programs identify the best solutions for their needs. When we build our customers' environments, our team collaborates to design, implement, and maintain the systems from inception to retirement.
ESSENTIAL FUNCTIONS
- Architect, deploy, and manage large-scale HPC storage systems, including parallel file systems such as Lustre, GPFS/Spectrum Scale, BeeGFS, and WEKA.
- Design, implement, and operate large-scale Ceph storage clusters for HPC and research workloads, delivering reliable, high-performance object, block, and file storage services.
- Ensure the availability, performance, scalability, and security of production storage environments.
- Administer and optimize enterprise storage platforms such as Qumulo and NetApp in support of HPC and research workloads.
- Design, deploy, and maintain archival storage solutions, including Spectra Logic BlackPearl and large-scale tape libraries, to ensure long-term data preservation and accessibility.
- Integrate high-performance, enterprise, and archival storage layers into cohesive, tiered storage architectures that balance cost, scalability, and performance for diverse scientific workflows.
- Leverage automation and monitoring solutions to minimize day-to-day maintenance while identifying opportunities to optimize system performance and management.
- Collaborate with researchers and technical POCs to support large data workflows and optimize I/O performance for scientific workloads.
- Automate storage provisioning, monitoring, and maintenance using scripting and configuration management tools.
- Diagnose and resolve complex storage and I/O-related issues in high-throughput, low-latency HPC environments.
- Evaluate emerging storage technologies (NVMe, object storage, hierarchical storage management, burst buffers) and contribute to strategic planning for future HPC systems.
- Work with 24/7 operations staff to streamline monitoring and troubleshooting, significantly reducing the need for off-hours support.
- Deliver ORNL's mission by aligning behaviors, priorities, and interactions with our core values of Impact, Integrity, Teamwork, Safety, and Service. Promote equal opportunity by fostering a respectful workplace.
BASIC QUALIFICATIONS
- A BS degree in computer science, computer engineering, information technology, information systems, science, engineering, or a related discipline and 8-12 years of relevant professional experience; or an equivalent combination of education and experience.
- Master's degree holders: 7-10 years of relevant experience.
- PhD holders: 4-6 years of relevant experience.
- Five (5) or more years managing UNIX/Linux systems.
- Demonstrated experience managing HPC storage and large-scale enterprise storage systems.
- Three (3) or more years working with configuration management and automation tools such as Git, Jenkins, Ansible, or Puppet.
- Proficiency with at least one scripting language (Bash, Python, Perl, etc.).
- Strong Linux administration and advanced troubleshooting experience.
- Experience supporting large data systems and/or HPC scientific workloads.
- Strong desire to innovate and evaluate new technologies for HPC and storage environments.
- Collaborative approach and ability to become a trusted advisor to research teams.
DESIRED QUALIFICATIONS
- Active DOE Q, DoD Top Secret, or TS/SCI clearance is strongly preferred.
- Solid understanding of multiple operating systems and HPC cluster technologies.
- Experience with Rocky/CentOS/RHEL, Ubuntu, and VMware.
- Understanding of HPC job schedulers (SLURM) and user support workflows.
- Experience with container technologies in HPC environments.
- Experience with multiple system deployment mechanisms (Warewulf, PXEboot, Cobbler, Bright).
- Experience with GPU clusters (NVIDIA, AMD) for AI/ML and scientific workloads.
- Deep expertise with high-performance parallel file systems (Lustre, GPFS/Spectrum Scale, BeeGFS, WEKA).
- Knowledge of storage networking (InfiniBand, NVMe-oF, SAN/NAS architectures).
- Familiarity with RAID, ZFS, and object storage technologies.
- Strong background in performance monitoring, benchmarking, and I/O optimization.
- Experience with monitoring systems such as Grafana, CheckMK, Nagios, Zabbix, and Ganglia.
- Previous experience working in a government, scientific, or other highly technical environment.
- Strong documentation skills and ability to prepare web-based documentation.
PHYSICAL REQUIREMENTS & ENVIRONMENTAL CONDITIONS
- Inside office environment.
- Working on a computer for long periods of time.
- May involve long periods of sitting at a desk.
- The work environment is fast-paced and sometimes involves extreme deadline pressures.
OTHER DUTIES
This job description is not designed to cover or contain a comprehensive listing of activities, duties, or responsibilities that are required of the employee for this job. Duties, responsibilities, and activities may change at any time with or without notice.
Xcel Engineering is an Equal Opportunity/Affirmative Action Employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, religious creed, gender, sexual orientation, gender identity, gender expression, transgender, pregnancy, marital status, national origin, ancestry, citizenship status, age, disability, protected Veteran Status, genetics, or any other characteristics protected by applicable federal, state, or local law.
If you are a qualified individual with a disability or a disabled veteran, you have the right to request a reasonable accommodation if you are unable or limited in your ability to use or access Xcel Engineering's current openings as a result of your disability. You can request reasonable accommodations by calling 855.212.1810. Thank you for your interest in Xcel Engineering.
All positions at Xcel Engineering Inc. are contingent upon passing both a background check and drug screening prior to a start date and are subject to random drug screenings during employment. In addition, Xcel Engineering is an E-Verify employer.
Required Experience:
Senior IC