HPC Data Center Developer

Jump Trading

Job Location:

Chicago, IL - USA

Annual Salary: $150,000 - $200,000

Posted on: 30+ days ago

Vacancies: 1 Vacancy

Job Summary

HPC Data Center Production Engineer
Location: Chicago or New York (On-site 5 days/week)
Jump Trading Group is committed to world-class research. We empower exceptional talents in Mathematics, Physics, and Computer Science to seek scientific boundaries, push through them, and apply cutting-edge research to global financial markets. Our culture is unique. Constant innovation requires fearlessness, creativity, intellectual honesty, and a relentless competitive streak. We believe in winning together and unlocking unique individual talent by incenting collaboration and mutual respect. At Jump, research outcomes drive more than superior risk-adjusted returns. We design, develop, and deploy technologies that change our world, fund start-ups across industries, and partner with leading global research organizations and universities to solve problems.
Trading Infrastructure is a global organization of Engineers who architect, build, and maintain our world-class infrastructure. From colo design/implementation, to optimizing our exchange connectivity, to building world-class low-latency Wide Area Networks, we leverage research and automation to consistently adapt and innovate our infrastructure to scale and drive our trading and evolving business.
We are looking for an HPC Data Center Production Engineer to build and own the automation and tooling that powers Jump's HPC data center operations. This is a development-heavy role focused on automating the onboarding and lifecycle management of data center hardware (servers, switches, rack PDUs, CDUs, and environmental sensors) and building tools for capacity planning, outage simulation, monitoring, and metrics integration. You will work hand-in-hand with the HPC Planning, Engineering, and Operations leads to turn tooling and monitoring vision into production-ready systems. Heavy daily use of AI tools is expected to accelerate development and raise the quality bar on everything you build.
What You'll Do:
Hardware Onboarding Automation

Design, develop, and maintain automation to onboard new hardware devices into Jump's HPC data centers, including servers, network switches, rack PDUs, CDUs, and environmental sensors.
Build end-to-end provisioning workflows that take hardware from racked-and-cabled through discovery, configuration, validation, and production-ready state with minimal manual intervention.
Extend and adapt onboarding automation as new hardware platforms and device types are introduced.
Data Center Tooling Development
Develop tools for power and cooling capacity planning, enabling the operations and planning teams to model current utilization, forecast growth, and identify constraints before they become problems.
Build outage simulation tooling to model the impact of power, cooling, or network failures across HPC facilities and validate redundancy/failover configurations.
Develop and maintain operational tooling that supports day-to-day data center workflows such as hardware lifecycle tracking, data center inventory/spares, change management, and diagnostics.
Monitoring & Metrics Integration
Build and maintain monitoring integrations for HPC data center infrastructure, pulling telemetry from servers, switches, PDUs, CDUs, environmental sensors, and facility systems into centralized observability platforms.
Integrate metrics feeds from colocation and data center providers into Jump's monitoring stack, normalizing data for alerting and capacity reporting.
Work with the Operations Lead to implement the monitoring and alerting strategy, translating requirements into deployed, production-grade instrumentation.
Cross-Team Collaboration
Work very closely with the HPC Planning, Engineering, and Operations leads to understand tooling and monitoring needs and bring their vision to fruition.
Partner with HPC Engineering on integration points between data center automation and compute/storage/network provisioning systems.
Translate operational pain points and manual processes into automated, maintainable solutions.
Systems Maintenance & Reliability
Own the reliability and lifecycle of all systems and tools you develop: monitor for failures, respond to issues, and iterate based on operational feedback.
Maintain comprehensive documentation for all tooling, automation workflows, and integrations.
Participate in large, coordinated maintenance operations, including during evenings and weekends.
AI-Driven Development
Use AI tools daily across all aspects of the role: writing and reviewing code, analyzing data, debugging, generating documentation, and accelerating development velocity.
Identify opportunities to apply AI to data center operations problems: anomaly detection, predictive capacity planning, intelligent alerting, and beyond.
Additional duties as assigned or needed.
Skills You'll Need:
5 years of professional experience in production engineering, infrastructure automation, or site reliability engineering, preferably in HPC or large-scale data center environments.
Proven track record of building and shipping production automation and tooling: not just scripts, but maintained, reliable systems.
Experience automating hardware provisioning and lifecycle management (servers, network devices, power/cooling infrastructure).
Strong understanding of data center infrastructure: power distribution, cooling systems (air and liquid), environmental monitoring, and structured cabling.
Experience integrating with hardware management interfaces (IPMI/BMC/Redfish, SNMP, vendor APIs) for discovery, configuration, and telemetry collection.
Demonstrates a high level of energy, is results-driven, and is able to work under pressure with tight deadlines.
Technical Skills:
High proficiency in Golang and at least one additional language (e.g., Python). You will write a lot of code in this role.
Strong Linux systems knowledge; you should live in Linux. Proficient with system administration, networking, storage, process management, log analysis, and troubleshooting at the OS level.
Experience with Grafana for building dashboards, alerting, and visualization of infrastructure metrics. Experience with Prometheus, InfluxDB, or similar observability platforms and building custom integrations/exporters.
Experience with configuration management and infrastructure-as-code tools (SaltStack, Ansible, Terraform, or similar).
Solid understanding of networking concepts: L2/L3 protocols, VLANs, BGP, SNMP, and switch/router configuration (Arista, Cisco).
Experience with APIs and data integration: consuming vendor APIs, normalizing heterogeneous data sources, and building data pipelines for metrics and reporting.
Experience with ClickHouse and MySQL: writing queries, designing schemas, and building tooling that reads from and writes to these databases.
Experience with GitHub for version control, code review, CI/CD workflows, and collaborative development.
Demonstrated heavy use of AI tools (e.g., LLM-based coding assistants, AI-driven analytics) in a professional setting. You should already be using AI daily and be eager to push its application further.
A compulsion to perform root cause analysis.
Excellent written and verbal communication skills, with the ability to work across a global engineering team.
Extremely high personal standards for work quality.
Reliable and predictable availability, including ability to work evenings and weekends as required.
Bachelor's degree preferred.

Benefits

- Discretionary bonus eligibility
- Medical, dental, and vision insurance
- HSA, FSA, and Dependent Care options
- Employer Paid Group Term Life and AD&D Insurance
- Voluntary Life & AD&D insurance
- Paid vacation plus paid holidays
- Retirement plan with employer match
- Paid parental leave
- Wellness Programs

Annual Base Salary Range

$150,000 - $200,000 USD
