HPC SRE Systems Engineer

Job Location

Dearborn, MI - USA

Monthly Salary

Not Disclosed

Vacancy

1 Vacancy

Job Description

Description

We are seeking a highly skilled and motivated HPC SRE Systems Engineer to join our growing team. You will be responsible for designing, building, and maintaining the HPC and SRE infrastructure that our platform depends on for daily operation, ensuring optimal performance and reliability for our critical applications. This role will also focus on automating deployments of our infrastructure and monitoring stack, leveraging CI/CD and IaC. If you are interested in engaging with a dynamic HPC stack and being a driving force for the resiliency of our platform, this position could be a good fit for you.



Responsibilities

What you'll do...

  • Design, implement, and maintain a robust and scalable HPC infrastructure to support containerized AI/ML workloads across traditional HPC and Kubernetes environments.
  • Implement monitoring solutions to ensure health and availability of critical infrastructure and applications.
  • Develop automation for repeatable and resilient infrastructure deployments.
  • Troubleshoot and resolve complex technical issues related to Linux systems, networking, storage, and HPC applications.
  • Develop and maintain documentation for software and procedures.
  • Collaborate with software engineers and researchers to ensure seamless integration of HPC resources and scaling of applications.
  • Stay up to date on the latest advancements in HPC and AI/ML technologies and best practices.


Qualifications

You'll have...

  • Associate's degree in Computer Science, Engineering, or equivalent work experience.
  • 5 years of experience in systems or software engineering.
  • Strong understanding of Linux operating systems, preferably in an HPC environment.
  • Proficiency programming in one or more languages, preferably Go, Python, or Bash scripting.
  • Familiarity with how to scale applications and with the metrics collection, analysis, and visualization tools used to identify bottlenecks, such as Prometheus and Grafana.
  • Excellent problem-solving and troubleshooting skills, including the ability to define what problems need to be solved.
  • Strong communication and collaboration skills.

Even better, you may have...

  • Experience with containerization technologies like Docker or Kubernetes.
  • Experience with automation tools like Ansible, Puppet, or Chef.
  • Experience with monitoring tools like Prometheus, Icinga, Nagios, or Elasticsearch.


Employment Type

Full-Time
