AI Site Reliability Engineer (W2)

Job Location

Reston, VA - USA

Monthly Salary

Not Disclosed

Vacancy

1 Vacancy

Job Description

Role: AI Site Reliability Engineer (W2)

Location: Reston, Virginia

Skills: SRE, NVIDIA (DGX), Python, Ansible, Terraform, Site Reliability Engineer, Linux

No. of Positions: 2

Location: Remote

Notice period: 2 weeks

Visa: Any (Except OPT and CPT)

Note: At least 1 or 2 resumes are needed by end of day today; please submit profiles promptly.

Your Role as an AI Site Reliability Engineer

We are building, developing, and expanding our artificial intelligence platforms, which will empower the business to fundamentally change the world. You will be an AI Site Reliability Engineer in the IT Infrastructure Services organization. You will use SRE mechanisms to reduce toil and maintain Service Level Objectives (SLOs) for our internal NVIDIA DGX and Cisco UCS based AI platforms. You will lead, build, and run fully automated pipelines through our Continuous Integration/Continuous Delivery (CI/CD) system to deliver operational capabilities and improvements.

Responsibilities include

  • Technical knowledge of high-performance compute: NVIDIA DGX/GPUs and/or Cisco Unified Computing System (UCS).
  • Handle availability, latency, scalability, and efficiency of NVIDIA and Cisco UCS infrastructure by instilling engineering reliability into the development life cycle, with a focus on fault-tolerant approaches.
  • Drive capacity planning, performance analysis, instrumentation, and other non-functional systems requirements.
  • Automate operational capabilities using Python, Ansible, Terraform, Go, etc.
  • Deliver automation through CI/CD pipelines, chatbots, etc.
  • Implement metrics-driven processes to ensure service quality targets are met.

Who You Are

You are an experienced Site Reliability Engineer for high-performance compute, artificial intelligence, machine learning, and/or integrated computer systems. You take a software engineering approach to solving operational problems. You know HPC and are familiar with Kubernetes. You have experience delivering software solutions and working with Linux operating systems. You understand IT infrastructure customers and are passionate about diving deep into problems and fixing them.

Our Minimum Requirements include:

  • Bachelor's degree in Computer Science, Information Technology, or a related field; or equivalent years of experience in information technology.
  • Experience deploying and administering NVIDIA (DGX) or equivalent high-performance-compute (HPC) clusters (e.g. Cray, HPE, IBM).
  • 5 years administering and supporting Linux-based operating systems.
  • Experience writing code in general-purpose programming languages such as Python, Golang, and C/C++, and using Git and CI/CD systems (e.g. GitLab, GitHub Actions, Jenkins).
  • Experience deploying enterprise-grade Kubernetes clusters (Red Hat OpenShift preferred) and/or Google Anthos.
  • Sophisticated knowledge of Kubernetes, Docker, Terraform, Ansible, Jenkins, GitOps, Git, and Linux.
  • Experience across the software development lifecycle, including design, development, testing, packaging, and deployment, using Python or Golang.

Preferred Qualifications

  • Master's degree or equivalent experience in a relevant field.
  • Certifications in Linux, networking, cloud, or related technologies.
  • Prior successful experience as a compute or site/systems reliability engineer.
  • Experience with Kubernetes, hybrid cloud, virtualization, and container technologies.
  • Experience with Agile and DevOps operating models, including project tracking tools (e.g. Jira, Rally).
  • Excellent collaborator who can partner, lead, guide, and communicate advanced technical concepts.

Employment Type

Full-time
