In this role, you'll make an impact in the following ways:
- Be hands-on with enterprise-grade NVIDIA AI infrastructure, supporting GPU-based compute, high-performance storage, and network systems designed for ML/AI at scale.
- Deploy, monitor, and troubleshoot containerized AI workloads using Kubernetes, Docker, and GPU orchestration tools like Run:AI and NVIDIA BCM.
- Own the observability of our AI platforms: monitor health, identify performance bottlenecks, and make strategic recommendations to drive platform reliability and maturity.
- Automate infrastructure operations and provisioning using Python, Bash, and tools like Terraform or Ansible to reduce manual toil and accelerate experimentation.
- Maintain and scale AI training and inference pipelines, integrating infrastructure workflows into CI/CD systems to enable seamless, automated deployment of AI workloads.
To be successful in this role, we're seeking the following:
- Bachelor's degree in computer science or a related discipline, or equivalent work experience, required; advanced degree preferred. 8-10 years of related experience required; experience in the securities or financial services industry is a plus.
- Experience with Linux administration (RHEL/Ubuntu), shell scripting, and system-level debugging.
- Proven experience running distributed systems in Kubernetes and containerized environments using Docker.
- Familiarity with GPU resource management including NVIDIA GPU Operator and device plugin lifecycle.
- Experience with CI/CD workflows and infrastructure automation tools such as GitLab CI, Jenkins, Terraform, Helm, or Ansible.
- Knowledge of networking fundamentals and persistent storage systems.
- Exposure to cloud platforms (AWS, GCP, Azure) and hybrid GPU environments.
- Ability to read and support Python code focused on ML/AI pipeline integration.
- Strong analytical and troubleshooting skills with a collaborative mindset.
- Effective communication skills and proactive ownership of platform reliability and performance.
Regards,
Mohammed Ilyas
PH - or Text - or you can share the updated resume at com
Additional Information:
All your information will be kept confidential according to EEO guidelines.
Remote Work:
No
Employment Type:
Full-time