About Singtel Digital InfraCo RE:AI
Singtel Digital InfraCo's RE:AI division is building Asia's most advanced and sustainable AI infrastructure ecosystem. RE:AI enables enterprises, research institutions and digital-native businesses to accelerate innovation through responsible, high-performance AI compute and connectivity solutions.
Be a Part of Something BIG!
As a DevOps Engineer for Singtel's GPU-as-a-Service (GPUaaS), you will help implement processes and integrate operations to advance customers' AI and HPC capabilities. You will be exposed to both physical data center implementation and the software solutions behind Singtel's GPU-as-a-Service (GPUaaS). This position requires a forward-thinking individual who thrives in dynamic environments and is committed to driving continuous improvement in GPU infrastructure for AI and HPC environments. This is an excellent opportunity for someone eager to start their career in DevOps and grow their expertise in AI and HPC cloud platforms.
Responsibilities
- Design, deploy and support large-scale distributed GPU clusters for AI and ML workloads.
- Manage and automate the provisioning of GPU resources across both on-premises and cloud platforms.
- Design, implement and manage CI/CD pipelines for AI models and GPU-accelerated applications.
- Monitor cluster usage, health, performance and availability.
- Improve infrastructure provisioning, management and monitoring through automation.
- Troubleshoot system-level compute issues involving Slurm, Kubernetes, GPU drivers, CUDA and InfiniBand networking.
- Optimize system parameters (e.g. OS, drivers, networking, libraries) for AI workload performance.
- Conduct GPU cluster benchmarking and keep up with the latest advancements in GPU technology.
- Set up monitoring and logging for GPU resources using Zabbix, Prometheus, NVIDIA DCGM and other tools.
- Implement security best practices for a multi-tenant GPU-as-a-Service (GPUaaS) environment.
- Collaborate with software engineers and system administrators to streamline workflows and improve collaboration.
- Provide technical support and guidance to users of GPU-accelerated systems.
- Work with senior DevOps engineers to identify bottlenecks and improve development and operational processes for the AI and HPC GPU cloud.
- Learn to solve problems in high-performance distributed computing for the AI and HPC GPU cloud.
- This role may require availability outside standard work hours, including nights, weekends and public holidays.
Requirements
- Bachelor's degree in Computer Science/Engineering, Information Technology, Systems Engineering or a related field.
- Strong Linux system administration skills (Ubuntu, CentOS, Rocky Linux, etc.).
- Experience with DevOps tools such as Jenkins, Kubernetes, Ansible and Terraform.
- Solid understanding of DevOps practices, including CI/CD, automation and monitoring.
- Proficiency in scripting languages (e.g. Python, Bash).
- Experience implementing monitoring solutions such as Zabbix and Prometheus.
- Familiarity with AI frameworks such as TensorFlow and PyTorch.
- Understanding of cloud architectures (IaaS, PaaS), GPU architecture and NVIDIA GPUs.
- Strong verbal, written and presentation skills in English.
- Team player with experience in cross-functional coordination.
- Strong technical problem-solving and analytical skills for system optimization.
Desirable qualifications
- Understanding of how collective communications (MPI, RDMA and NCCL) work, as well as how GPU-specific acceleration works on a GPU cluster.
- Knowledge of DevOps/MLOps technologies for GPU clusters, such as Docker/containers, Kubernetes and data center deployments.
- Familiarity with Slurm or other HPC workload managers for managing GPU clusters.
- Understanding of AI and HPC networking technologies such as InfiniBand, RoCE and DPUs.
- System-level experience, specifically with GPU-based systems (NVIDIA GPUs and SDKs).
- Understanding of how AI and HPC workloads interact with both GPU hardware and software infrastructure.
Rewards that Go Beyond
- Flexible work arrangements
- Full suite of health and wellness benefits
- Ongoing training and development programs
- Internal mobility opportunities
Your Career Growth Starts Here. Apply Now!
Required Experience:
Senior IC