About Singtel Digital InfraCo RE:AI
Singtel Digital InfraCo's RE:AI division is building Asia's most advanced and sustainable AI infrastructure ecosystem. RE:AI enables enterprises, research institutions and digital-native businesses to accelerate innovation through responsible, high-performance AI compute and connectivity solutions.
Be a Part of Something BIG!
As a DevOps Engineer for Singtel's GPU-as-a-Service (GPUaaS), you will help implement processes and integrate operations to advance customers' AI and HPC capabilities. You will be exposed to both physical data center implementation and the software solutions behind Singtel's GPU-as-a-Service (GPUaaS). This position requires a forward-thinking individual who thrives in dynamic environments and is committed to driving continuous improvement in GPU infrastructure for AI and HPC environments. This is an excellent opportunity for someone eager to start their career in DevOps and grow their expertise in AI and HPC cloud platforms.
Responsibilities
- Design, deploy and support large-scale distributed GPU clusters for AI and ML workloads.
- Manage and automate the provisioning of GPU resources across both on-premises and cloud platforms.
- Design, implement and manage CI/CD pipelines for AI models and GPU-accelerated applications.
- Monitor cluster usage, health, performance and availability.
- Improve infrastructure provisioning, management and monitoring through automation.
- Troubleshoot system-level compute issues involving Slurm, Kubernetes, GPU drivers, CUDA and InfiniBand networking.
- Optimize system parameters (e.g. OS, drivers, networking, libraries) for AI workload performance.
- Conduct GPU cluster benchmarking and keep up with the latest advancements in GPU technology.
- Set up monitoring and logging for GPU resources using Zabbix, Prometheus, NVIDIA DCGM and other tools.
- Implement security best practices for a multi-tenant GPU-as-a-Service (GPUaaS) environment.
- Collaborate with software engineers and system administrators to streamline workflows and improve collaboration.
- Provide technical support and guidance to users of GPU-accelerated systems.
- Work with senior DevOps engineers to identify bottlenecks and improve development and operational processes for the AI and HPC GPU cloud.
- Learn to solve problems in high-performance distributed computing for the AI and HPC GPU cloud.
- This role may require availability outside standard work hours, including nights, weekends and public holidays.
Requirements
- Bachelor's degree in Computer Science/Engineering, Information Technology, Systems Engineering or a related field.
- Strong Linux system administration skills (Ubuntu, CentOS, Rocky Linux, etc.).
- Experience with DevOps tools such as Jenkins, Kubernetes, Ansible and Terraform.
- Solid understanding of DevOps practices, including CI/CD, automation and monitoring.
- Proficiency in scripting languages (e.g. Python, Bash).
- Experience implementing monitoring solutions such as Zabbix and Prometheus.
- Familiarity with AI frameworks such as TensorFlow and PyTorch.
- Understanding of cloud architectures (IaaS, PaaS), GPU architecture and NVIDIA GPUs.
- Strong verbal, written and presentation skills in English.
- Team player with experience in cross-functional coordination.
- Strong technical problem-solving and analytical skills for system optimization.
Desirable qualifications
- Understanding of how collective communications (MPI, RDMA and NCCL) work, as well as how GPU-specific acceleration works on a GPU cluster.
- Knowledge of DevOps/MLOps technologies for GPU clusters, such as Docker/containers, Kubernetes and data center deployments.
- Familiarity with Slurm or other HPC workload managers for managing GPU clusters.
- Understanding of AI and HPC networking technologies such as InfiniBand, RoCE and DPUs.
- System-level experience, specifically with GPU-based systems (NVIDIA GPUs and SDKs).
- Understanding of how AI and HPC workloads interact with both GPU hardware and software infrastructure.
Rewards that Go Beyond
- Flexible work arrangements
- Full suite of health and wellness benefits
- Ongoing training and development programs
- Internal mobility opportunities
Your Career Growth Starts Here. Apply Now!
Required Experience:
Senior IC