As a DevOps Engineer Intern for Singtel's GPU Cloud, you will help implement processes and integrate operations to advance customers' AI and HPC capabilities. You will be exposed to both physical data center implementation and software solutions in the Singtel RE:AI GPU Cloud. This position requires a forward-thinking individual who thrives in dynamic environments and is committed to driving continuous improvement in GPU infrastructure for AI and HPC environments. This is an excellent opportunity for someone eager to start their career in DevOps and grow their expertise in AI and HPC cloud platforms.
Responsibilities
- Assist in deploying and supporting GPU clusters for AI and ML workloads.
- Support automation tasks for provisioning GPU resources in on-prem and cloud platforms.
- Learn and contribute to CI/CD pipeline setup for AI models and GPU-accelerated applications.
- Monitor basic cluster usage, health, and performance under supervision.
- Assist in automating infrastructure provisioning and monitoring.
- Support troubleshooting of system-level issues (e.g. Slurm, Kubernetes, GPU drivers, CUDA, IB networking) with guidance from senior engineers.
- Participate in system benchmarking and stay updated on advancements in GPU technologies.
- Help set up monitoring and logging tools (e.g. Zabbix, Prometheus, NVIDIA DCGM).
- Learn and apply basic security practices in a multi-tenant GPU cloud environment.
- Collaborate with senior engineers and administrators to streamline workflows.
- Provide user support under supervision for GPU-accelerated systems.
- Work closely with senior DevOps engineers to identify bottlenecks and improve processes.
- Gain hands-on learning experience in high-performance distributed computation for AI and HPC workloads.
Requirements
- Currently pursuing a Bachelor's degree in Computer Science/Engineering, Information Technology, Systems Engineering, or a related field.
- Basic knowledge of Linux system administration (Ubuntu, CentOS, Rocky Linux, etc.) through coursework or personal projects.
- Exposure to DevOps tools such as Jenkins, Kubernetes, Ansible, or Terraform.
- Understanding of core DevOps concepts (e.g. CI/CD, automation, monitoring) with willingness to learn further.
- Familiarity with scripting languages (Python, Bash) for simple tasks or assignments.
- Exposure to monitoring solutions such as Zabbix or Prometheus is a plus.
- Interest in AI frameworks such as TensorFlow or PyTorch, with coursework or project experience preferred.
- Awareness of cloud architectures (IaaS, PaaS) and GPU technologies, including NVIDIA GPUs.
- Good verbal and written communication skills in English.
- Collaborative mindset and ability to work effectively in a team environment.
- Strong interest in developing problem-solving and analytical skills for system optimization.
Desirable qualifications
- Understanding of how collective communications (MPI, RDMA, and NCCL) work, as well as how GPU-specific acceleration works on a GPU cluster.
- Knowledge of DevOps/MLOps technologies in GPU clusters, such as Docker/containers, Kubernetes, and data center deployments.
- Understanding of AI and HPC networking technologies such as InfiniBand, RoCE, and DPUs.
- Understanding of how AI and HPC workloads interact with both GPU hardware and software infrastructure.
Required Experience:
Intern