Operations Manager, GPUaaS

Singtel

Job Location:

Singapore - Singapore

Monthly Salary: Not Disclosed
Posted on: 30+ days ago
Vacancies: 1 Vacancy

Job Summary

About Singtel Digital InfraCo RE:AI

Singtel Digital InfraCo's RE:AI division is building Asia's most advanced and sustainable AI infrastructure ecosystem. RE:AI enables enterprises, research institutions, and digital-native businesses to accelerate innovation through responsible, high-performance AI compute and connectivity solutions.

Be a Part of Something BIG!

The Operations Manager, GPU Operations is responsible for leading the day-to-day operations of Singtel's GPU-as-a-Service (GPUaaS) platform. This role ensures high levels of system availability, performance, security, and reliability across GPU infrastructure and supporting data centre operations.

The role serves as the primary operational interface with GPU infrastructure engineering teams, collaborating on platform upgrades, observability, security enhancements, and continuous operational improvements.

Make an Impact by

  • Acting as the overall coordinator and primary point of contact for end-to-end GPUaaS operations, including data centre operations and operational reporting.
  • Leading daily GPUaaS and data centre operations, covering hardware, environmental controls, networking, security, and supporting software platforms.
  • Managing operations teams, vendors, and consultants during both normal operations and emergency situations.
  • Coordinating with internal teams and external partners to implement GPUaaS enhancements and data centre initiatives.
  • Implementing, validating, and continuously improving operational plans to ensure platform stability across GPU hardware, software, and data centre infrastructure (e.g. power and cooling).
  • Leading incident response and resolution for GPUaaS environments, including root cause analysis (RCA) and timely communication to customers and stakeholders.
  • Presenting operational status, risks, and improvement plans to senior management and relevant stakeholders.
  • Ensuring incidents are addressed or escalated in accordance with criticality, impact, and SLA/SLO requirements.
  • Building and leading a high-performing operations team, fostering collaboration, innovation, and continuous improvement.
  • Setting clear goals, mentoring team members, and supporting professional development.
  • Leading security incident management and enforcing security and compliance best practices within the GPUaaS environment.
  • Monitoring industry security trends and implementing measures to protect customer data and platform integrity.
  • Participating in scheduled or on-call support outside standard working hours as required.

Skills for Success

  • Bachelor's degree in Computer Science, Information Technology, or a related discipline.
  • Minimum of 8 years' experience in data centre operations and management, including at least 3 years in a leadership or managerial role.
  • Strong knowledge of data centre infrastructure, including servers, networking, storage, physical security, and cybersecurity.
  • Experience with electrical and mechanical systems maintenance and facilities operations.
  • Proven people leadership and vendor management capabilities.
  • Strong organisational skills and adaptability to changing operational requirements.
  • Effective interpersonal, communication, and presentation skills.
  • Experience managing customer interactions and driving service quality improvements.

Desirable qualifications

  • Experience in Linux and hypervisor administration for GPU infrastructure and GPUaaS.
  • Complex technical problem-solving skills with a proactive approach to system operation and optimization.
  • Knowledge of storage technologies and experience in capacity planning, troubleshooting, and data protection.
  • Experience in GPU and GPU infrastructure management, including configuration, monitoring, and performance
  • Experience with liquid cooling systems specific to GPU infrastructure operation and monitoring.
  • Understanding of GPU cluster architectures and operations, including GPU-based systems, collective communications (e.g. NCCL, RDMA), AI/HPC networking (e.g. InfiniBand), and containerized or orchestrated environments supporting AI and HPC workloads.

Rewards that Go Beyond

  • Flexible work arrangements
  • Full suite of health and wellness benefits
  • Ongoing training and development programs
  • Internal mobility opportunities

Your Career Growth Starts Here. Apply Now!


Required Experience:

Manager

Key Skills

  • Six Sigma
  • Lean
  • Management Experience
  • Process Improvement
  • Microsoft Outlook
  • Analysis Skills
  • Warehouse Management System
  • Operations Management
  • Kaizen
  • Leadership Experience
  • Supervising Experience
  • Retail Management

About Company

The Singtel Group, Asia's leading communications group, provides a diverse range of services, including fixed, mobile, data, internet, TV, infocomms technology (ICT), and digital solutions.
