Operations Manager, GPUaaS

Singtel

Job Location:

Singapore - Singapore

Monthly Salary: Not Disclosed
Posted on: 13 hours ago
Vacancies: 1 Vacancy

Job Summary

About Singtel Digital InfraCo RE:AI

Singtel Digital InfraCo's RE:AI division is building Asia's most advanced and sustainable AI infrastructure ecosystem. RE:AI enables enterprises, research institutions, and digital-native businesses to accelerate innovation through responsible, high-performance AI compute and connectivity solutions.

Be a Part of Something BIG!

The Operations Manager, GPU Operations, is responsible for leading the day-to-day operations of Singtel's GPU-as-a-Service (GPUaaS).

In this role, you will lead the operations team to ensure the highest levels of system uptime, availability, security, and performance, and to deliver a stable and reliable GPUaaS service that meets defined service level agreements and objectives (SLA/SLO).

You will also act as the primary point of contact for the GPU infrastructure engineering team, working closely with them to implement platform upgrades, observability enhancements, security features, and continuous operational improvements across the GPUaaS platform.

Make an Impact by

  • Serve as overall coordinator and primary point of contact for end-to-end GPU-as-a-Service (GPUaaS) operations, including data centre operations, reporting in accordance with the established organisational reporting structure.
  • Lead day-to-day GPU-as-a-Service and data centre operations, including hardware, environmental controls, networking, security, and software.
  • Act as technical manager, managing the operations team, vendors, and consultants to administer GPU-as-a-Service (GPUaaS) operations during regular operations and in emergency situations.
  • Coordinate with internal teams, vendors, and consultants for the operation and implementation of GPU-as-a-Service (GPUaaS) enhancements and related data centre initiatives.
  • Implement, validate, and continuously improve plans to ensure the highest levels of operational stability for GPU-as-a-Service (GPUaaS) data centre operations, including scenarios involving GPU cluster hardware, software, and related equipment, as well as data centre infrastructure failures such as power or cooling outages.
  • Lead the resolution of incidents impacting GPU-as-a-Service (GPUaaS) environments, including GPU cluster hardware, software, and related equipment, as well as data centre infrastructure failures such as power or cooling outages; perform root cause analysis (RCA) as required and ensure findings are reported to customers and internal stakeholders in an appropriate and timely manner.
  • Present GPU-as-a-Service (GPUaaS) operational status and plans, including data centre operations, to senior management and relevant stakeholders.
  • Ensure incidents are responded to and attended to, or escalated for resolution, based on criticality, impact, and SLA.
  • Build and lead a high-performing operations team and foster a culture of innovation, collaboration, and continuous improvement.
  • Set clear goals and objectives, mentor team members, and drive professional development initiatives.
  • Lead security incident management processes, focusing on identification, containment, and resolution of threats.
  • Enforce best practices for security and compliance within the GPU-as-a-Service (GPUaaS) environment.
  • Stay abreast of industry security trends and implement measures to safeguard customer data and platform integrity.
  • This role may require availability outside standard work hours, including nights, weekends, and public holidays.

Skills for Success

  • Bachelor's degree in Computer Science, Information Technology, or a related field.
  • Minimum of 8 years in data centre operations and management with at least 3 years in a leadership/managerial position.
  • Knowledge and experience in data centre infrastructure, including servers, networking, storage, and physical and cyber security. Knowledge of and experience with GPU clusters is a bonus.
  • Well versed in the maintenance and upkeep of various equipment, including electrical and mechanical equipment.
  • Experience in leadership/managerial roles with excellent team management skills.
  • Organised and adaptable to changes in work schedules and arrangements.
  • Strong interpersonal, professional communication, and presentation skills.
  • Proficiency in managing customer interactions and improving service delivery to enhance customer experience.

Desirable qualifications

  • Experience in Linux and hypervisor administration for GPU infrastructure and GPUaaS.
  • Ability to solve complex technical problems, with a proactive approach to system operation and optimisation.
  • Knowledge of storage technologies and experience in capacity planning, troubleshooting, and data protection.
  • Experience in GPU and GPU infrastructure management, including configuration, monitoring, and performance.
  • Experience operating and monitoring liquid cooling systems for GPU infrastructure.
  • Understanding of GPU cluster architectures and operations, including GPU-based systems, collective communications (e.g. NCCL, RDMA), AI/HPC networking (e.g. InfiniBand), and containerised or orchestrated environments supporting AI and HPC workloads.

Rewards that Go Beyond

  • Flexible work arrangements
  • Full suite of health and wellness benefits
  • Ongoing training and development programs
  • Internal mobility opportunities

Your Career Growth Starts Here. Apply Now!


Required Experience:

Manager

Key Skills

  • Six Sigma
  • Lean
  • Management Experience
  • Process Improvement
  • Microsoft Outlook
  • Analysis Skills
  • Warehouse Management System
  • Operations Management
  • Kaizen
  • Leadership Experience
  • Supervising Experience
  • Retail Management

About Company

The Singtel Group, Asia's leading communications group, provides a diverse range of services including fixed, mobile, data, internet, TV, infocomms technology (ICT), and digital solutions.
