Be a Part of Something BIG!
Make an Impact by
Leading and managing the GPU Infrastructure-as-a-Service (IaaS) platform. This role oversees the GPU infrastructure, storage infrastructure, and associated services, ensuring seamless integration and operation.
Infrastructure and Resource Management:
- Manage the maintenance and operations of the liquid-cooled data centre that hosts the GPU cloud.
- Optimize the GPU infrastructure and associated hardware.
- Optimize resource allocation to meet both the performance requirements of data centre and cloud hardware operations and cost-effectiveness goals.
- Lead the operations team to ensure compliance with the SLA requirements of customers and the product.
- Enhance system scalability and reliability through automation and continuous improvement.
- Enforce industry-standard operational processes, with reference to standards such as ISO 27001 or equivalent, in data centre and cloud operations.
Operational Excellence:
- Handle general incidents, including operations management and escalation management, across the AI cloud product.
- Develop and implement operational strategies to ensure the reliability and efficiency of our GPU Cloud infrastructure.
- Collaborate with other departments to streamline processes, enhance customer experience, and meet service level agreements.
- Support services and improve the lifecycle of GPU cloud hardware and the data centre environment with monitoring, logging, and alerting through deployment, operation, and refinement.
- Establish operations systems and processes (SOPs, EOPs, etc.) and manage daily operational issues.
- Possess a strong operational management skill set, including organising internal cross-functional teams and external vendors to ensure an efficient and resilient operations setup.
Team Management:
- Build and lead a high-performing operations team, and foster a culture of innovation, collaboration, and continuous improvement.
- Set clear goals and objectives, mentor team members, and drive professional development initiatives.
- Oversee resource management and allocation to optimize team productivity and effectively meet operational goals.
Security and Compliance:
- Lead security incident management processes, focusing on identification, containment, and resolution of threats in the data centre environment and GPU cloud hardware.
- Enforce best practices for security and compliance.
- Stay abreast of industry security trends and implement measures to safeguard customer data and platform integrity.
Skills for Success
- Proven track record of managing and escalating complex cloud and data centre infrastructure issues and leading operations teams.
- Experience in liquid cooling operations is a plus.
- Strong understanding of hardware infrastructure operation, security management, and best practices.
- Excellent leadership, communication, and interpersonal skills, with the ability to lead cross-functional teams.
- Proficiency in managing customer interactions and improving service delivery to enhance customer experience.
- Experience in Linux and hypervisor administration for GPU infrastructure and cloud.
- Ability to solve complex technical problems, with a proactive approach to system operation and optimization.
- Knowledge of storage technologies and experience in capacity planning, troubleshooting, and data protection.
- Experience in GPU and GPU infrastructure management, including configuration, monitoring, and performance.
Rewards that Go Beyond
- Flexible work arrangements
- Full suite of health and wellness benefits
- Ongoing training and development programs
- Internal mobility opportunities
Your Career Growth Starts Here. Apply Now!
Required Experience:
Director