Master Principal Cloud Engineer – GPU & AI Infrastructure

Oracle

Job Location:

Beijing - China

Monthly Salary: Not Disclosed

Posted on: 30+ days ago

Vacancies: 1 Vacancy

Job Summary

Description

Position Overview

As a GPU Specialist Cloud Engineer (CE) within the Oracle Cloud Infrastructure (OCI) Pre-Sales organization, you will serve as the primary technical authority for high-performance computing (HPC) and Artificial Intelligence infrastructure. You are not just a generalist; you are the bridge between complex silicon capabilities and transformative business outcomes.

You will partner with Enterprise Sales teams to lead the technical discovery, architectural design, and proof-of-concept (PoC) execution for customers building the next generation of Large Language Models (LLMs), generative AI applications, and computationally intensive simulations. This role requires a deep understanding of NVIDIA/AMD hardware stacks, RDMA networking, and the software orchestration layers that make massive-scale GPU clusters hum.

Core Responsibilities

1. Strategic Technical Advisory

Architectural Design: Design end-to-end AI infrastructure solutions on OCI, focusing on Superclusters that leverage NVIDIA H200/B300/GB300 or AMD Instinct accelerators.
Optimization: Advise customers on right-sizing GPU shapes based on workload requirements (e.g., training vs. inference, FP8 vs. FP16 precision).
Networking Excellence: Design high-throughput, low-latency interconnect fabrics using RoCE v2 (RDMA over Converged Ethernet) and OCI's non-blocking leaf-spine architecture.

2. Hands-on Execution & Validation

Proof of Concept (PoC): Lead deep-dive technical evaluations demonstrating OCI's superior price-performance ratios for model training and fine-tuning.
Stack Integration: Assist customers in deploying and optimizing the NVIDIA AI Enterprise stack, Triton Inference Server, and NeMo Framework on OCI.
Performance Tuning: Work directly with engineering teams to troubleshoot bottlenecks, whether they reside in the kernel, the NCCL (NVIDIA Collective Communications Library) configuration, or the storage IOPS.

3. Thought Leadership & Enablement

Content Creation: Develop whitepapers, reference architectures, and blog posts detailing OCI's competitive advantages in the AI, sovereign cloud, and private AI spaces.
Market Intelligence: Stay ahead of the curve on the evolving landscape of AI accelerators, interconnects (InfiniBand vs. Ethernet), and distributed training frameworks (PyTorch, JAX, DeepSpeed).

Responsibilities

Required Technical Competencies

Domain	Expertise Required
GPU Architecture	Deep knowledge of CUDA cores, Tensor Cores, HBM3 memory, and NVLink/NVSwitch topologies.
Networking	Mastery of RDMA, RoCE, and high-speed fabric management for multi-node distributed training.
Storage	Experience with high-performance parallel file systems like Lustre, Weka, or OCI's High-Performance Storage for feeding data to GPUs at scale.
Orchestration	Proficiency in Kubernetes (OKE) for AI, Slurm for batch job scheduling, and NVIDIA GPU Operator.
AI Frameworks	Hands-on experience with PyTorch, TensorFlow, and libraries for distributed computing like Megatron-LM.

Candidate Qualifications

Education: Bachelor's or Master's degree in Computer Science, Electrical Engineering, or a related quantitative field.
Experience: 10 years in Pre-Sales Engineering, Systems Architecture, or HPC. At least 3 years specifically focused on GPU-accelerated computing.
The OCI Edge: Familiarity with OCI's Off-Box virtualization and how it enables Bare Metal performance in a cloud environment.
Communication: The ability to explain the difference between latency and throughput to a CTO while being able to debug a Python script with a Data Scientist.

Qualifications

Career Level - IC5

Required Experience:

Staff IC

About Company

Oracle

As a world leader in cloud solutions, Oracle uses tomorrow’s technology to tackle today’s challenges. We’ve partnered with industry leaders in almost every sector, and continue to thrive after 40+ years of change by operating with integrity. We know that true innovation starts when eve ...
