Master Principal Cloud Engineer – GPU & AI Infrastructure

Oracle

Job Location:

Beijing - China

Monthly Salary: Not Disclosed
Posted on: 2 days ago
Vacancies: 1 Vacancy

Job Summary

Description

Position Overview

As a GPU Specialist Cloud Engineer (CE) within the Oracle Cloud Infrastructure (OCI) Pre-Sales organization, you will serve as the primary technical authority for high-performance computing (HPC) and Artificial Intelligence infrastructure. You are not just a generalist; you are the bridge between complex silicon capabilities and transformative business outcomes.

You will partner with Enterprise Sales teams to lead the technical discovery, architectural design, and proof-of-concept (PoC) execution for customers building the next generation of Large Language Models (LLMs), generative AI applications, and computationally intensive simulations. This role requires a deep understanding of NVIDIA/AMD hardware stacks, RDMA networking, and the software orchestration layers that make massive-scale GPU clusters hum.

Core Responsibilities

1. Strategic Technical Advisory

  • Architectural Design: Design end-to-end AI infrastructure solutions on OCI, focusing on Superclusters that leverage NVIDIA H200/B300/GB300 or AMD Instinct accelerators.
  • Optimization: Advise customers on right-sizing GPU shapes based on workload requirements (e.g., training vs. inference, FP8 vs. FP16 precision).
  • Networking Excellence: Design high-throughput, low-latency interconnect fabrics using RoCE v2 (RDMA over Converged Ethernet) and OCI's non-blocking leaf-spine architecture.

2. Hands-on Execution & Validation

  • Proof of Concept (PoC): Lead deep-dive technical evaluations demonstrating OCI's superior price-performance ratios for model training and fine-tuning.
  • Stack Integration: Assist customers in deploying and optimizing the NVIDIA AI Enterprise stack, Triton Inference Server, and NeMo Framework on OCI.
  • Performance Tuning: Work directly with engineering teams to troubleshoot bottlenecks, whether they reside in the kernel, the NCCL (NVIDIA Collective Communications Library) configuration, or the storage IOPS.

3. Thought Leadership & Enablement

  • Content Creation: Develop whitepapers, reference architectures, and blog posts detailing OCI's competitive advantages in the AI, sovereign cloud, and private AI spaces.
  • Market Intelligence: Stay ahead of the curve on the evolving landscape of AI accelerators, interconnects (InfiniBand vs. Ethernet), and distributed training frameworks (PyTorch, JAX, DeepSpeed).


Responsibilities

Required Technical Competencies

Domain: Expertise Required

  • GPU Architecture: Deep knowledge of CUDA cores, Tensor Cores, HBM3 memory, and NVLink/NVSwitch topologies.
  • Networking: Mastery of RDMA, RoCE, and high-speed fabric management for multi-node distributed training.
  • Storage: Experience with high-performance parallel file systems like Lustre, Weka, or OCI's High-Performance Storage for feeding data to GPUs at scale.
  • Orchestration: Proficiency in Kubernetes (OKE) for AI, Slurm for batch job scheduling, and NVIDIA GPU Operator.
  • AI Frameworks: Hands-on experience with PyTorch, TensorFlow, and libraries for distributed computing like Megatron-LM.

Candidate Qualifications

  • Education: Bachelor's or Master's degree in Computer Science, Electrical Engineering, or a related quantitative field.
  • Experience: 10 years in Pre-Sales Engineering, Systems Architecture, or HPC. At least 3 years specifically focused on GPU-accelerated computing.
  • The OCI Edge: Familiarity with OCI's Off-Box virtualization and how it enables Bare Metal performance in a cloud environment.
  • Communication: The ability to explain the difference between latency and throughput to a CTO while being able to debug a Python script with a Data Scientist.



Qualifications

Career Level - IC5




Required Experience:

Staff IC

Key Skills

  • Jenkins
  • Ruby
  • Python
  • Active Directory
  • Cloud
  • PowerShell
  • Windows
  • AWS
  • Linux
  • SAN
  • Java
  • Troubleshoot
  • Backup
  • Puppet
  • hardware

About Company

As a world leader in cloud solutions, Oracle uses tomorrow’s technology to tackle today’s challenges. We’ve partnered with industry leaders in almost every sector and continue to thrive after 40+ years of change by operating with integrity. We know that true innovation starts when eve ...
