Description
Position Overview
As a GPU Specialist Cloud Engineer (CE) within the Oracle Cloud Infrastructure (OCI) Pre-Sales organization, you will serve as the primary technical authority for high-performance computing (HPC) and artificial intelligence infrastructure. You are not just a generalist; you are the bridge between complex silicon capabilities and transformative business outcomes.
You will partner with Enterprise Sales teams to lead technical discovery, architectural design, and proof-of-concept (PoC) execution for customers building the next generation of Large Language Models (LLMs), generative AI applications, and computationally intensive simulations. This role requires a deep understanding of NVIDIA/AMD hardware stacks, RDMA networking, and the software orchestration layers that make massive-scale GPU clusters hum.
Core Responsibilities
1. Strategic Technical Advisory
- Architectural Design: Design end-to-end AI infrastructure solutions on OCI, focusing on Superclusters that leverage NVIDIA H200/B300/GB300 or AMD Instinct accelerators.
- Optimization: Advise customers on right-sizing GPU shapes based on workload requirements (e.g., training vs. inference, FP8 vs. FP16 precision); see the precision sketch after this list.
- Networking Excellence: Design high-throughput, low-latency interconnect fabrics using RoCE v2 (RDMA over Converged Ethernet) and OCI's non-blocking leaf-spine architecture.
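
To make the precision trade-off above concrete, here is a minimal PyTorch sketch of FP16 mixed-precision training with `torch.autocast`. The model and tensor shapes are hypothetical stand-ins, and true FP8 training typically goes through vendor libraries (e.g., NVIDIA Transformer Engine) rather than autocast; this is an illustrative sketch, not a prescribed OCI workflow.

```python
import torch
import torch.nn.functional as F

# Hypothetical stand-ins: a single linear layer and random data.
model = torch.nn.Linear(4096, 4096).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()  # loss scaling guards FP16 gradients

x = torch.randn(32, 4096, device="cuda")
target = torch.randn(32, 4096, device="cuda")

# Forward/backward under FP16 autocast: matmuls run on Tensor Cores at
# reduced precision, while loss scaling prevents gradient underflow.
with torch.autocast(device_type="cuda", dtype=torch.float16):
    loss = F.mse_loss(model(x), target)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
optimizer.zero_grad()
```

The same loop with `dtype=torch.bfloat16` usually drops the scaler, which is one reason right-sizing a shape depends on which precisions its accelerators support natively.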
2. Hands-on Execution & Validation
- Proof of Concept (PoC): Lead deep-dive technical evaluations demonstrating OCI's superior price-performance ratios for model training and fine-tuning.
- Stack Integration: Assist customers in deploying and optimizing the NVIDIA AI Enterprise stack, Triton Inference Server, and the NeMo Framework on OCI.
- Performance Tuning: Work directly with engineering teams to troubleshoot bottlenecks, whether they reside in the kernel, the NCCL (NVIDIA Collective Communications Library) configuration, or storage IOPS (see the bandwidth-probe sketch below).
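
As one concrete example of that kind of triage, here is a minimal sketch (assuming a `torchrun` launch; script name and sizes are illustrative) that times a large NCCL all-reduce. Abnormally low bus bandwidth points at the fabric or NCCL settings rather than the model code.

```python
import os
import time
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
rank = dist.get_rank()
world = dist.get_world_size()
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))  # set by torchrun

tensor = torch.ones(256 * 1024 * 1024 // 4, device="cuda")  # 256 MB of FP32

# Warm up NCCL, then time one synchronized all-reduce.
for _ in range(5):
    dist.all_reduce(tensor)
torch.cuda.synchronize()
start = time.perf_counter()
dist.all_reduce(tensor)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

# Standard ring all-reduce bus-bandwidth estimate: 2*(n-1)/n * bytes / time.
size_bytes = tensor.numel() * 4
busbw = 2 * (world - 1) / world * size_bytes / elapsed / 1e9
if rank == 0:
    print(f"all-reduce 256 MB across {world} ranks: {busbw:.1f} GB/s bus bw")
dist.destroy_process_group()
```

Launched with, e.g., `torchrun --nnodes=2 --nproc_per_node=8 allreduce_probe.py`, a result far below the fabric's line rate suggests investigating RoCE settings or NCCL environment tuning before touching the training code.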
3. Thought Leadership & Enablement
- Content Creation: Develop whitepapers, reference architectures, and blog posts detailing OCI's competitive advantages in the sovereign AI cloud and private AI spaces.
- Market Intelligence: Stay ahead of the curve on the evolving landscape of AI accelerators, interconnects (InfiniBand vs. Ethernet), and distributed training frameworks (PyTorch, JAX, DeepSpeed).
Responsibilities
Required Technical Competencies
| Domain | Expertise Required |
| --- | --- |
| GPU Architecture | Deep knowledge of CUDA cores, Tensor Cores, HBM3 memory, and NVLink/NVSwitch topologies. |
| Networking | Mastery of RDMA, RoCE, and high-speed fabric management for multi-node distributed training. |
| Storage | Experience with high-performance parallel file systems such as Lustre, Weka, or OCI's high-performance storage for feeding data to GPUs at scale. |
| Orchestration | Proficiency in Kubernetes (OKE) for AI, Slurm for batch job scheduling, and the NVIDIA GPU Operator. |
| AI Frameworks | Hands-on experience with PyTorch, TensorFlow, and distributed-computing libraries such as Megatron-LM (sketched below). |
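
For reference, the distributed-training pattern implied by that last row, in its simplest PyTorch form. This is a minimal sketch under a `torchrun` launch; the two-layer model is a hypothetical stand-in, and frameworks like Megatron-LM layer tensor and pipeline parallelism on top of this data-parallel baseline.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
torch.cuda.set_device(local_rank)

# Hypothetical stand-in model; each rank holds a full replica.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024), torch.nn.ReLU(), torch.nn.Linear(1024, 10)
).cuda()
ddp_model = DDP(model, device_ids=[local_rank])  # gradients sync via NCCL

optimizer = torch.optim.SGD(ddp_model.parameters(), lr=1e-3)
x = torch.randn(64, 1024, device="cuda")
y = torch.randint(0, 10, (64,), device="cuda")

loss = torch.nn.functional.cross_entropy(ddp_model(x), y)
loss.backward()   # gradient all-reduce overlaps with the backward pass
optimizer.step()
dist.destroy_process_group()
```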
Candidate Qualifications
- Education: Bachelor's or Master's degree in Computer Science, Electrical Engineering, or a related quantitative field.
- Experience: 10 years in Pre-Sales Engineering, Systems Architecture, or HPC, with at least 3 years specifically focused on GPU-accelerated computing.
- The OCI Edge: Familiarity with OCI's off-box virtualization and how it enables bare-metal performance in a cloud environment.
- Communication: The ability to explain the difference between latency and throughput to a CTO while also being able to debug a Python script with a data scientist.
Qualifications
Career Level - IC5
Required Experience:
Staff IC