P-1285
About This Role
As a staff software engineer for GenAI Performance and Kernel, you will own the design, implementation, optimization, and correctness of the high-performance GPU kernels powering our GenAI inference stack. You will lead the development of highly tuned, low-level compute paths, manage trade-offs between hardware efficiency and generality, and mentor others in kernel-level performance engineering. You will work closely with ML researchers, systems engineers, and product teams to push the state of the art in inference performance at scale.
What You Will Do
- Lead the design, implementation, benchmarking, and maintenance of core compute kernels (e.g., attention, MLP, softmax, layernorm, memory management) optimized for various hardware backends (GPUs and other accelerators)
- Drive the performance roadmap for kernel-level improvements: vectorization, tensorization, tiling, fusion, mixed precision, sparsity, quantization, memory reuse, scheduling, auto-tuning, etc.
- Integrate kernel optimizations with higher-level ML systems
- Build and maintain profiling, instrumentation, and verification tooling to detect correctness and performance regressions, numerical issues, and hardware utilization gaps
- Lead performance investigations and root-cause analysis of inference bottlenecks, e.g., memory bandwidth, cache contention, kernel launch overhead, and tensor fragmentation
- Establish coding patterns, abstractions, and frameworks to modularize kernels for reuse, cross-backend portability, and maintainability
- Influence system architecture decisions to make kernel improvements more effective (e.g., memory layout, dataflow scheduling, kernel fusion boundaries)
- Mentor and guide other engineers working on low-level performance, provide code reviews, and help set best practices
- Collaborate with infrastructure, tooling, and ML teams to roll out kernel-level optimizations to production and monitor their impact
What We Look For
- BS/MS/PhD in Computer Science or a related field
- Deep hands-on experience writing and tuning compute kernels (CUDA, Triton, OpenCL, LLVM IR, assembly, or similar) for ML workloads
- Strong knowledge of GPU/accelerator architecture: warp structure, memory hierarchy (global memory, shared memory, registers, L1/L2 caches), tensor cores, scheduling, SM occupancy, etc.
- Experience with advanced optimization techniques: tiling, blocking, software pipelining, vectorization, fusion, loop transformations, and auto-tuning
- Familiarity with ML-specific kernel libraries (cuBLAS, cuDNN, CUTLASS, oneDNN, etc.) or open-source kernels
- Strong debugging and profiling skills (Nsight, nvprof, perf, VTune, custom instrumentation)
- Experience reasoning about numerical stability, mixed precision, quantization, and error propagation
- Experience integrating optimized kernels into real-world ML inference systems; exposure to distributed inference pipelines, memory management, and runtime systems
- Experience building high-performance products that leverage GPU acceleration
- Excellent communication and leadership skills; able to drive design discussions, mentor colleagues, and make trade-offs visible
- A track record of shipping performance-critical, high-quality production software
- Bonus: publications in systems/ML performance venues (e.g., MLSys, ASPLOS, ISCA, PPoPP); experience with custom accelerators or FPGAs; experience with sparsity or model compression techniques
Required Experience:
Staff IC