GPU Optimization Engineer

Techire Ai

Job Location:

San Francisco, CA - USA

Monthly Salary: Not Disclosed
Posted on: 2 days ago
Vacancies: 1 Vacancy

Job Description

GPU Optimisation Engineer: Real-Time Inference

Want to push GPU performance to its limits, not in theory but in production systems handling real-time speech and multimodal workloads?

This team is building low-latency AI systems where milliseconds actually matter. The target isn't "faster than baseline". It's sub-50ms time-to-first-token at 100 concurrent requests on a single H100, while maintaining model quality.

They're hiring a GPU Optimisation Engineer who understands GPUs at an architectural level. Someone who knows where performance is really lost: memory hierarchy, kernel launch overhead, occupancy limits, scheduling inefficiencies, KV cache behaviour, attention paths. The work sits close to the metal, inside inference execution: not general infra, not model research.

You'll operate across the kernel and runtime layers, profiling large-scale speech and multimodal models end-to-end and removing bottlenecks wherever they appear.

What you'll work on

  • Profiling GPU bottlenecks across memory bandwidth, kernel fusion, quantisation and scheduling

  • Writing and tuning custom CUDA / Triton kernels for performance-critical paths

  • Improving attention, decoding and KV cache efficiency in inference runtimes

  • Modifying and extending vLLM-style systems to better suit real-time workloads

  • Optimising models to fit GPU memory constraints without degrading output quality

  • Benchmarking across NVIDIA GPUs (with exposure to AMD and other accelerators over time)

  • Partnering directly with research to turn new model ideas into fast production-ready inference

This is hands-on optimisation work across the stack. No layers of bureaucracy. No platform ownership theatre. Just deep performance engineering applied to models that are actively evolving.

What tends to work well

  • Strong experience with CUDA and/or Triton

  • Deep understanding of GPU execution (memory hierarchy, scheduling, occupancy, concurrency)

  • Experience optimising inference latency and throughput for large generative models

  • Familiarity with attention kernels, decoding paths or LLM-style runtimes

  • Comfort profiling with low-level GPU tooling

The company is revenue-generating, its models are used by global enterprises, and the SF R&D team is expanding following a recent raise. This is growth hiring, not backfill.

Package & location

  • Base salary: up to $300,000 (negotiable based on depth)

  • Equity: Meaningful stock

  • Location: San Francisco preferred (relocation and visa sponsorship can be provided)

If you care about real-time constraints, GPU architecture, and squeezing every last millisecond out of large models, this is worth a conversation.

All applicants will receive a response.


Required Experience:

IC
