Senior Researcher AI Computing Systems

Huawei Switzerland

Job Location: Zürich, Switzerland

Monthly Salary: Not Disclosed
Posted on: 18 hours ago
Vacancies: 1 Vacancy

Job Summary

Huawei envisions a world where technology connects people, empowers industries, and unlocks human potential. Guided by its mission to enrich lives through communication and intelligent innovation, Huawei stands at the forefront of global digital transformation. As a leader in Information and Communications Technology (ICT), the company pioneers breakthroughs in artificial intelligence, cloud computing, and smart devices - building the intelligent foundation of a fully connected world.

Through its Carrier, Enterprise, and Consumer business groups, Huawei delivers resilient digital infrastructure, advanced cloud and AI platforms, and transformative devices that enable progress at every level. Supporting 45 of the world's top 50 telecom operators and serving one-third of the global population across more than 170 countries, Huawei is shaping a future where connectivity becomes a powerful catalyst for opportunity and sustainable growth.

This spirit of bold innovation is embodied by Huawei Technologies Switzerland AG. From its research hubs in Zurich and Lausanne, pioneering teams push the boundaries of High-Performance Computing, Computer Architecture, Computer Vision, Robotics, Artificial Intelligence, Neuromorphic Computing, Wireless Technologies, and Networking - architecting the intelligent systems that will define tomorrow's digital era.

We are looking for a strong researcher with hands-on LLM and RAG experience who can help build and optimize techniques such as KV-cache precomputation, KV reuse/blending (e.g. CacheBlend-style), and sparse attention / selective recompute. You will work close to the metal (attention kernels, profiling) and at the system level (vLLM/LMCache-style stacks), turning research ideas into robust, high-performance code.
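For illustration only, a minimal sketch of the document-KV precomputation idea (single attention head, toy weights, no positional or causal handling); the function names are placeholders and not part of vLLM or LMCache:

```python
# Minimal, illustrative sketch of per-document KV precomputation and reuse
# for RAG prefill. All names (precompute_kv, attend_with_cached_kv) are
# hypothetical; real stacks (vLLM, LMCache) manage paged caches, positions,
# and causal masks, which this toy example omits.
import torch

D_MODEL = 64
Wq = torch.randn(D_MODEL, D_MODEL) / D_MODEL**0.5
Wk = torch.randn(D_MODEL, D_MODEL) / D_MODEL**0.5
Wv = torch.randn(D_MODEL, D_MODEL) / D_MODEL**0.5

def precompute_kv(doc_tokens: torch.Tensor):
    """Offline: project a document's token embeddings to (K, V) once."""
    return doc_tokens @ Wk, doc_tokens @ Wv

def attend_with_cached_kv(query_tokens: torch.Tensor, cached_kv):
    """Online: prefill only the query; reuse the document's cached K/V."""
    k_doc, v_doc = cached_kv
    q = query_tokens @ Wq
    k = torch.cat([k_doc, query_tokens @ Wk], dim=0)
    v = torch.cat([v_doc, query_tokens @ Wv], dim=0)
    attn = torch.softmax(q @ k.T / D_MODEL**0.5, dim=-1)
    return attn @ v

doc = torch.randn(512, D_MODEL)   # retrieved document chunk (already embedded)
query = torch.randn(16, D_MODEL)  # user question
kv_cache = precompute_kv(doc)     # done once, ahead of serving
out = attend_with_cached_kv(query, kv_cache)
print(out.shape)  # torch.Size([16, 64])
```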

Responsibilities:

  • Design and implement RAG acceleration techniques that reduce TTFT (time to first token) and improve throughput (e.g. document KV precomputation, reuse, caching policies).

  • Develop KV-cache reuse / blending pipelines and integrate them into inference stacks (batching, paging, eviction, correctness/quality trade-offs).

  • Implement and optimize sparse attention / selective attention paths, including mask construction and block-granularity strategies (a minimal sketch follows this list).

  • Work with PyTorch and modern attention backends/kernels (e.g. FlashAttention / FlashInfer-like kernels), profiling and optimizing performance.

  • Stay up to date with the latest research and open-source progress in LLM inference, KV caching, and RAG systems, and translate it into practical improvements.
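
A minimal sketch of block-granularity mask construction, assuming a simple "first block plus local window" policy chosen purely for illustration; in a real system the equivalent logic is pushed into the attention kernels so skipped blocks are never computed:

```python
# Illustrative block-granularity sparse-attention mask (hypothetical policy:
# each query block attends to the first "sink" block plus a local window).
import torch

def block_sparse_mask(n_tokens: int, block: int, window_blocks: int) -> torch.Tensor:
    n_blocks = (n_tokens + block - 1) // block
    keep = torch.zeros(n_blocks, n_blocks, dtype=torch.bool)
    for qb in range(n_blocks):
        keep[qb, 0] = True                   # always keep the first block
        lo = max(0, qb - window_blocks + 1)
        keep[qb, lo:qb + 1] = True           # local causal window of blocks
    # Expand block-level decisions to token granularity and apply causality.
    mask = keep.repeat_interleave(block, 0).repeat_interleave(block, 1)
    causal = torch.ones(n_tokens, n_tokens, dtype=torch.bool).tril()
    return mask[:n_tokens, :n_tokens] & causal  # True = attend, False = skip

m = block_sparse_mask(n_tokens=10, block=4, window_blocks=2)
print(m.int())
```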

Qualifications:

  • PhD in Computer Science, Electrical Engineering, or a related field.

  • Strong software engineering skills in Python with substantial PyTorch experience (model internals, attention/KV-cache concepts, performance-aware coding).

  • Solid understanding of transformer inference fundamentals: prefill vs. decode, KV-cache layout, masking, batching, latency/throughput trade-offs (see the toy loop after this list).

  • Experience benchmarking and profiling AI/LLM workloads and diagnosing performance bottlenecks.

  • Strong communication skills and comfort collaborating across research and engineering.
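
As a reference point for these fundamentals, a toy prefill/decode loop showing where the KV cache comes from and why prefill latency dominates TTFT; single head, no batching or paging, illustrative weight names only:

```python
# Toy prefill/decode loop to illustrate KV-cache growth. Not a real serving
# loop: one head, no paging, no sampling, placeholder weights.
import torch

d = 32
Wq, Wk, Wv = (torch.randn(d, d) / d**0.5 for _ in range(3))

def prefill(prompt: torch.Tensor):
    """Process the whole prompt at once; return output and the KV cache."""
    q, k, v = prompt @ Wq, prompt @ Wk, prompt @ Wv
    mask = torch.ones(len(prompt), len(prompt)).tril().log()  # causal mask
    out = torch.softmax(q @ k.T / d**0.5 + mask, dim=-1) @ v
    return out, (k, v)

def decode_step(x_new: torch.Tensor, kv):
    """Process one new token; attend over the cached keys/values."""
    k, v = kv
    k = torch.cat([k, x_new @ Wk]); v = torch.cat([v, x_new @ Wv])
    q = x_new @ Wq
    out = torch.softmax(q @ k.T / d**0.5, dim=-1) @ v
    return out, (k, v)

prompt = torch.randn(128, d)
_, kv = prefill(prompt)            # this step dominates time to first token
for _ in range(4):                 # decode: one token per step, cache grows
    _, kv = decode_step(torch.randn(1, d), kv)
print(kv[0].shape)  # torch.Size([132, 32])
```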

Preferred Qualifications (Nice to Have):

  • Experience with vLLM and/or LMCache (integration, debugging, extending attention/KV-cache logic).

  • Familiarity with attention kernel stacks and customization (FlashAttention/FlashInfer, Triton, CUDA extensions, custom ops).

  • Practical experience building RAG pipelines (retrieval, chunking, indexing, reranking) and understanding of how retrieval interacts with inference latency (a toy retrieval example follows this list).

  • Contributions to open-source projects, or publications/technical reports in AI systems, LLM inference, caching, or storage-aware ML systems.

  • Systems background (Linux performance engineering, storage/IO, memory hierarchy) and comfort working close to hardware.
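
For orientation, a toy retrieval step (chunking, embedding with a stand-in encoder, top-k by cosine similarity); a production pipeline would use a real embedding model and an ANN index, and the retrieved chunks are exactly the prefix whose KV cache the techniques above try to avoid recomputing:

```python
# Toy retrieval step for a RAG pipeline. ToyEncoder is a stand-in for a real
# sentence-embedding model (hashing + random projection), used only so the
# sketch is self-contained.
import torch

def chunk(text: str, size: int = 40) -> list[str]:
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

class ToyEncoder:
    """Placeholder encoder: hash words into a vocab, average random embeddings."""
    def __init__(self, dim: int = 64, vocab: int = 4096):
        self.proj = torch.randn(vocab, dim)
    def __call__(self, texts: list[str]) -> torch.Tensor:
        embs = []
        for t in texts:
            ids = torch.tensor([hash(w) % self.proj.shape[0] for w in t.split()])
            embs.append(self.proj[ids].mean(0))
        return torch.nn.functional.normalize(torch.stack(embs), dim=-1)

corpus = "KV cache reuse lowers time to first token for long contexts " * 20
chunks = chunk(corpus)
enc = ToyEncoder()
index = enc(chunks)                                   # built offline
scores = index @ enc(["how to reduce TTFT?"]).T       # online: one matmul
top = scores.squeeze(-1).topk(k=min(3, len(chunks))).indices
print([chunks[i][:30] for i in top.tolist()])
```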

Why join us:

  • Collaborate with world-class scientists and engineers in an open, curiosity-driven environment.

  • Access to state-of-the-art technology and tools.

  • Opportunities for professional growth and development.

  • Competitive salary and a high quality of life in Zurich at the center of Europe.

  • Last but certainly not least: be part of innovative projects that make a difference.

Key Skills

  • Arm
  • Machine Learning
  • AI
  • C/C++
  • R
  • Research Experience
  • Research & Development
  • Assembly
  • Semantic Web
  • Vulnerability Research