Specific contributions expected from the role:
- Infrastructure as Code (IaC).
- Inference Optimization: Develop and optimize high-throughput, low-latency inference engines for LLMs (e.g., Llama 3, Mistral) using C++ and CUDA.
- Performance Engineering: Profile and eliminate bottlenecks across the software stack, from Python-level orchestration down to GPU kernel execution (see the profiling sketch after this list).
- Memory Management: Implement advanced memory techniques such as KV cache optimization, PagedAttention, and model quantization (INT8/FP8/AWQ) to maximize hardware utilization (quantization and block-table sketches follow this list).
- Distributed Systems: Architect and maintain distributed serving systems capable of multi-node, multi-GPU inference using technologies such as Ray, vLLM, or TGI (see the vLLM sketch after this list).
- Framework Integration: Build and maintain high-performance Python bindings (pybind11) for C++ backends to expose system-level optimizations to the AI research team.
- Tooling & Observability: Build custom profiling tools and dashboards to monitor TTFT (Time to First Token), throughput, and hardware telemetry (e.g., NVIDIA SMI). A TTFT measurement sketch follows this list.
- Proficiency in and hands-on experience with large models and deep neural networks.
- Expertise in large language models (LLMs).
- Extensive experience in system and platform architecture.
- Development experience, preferably in memory, storage, or other embedded systems.
- In-depth knowledge of and extensive experience with standardization efforts, technical papers, and patents.
- Extensive experience with C/C++ and Python programming.
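A few illustrative sketches of the techniques named above follow, all in Python and all hypothetical rather than drawn from any specific codebase. First, the Performance Engineering bullet: the Python-orchestration half of the stack can be profiled with the standard library alone (GPU kernels need tools like Nsight), and a cProfile pass like this often surfaces scheduling overhead first. The `orchestrate` function is a placeholder:

```python
import cProfile
import pstats

def orchestrate() -> int:
    # Hypothetical stand-in for Python-level request batching/scheduling work.
    return sum(i * i for i in range(1_000_000))

# Profile the statement and dump stats to a file, then print the hotspots.
cProfile.run("orchestrate()", "orchestrate.prof")
stats = pstats.Stats("orchestrate.prof")
stats.sort_stats("cumulative").print_stats(5)  # top 5 entries by cumulative time
```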
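For the quantization side of the Memory Management bullet, here is a minimal sketch of symmetric per-tensor INT8 quantization. Production engines typically quantize per-channel and fuse dequantization into the GPU kernels; the function names here are illustrative:

```python
import numpy as np

def quantize_int8(weights: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric per-tensor INT8: map [-amax, amax] onto [-127, 127]."""
    amax = float(np.max(np.abs(weights)))
    scale = amax / 127.0 if amax > 0 else 1.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an FP32 approximation of the original tensor."""
    return q.astype(np.float32) * scale

w = np.random.randn(512, 512).astype(np.float32)
q, scale = quantize_int8(w)
err = np.max(np.abs(w - dequantize_int8(q, scale)))
print(f"max round-trip error: {err:.4f} (bounded by scale/2 = {scale / 2:.4f})")
```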
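The PagedAttention idea in the same bullet reduces to indirecting KV-cache reads through a block table: each sequence allocates fixed-size blocks on demand instead of reserving a contiguous worst-case region, and finished sequences return blocks to the pool. A toy model of that indexing (block size and class names are invented for illustration):

```python
BLOCK_SIZE = 16  # tokens per KV-cache block (illustrative)

class PagedKVCache:
    """Toy block-table indirection in the spirit of PagedAttention."""

    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))            # physical block pool
        self.block_tables: dict[int, list[int]] = {}   # seq_id -> block ids

    def append_token(self, seq_id: int, pos: int) -> tuple[int, int]:
        """Return (physical_block, offset) holding token `pos` of `seq_id`."""
        table = self.block_tables.setdefault(seq_id, [])
        if pos // BLOCK_SIZE == len(table):            # crossed a block boundary
            table.append(self.free.pop())              # allocate on demand
        return table[pos // BLOCK_SIZE], pos % BLOCK_SIZE

    def release(self, seq_id: int) -> None:
        """Return a finished sequence's blocks to the pool immediately."""
        self.free.extend(self.block_tables.pop(seq_id, []))

cache = PagedKVCache(num_blocks=64)
for pos in range(40):                                  # 40 tokens -> 3 blocks
    block, off = cache.append_token(seq_id=0, pos=pos)
cache.release(0)
```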
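For the Distributed Systems bullet, vLLM's offline API gives a feel for the serving layer involved; the model name and parallelism degree below are placeholders, and a true multi-node deployment additionally runs on top of a Ray cluster:

```python
from vllm import LLM, SamplingParams

# Placeholder model and parallelism degree; tensor_parallel_size shards the
# model's weights across GPUs on one node.
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", tensor_parallel_size=2)
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Explain PagedAttention in one sentence."], params)
for out in outputs:
    print(out.outputs[0].text)
```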
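Finally, for the Tooling & Observability bullet: TTFT is the latency from request submission to the first streamed token, so a measurement harness only needs to timestamp the stream. `stream_tokens` is a hypothetical stand-in for whatever streaming client the engine exposes:

```python
import time
from typing import Iterator

def stream_tokens(prompt: str) -> Iterator[str]:
    # Hypothetical stand-in for the engine's streaming client.
    for tok in ["Hello", ",", " world"]:
        time.sleep(0.05)  # simulated per-token decode latency
        yield tok

def measure_ttft(prompt: str) -> None:
    start = time.perf_counter()
    ttft = None
    count = 0
    for count, _ in enumerate(stream_tokens(prompt), start=1):
        if ttft is None:
            ttft = time.perf_counter() - start  # time to first token
    total = time.perf_counter() - start
    print(f"TTFT: {ttft * 1000:.1f} ms, throughput: {count / total:.1f} tok/s")

measure_ttft("ping")
```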