Position Overview
System Software Engineers in this role operate at the intersection of LLM inference optimization and novel hardware bringup, co-designing software abstractions with the hardware architecture team. You will extend leading open-source inference engines for CXL-aware memory management and build an open software layer that enables any host server to leverage a CXL-attached KV-cache accelerator, including cryptographic acceleration for confidential LLM inference on sensitive enterprise workloads.
Key Responsibilities
Extend advanced attention mechanisms in leading inference engines for CXL-based, block-level KV-cache offloading, enabling seamless hot/cold tiering between local high-bandwidth memory and CXL-attached DDR5 pools on the target hardware platform.
Design and implement the Open KV Connector (OKC) protocol stack, including host-side drivers and device-side firmware, so that inference engines can treat the platform as a first-class CXL memory expander.
Benchmark and optimize TTFT (time-to-first-token), throughput, and memory utilization on RISC-V / CXL hardware, and build automated performance regression suites.
Implement prefix caching, KV quantization (FP8/INT4), and speculative KV eviction policies tuned for CXL latency characteristics.
Integrate FHE- and TEE-based cryptographic acceleration for confidential inference workloads, including homomorphic attention computation prototypes using modern open-source FHE frameworks.
Collaborate with hardware architects on NDP (near-data processing) programming models and define software APIs for near-memory compute offload.
Contribute upstream to major LLM inference and RISC-V software ecosystem projects to establish the company as an open-source thought leader.
Support POC deployments with large cloud partners, diagnosing and resolving performance bottlenecks in production-like inference environments.
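The hot/cold tiering responsibility above can be illustrated with a minimal sketch. All names here (`TieredKVCache`, the byte-blob "blocks") are hypothetical, and the two Python dicts merely stand in for local high-bandwidth memory and a CXL-attached pool; a real implementation would move pages between physical tiers rather than between containers.

```python
from collections import OrderedDict

class TieredKVCache:
    """Toy two-tier KV-cache: a small 'hot' tier standing in for local
    high-bandwidth memory and a larger 'cold' tier standing in for a
    CXL-attached DDR5 pool. Blocks are promoted to hot on access, and
    the least-recently-used hot block is demoted when the hot tier fills."""

    def __init__(self, hot_capacity: int):
        self.hot_capacity = hot_capacity
        self.hot: OrderedDict[int, bytes] = OrderedDict()  # block_id -> KV block
        self.cold: dict[int, bytes] = {}                   # spill tier

    def put(self, block_id: int, block: bytes) -> None:
        self.hot[block_id] = block
        self.hot.move_to_end(block_id)                     # mark most recent
        while len(self.hot) > self.hot_capacity:
            victim, data = self.hot.popitem(last=False)    # LRU victim
            self.cold[victim] = data                       # demote to cold tier

    def get(self, block_id: int) -> bytes:
        if block_id in self.hot:
            self.hot.move_to_end(block_id)                 # refresh recency
            return self.hot[block_id]
        data = self.cold.pop(block_id)                     # cold hit:
        self.put(block_id, data)                           # promote to hot
        return data
```

With `hot_capacity=2`, inserting blocks 1, 2, 3 demotes block 1 to the cold tier; a subsequent `get(1)` promotes it back and demotes block 2. Real policies (speculative eviction, latency-aware placement) would replace the plain LRU here.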
Required Skills & Experience
5 years in systems software with a focus on LLM inference optimization; direct experience with vLLM, TensorRT-LLM, SGLang, or comparable inference engines is required.
Expert-level C and/or Rust skills, plus strong Python for benchmarking tooling and experimentation.
Deep understanding of KV-cache management algorithms: PagedAttention, prefix caching, chunked prefill, speculative decoding, and grouped-query attention (GQA/MQA).
Practical experience with custom memory allocators, NUMA-aware programming, and direct device-memory access patterns.
Familiarity with PCIe/CXL-style driver development, whether in the Linux kernel or in userspace accelerator frameworks.
Working knowledge of FHE schemes (CKKS, BFV/BGV) and TEE-based confidential computing (e.g., Intel TDX, AMD SEV-SNP) as applied to ML inference workloads.
Strong debugging skills, including use of perf, flamegraphs, memory profilers, and cycle-accurate or detailed simulators.
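The prefix-caching requirement above rests on one core idea: identifying reusable KV blocks by hashing token blocks in a chain, so a block's identity covers its entire prefix. The sketch below is a simplified illustration of that scheme (block size, hash choice, and the `block_hashes` name are all assumptions, not any engine's actual API).

```python
import hashlib

BLOCK = 4  # toy block size in tokens; real engines use larger blocks

def block_hashes(tokens: list[int]) -> list[str]:
    """Chained per-block hashes for prefix caching: each full block's id
    depends on its own tokens AND every preceding token, so two requests
    can share a cached block exactly when they share that entire prefix.
    Trailing partial blocks are not hashed (they are not cacheable yet)."""
    hashes, parent = [], b""
    for i in range(0, len(tokens) - len(tokens) % BLOCK, BLOCK):
        chunk = ",".join(map(str, tokens[i:i + BLOCK])).encode()
        parent = hashlib.sha256(parent + chunk).digest()   # chain to prefix
        hashes.append(parent.hex()[:12])
    return hashes
```

Two prompts that share a system-prompt prefix produce identical leading hashes and thus hit the same cache blocks; the first divergent block changes every hash after it, which is exactly the invalidation behavior a prefix cache needs.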
Preferred Qualifications
Proven open-source contributions to vLLM, SGLang, or the broader RISC-V software ecosystem.
Experience with hardware bringup, including work on pre-silicon RTL simulation or early-silicon platforms.
Background in distributed inference: disaggregated prefill/decode, pipeline parallelism, and tensor parallelism across CXL-connected nodes.
Familiarity with competing accelerator architectures, with the ability to identify where OKC and the underlying platform must differentiate to drive enterprise adoption.
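The disaggregated prefill/decode pattern listed above separates the compute-heavy prompt pass from token-by-token generation, handing the KV-cache across the fabric between them. The sketch below shows only the shape of that handoff; the worker names, the `pickle` wire format, and the placeholder arithmetic standing in for attention and sampling are all illustrative assumptions.

```python
import pickle

def prefill_worker(prompt_tokens: list[int]) -> bytes:
    """Prefill node: build the full-prompt KV-cache in one batched pass,
    then serialize it for transfer to a decode node. The per-token (k, v)
    pairs are placeholders for real attention-layer outputs."""
    kv_cache = [(t + 1, t * 2) for t in prompt_tokens]
    return pickle.dumps(kv_cache)              # wire format is an assumption

def decode_worker(kv_blob: bytes, steps: int) -> list[int]:
    """Decode node: deserialize the received cache and extend it one token
    at a time, never re-running any prefill work locally."""
    kv_cache = list(pickle.loads(kv_blob))
    generated = []
    for _ in range(steps):
        nxt = sum(k for k, _ in kv_cache) % 100  # placeholder "sampling"
        generated.append(nxt)
        kv_cache.append((nxt + 1, nxt * 2))      # cache grows during decode
    return generated
```

The design point this illustrates: once the cache is a transferable object, prefill and decode can run on differently provisioned nodes, which is what makes a CXL-attached KV-cache tier attractive for the decode side.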