Tzafon is a foundation model lab building scalable compute systems and advancing machine intelligence, with offices in San Francisco, Zurich, and Tel Aviv. We've raised over $12M in funding to advance our mission of expanding the frontiers of machine intelligence.
We're a team of engineers and scientists with deep backgrounds in ML infrastructure and research. Founded by IOI and IMO medalists, PhDs, and alumni of leading tech companies such as Google DeepMind, Character, and NVIDIA, we train models and build infrastructure for swarms of agents to automate work across real-world environments.
You'll work between our product and post-training teams to ship Large Action Models that actually work: build evals, benchmarks, and fine-tuning pipelines; define what good model behavior means; and make it happen at scale.
What you'll do
Design and execute large-scale training runs on our clusters
Build and optimize distributed training infrastructure across massive multi-node systems
Implement post-training pipelines at scale
Develop data pipelines that process and filter trillions of tokens for pre-training
Research and implement architectural improvements, scaling laws, and training optimizations
Debug training instabilities, loss spikes, and convergence issues in long-running jobs
Build tooling for cluster utilization, fault tolerance, and checkpoint management
Write custom CUDA/Triton kernels to optimize critical training operations (attention, normalization, activations)
Collaborate on research that advances the state of the art in foundation model training
We're looking for
Deep experience pre-training or post-training foundation models on large clusters
Expert-level Python and deep fluency with ML frameworks (PyTorch, JAX, torchtitan)
Strong systems skills: distributed training, FSDP/ZeRO, tensor parallelism, pipeline parallelism
Experience writing performant CUDA or Triton kernels for ML workloads
Track record of running stable multi-week training jobs and debugging distributed training failures
Understanding of cluster scheduling, networking bottlenecks, and GPU/TPU performance optimization
Preferred Experience
Trained foundation models at major AI labs (OpenAI, Anthropic, Google DeepMind, Meta, xAI, etc.)
Worked on large-scale RL runs
Optimized critical training kernels (FlashAttention, fused optimizers, custom kernels)
Published research at top ML conferences (NeurIPS, ICML, ICLR)
Contributions to open-source ML infrastructure (PyTorch, JAX, vLLM, etc.)
Experience with training data pipelines, data quality research, or synthetic data generation
Life at Tzafon
Full medical, dental, and vision coverage, plus 401(k) in the US
Offices in SF, Zurich, and Tel Aviv
Early-stage equity in a future-defining company
Visa sponsorship: We do sponsor visas! However, we aren't able to successfully sponsor visas for every role and every candidate. If we make you an offer, we will make every reasonable effort to get you a visa, and we retain an immigration lawyer to help with this.
Compensation ranges from $200k to $500k plus an equity package, depending on experience and location.
We also offer a $5k referral bonus for successful hires (send to ).
Required Experience: Staff IC