About the Role
You've likely scaled systems most people only read about.
We're entering a phase where infrastructure decisions directly shape model capability, not just cost or reliability. The next gains won't come from brute force alone, but from better systems, sharper tradeoffs, and sound judgment under real constraints.
As VP AI Engineering (Pre-Training), you'll define how large-scale training actually works here, from compute strategy and data movement to iteration speed. This role is for someone who has been close to the metal, seen what breaks at scale, and knows how to design systems that hold up under pressure.
Infrastructure in this role isn't a support function. It's a first-order driver of model outcomes. You'll be accountable for whether training runs succeed, stall, or fail, and for the decisions behind those outcomes.
This is a role for builders who have owned training systems end-to-end, not for abstract platform or governance leadership.
What You Will Lead
You will own the entire pre-training foundation, including:
Large-scale compute strategy (GPU clusters in the thousands)
Training orchestration, throughput, and efficiency
PB-scale data pipelines, storage, and data movement
Reliability and performance of long-running training workloads
Infrastructure decisions that directly impact model velocity and quality
This is not a "keep the lights on" role. You'll be expected to:
Make hard architectural calls with incomplete information
Decide where to invest, where to simplify, and where to say no
Work closely with modeling and post-training leaders to remove system-level constraints
Stay close enough to the systems to know when theory diverges from reality
Many of the decisions you'll make here are difficult to reverse. You'll commit to architectures, tooling, and operating models that must hold up in production training runs measured in weeks, not just in design documents.
You'll build a lean team around you, but this role succeeds through clarity and leverage, not headcount.
What You Bring
You've likely done several of the following already:
Designed, operated, or materially evolved large GPU training clusters where you were accountable for throughput, failure modes, and iteration speed, not just budget or vendor relationships
Built or scaled training infrastructure used by advanced ML teams in production
Worked with data pipelines measured in petabytes, not terabytes
Balanced speed, cost, and reliability under real delivery pressure
Made architectural decisions that held up months or years later
You've been close enough to the system to know where theoretical efficiency breaks down in practice, and you've adjusted accordingly.
We care less about where you worked and more about what you personally owned, especially when things didn't go to plan.
Why This Is Different
Fewer layers between decision and execution
Direct influence on model capability, not just infrastructure metrics
A system still early enough to bend but serious enough to matter
A founder-led environment where clarity, speed, and judgment outweigh process
Meaningful equity and real ownership, not cosmetic leadership
If you've ever wanted to apply everything you've learned without the inertia of a massive organization, this is that moment.
What You Can Expect from Us
Become part of an AI revolution!
30 days of paid vacation
Access to a variety of fitness & wellness offerings via Wellhub
Mental health support through
JobRad Bike Lease
Substantially subsidized company pension plan for your future security
Subsidized Germany-wide transportation ticket
Budget for additional technical equipment
Flexible working hours for better work-life balance and a hybrid working model
Virtual Stock Option Plan
Required Experience:
Exec