Senior AI Engineer – Pre-training Data (f/m/d)
Heidelberg, Germany
Job Summary
Our Mission
Aleph Alpha is one of the few companies in Europe doing serious foundation model pre-training. Our customers - in finance, manufacturing, and public administration - need models that understand German, meet European regulatory requirements, and work reliably in high-stakes settings. We're building that in Heidelberg.
We're growing our pre-training team and hiring someone to work passionately on data: defining what goes into our models, building the systems that source and prepare it, and ensuring our training team has the highest-quality data to push model capabilities forward.
Team culture
At Aleph Alpha we foster a culture built on ownership, autonomy, and empowerment. Teams and individual contributors are trusted to take responsibility for their work and drive meaningful impact. We maintain a flat organizational structure with efficient, supportive management that enables quick decision-making, open communication, and a strong sense of shared purpose.
About the Role
As a Senior AI Engineer in Pre-training Data you will work across the full stack of data preparation - from sourcing and acquisition to processing, filtering, and mixture design. Some weeks you'll be deep in data quality analysis, understanding what makes a corpus valuable and how its composition affects downstream performance on public and bespoke evaluation tasks. Other weeks you'll be optimising large-scale processing pipelines or building tooling that gives the team visibility into what our models are actually training on. And some weeks you'll be reading the latest research on pre-training data methods, translating findings into experiments you can run against our stack.
We approach data work in an evidence-based way. Decisions about filtering strategies, data mixtures, and quality thresholds are backed by ablations - you'll design and run targeted experiments to validate that your data choices actually improve model outcomes.
We are looking for someone who combines significant research experience (in industry or academia) with high engineering competence.
Your work sits at high leverage: the data you source, curate, and synthesize directly determines what our models learn, how well they perform, and where they fall short. You'll have direct influence on the models we ship.
Your Responsibilities
Co-own data pipelines end-to-end: Design, build, and maintain the infrastructure that sources, processes, deduplicates, filters, and prepares pre-training corpora at scale. Own the conversion from curated corpora to training-ready streaming formats.
Curate and compose data mixtures: Define and iterate on the data blends used for pre-training - balancing domains, languages, quality tiers, and licensing requirements to maximise model capability.
Build data quality tooling: Develop classifiers, heuristics, and analysis frameworks that measure and enforce data quality across terabyte-scale corpora. Monitor pipeline health and data quality metrics at scale.
Close data gaps: Work with evaluation and post-training teams to identify where model weaknesses trace back to data coverage, then source or generate the data needed to address them.
Collaborate with post-training: Partner closely with the post-training team to ensure pre-training data decisions support downstream fine-tuning, alignment, and deployment goals - data choices upstream shape what's possible downstream.
Co-own German-language data: Ensure deep, high-quality coverage of German-language corpora - this is core to our value proposition, not an afterthought.
Establish data-to-performance signal: Design and run ablation studies to validate data choices - measuring how changes in composition, filtering, or sourcing affect pre-training evaluation metrics and downstream capabilities.
Take data transparency seriously: Maintain data lineage and provenance so the team knows exactly what went into each training run.
Your Profile
Basic Qualifications
Track record of shipping impactful technical work - whether that's research, infrastructure, or both.
Strong Python skills and comfort with data engineering and ML infrastructure, including experience with deep learning frameworks, workflow orchestration, object storage, columnar data formats, and distributed processing.
Ability to reason about what a dataset contributes to model training and whether it matters - to understand data, not just process it.
Ownership mentality: you see problems through from diagnosis to solution to deployment.
Willingness to relocate to Heidelberg or travel at least fortnightly.
Preferred Qualifications
Experience with large-scale data processing for ML, including corpus sourcing, curation, cleaning, deduplication, and filtering.
Familiarity with data quality methods: classifier-based filtering, heuristic scoring, perplexity-based selection, and decontamination.
Understanding of foundation model training - how data composition, scale, and mixing ratios affect capabilities.
Experience with web-scale data sourcing and crawl processing (e.g. Common Crawl WARC pipelines).
Rust proficiency (parts of our data pipeline are performance-critical).
Infrastructure knowledge - experience with Kubernetes, container orchestration, or cloud-native ML infrastructure.
PhD in machine learning, NLP, data engineering, or a related field (valued but not required - we care about what you can do).
Bonus but not required: German language proficiency, which can be helpful for curating and assessing German-language data.
Compensation and Benefits
Become part of an AI revolution!
30 days of paid vacation
Access to a variety of fitness & wellness offerings via Wellhub
Substantially subsidized company pension plan for your future security
Subsidized Germany-wide transportation ticket
Budget for additional technical equipment
Flexible working hours for better work-life balance and hybrid working model
Virtual Stock Option Plan
Required Experience:
Senior IC
About Company
Pioneering sovereign, European AI technology to transform human-machine interaction and find solutions for the challenges of tomorrow.