Senior AI Researcher, Pre-training Data (m/f/d)

Aleph Alpha

Job Location:

Heidelberg - Germany

Monthly Salary: Not Disclosed

Posted on: 30+ days ago

Vacancies: 1 Vacancy

Job Summary

Our Mission

Aleph Alpha is one of the few companies in Europe doing serious foundation model pre-training. Our customers - in finance, manufacturing, and public administration - need models that understand German, meet European regulatory requirements, and work reliably in high-stakes settings. We're building that in Heidelberg.

We're growing our pre-training team and hiring someone to work passionately on data: defining what goes into our models, building the systems that source and prepare it, and ensuring our training team has the highest-quality data to push model capabilities forward.

Team Culture

At Aleph Alpha, we foster a culture built on ownership, autonomy, and empowerment. Teams and individual contributors are trusted to take responsibility for their work and drive meaningful impact. We maintain a flat organisational structure with efficient, supportive management that enables quick decision-making, open communication, and a strong sense of shared purpose.

About the role

As a Senior AI Researcher for Pre-training Data, you will shape and improve the scientific methodology behind our pre-training corpora while also co-engineering the software and systems that enable this. Working with engineers and other researchers to build scalable pipelines, you will focus on the theoretical and empirical research required to understand which data makes models perform best on our targeted capabilities.

This role is for you if you have a strong background in large-scale language modeling and the scientific drive to answer complex questions about data scaling laws, synthetic data generation, and curriculum learning.

In your day-to-day, you will design targeted ablations across various scales, derive and test hypotheses from training dynamics, develop novel algorithms for estimating data quality and performing data curation, and contribute to a range of engineering tasks that facilitate these research directions. Together with a collaborative team of engineers and researchers, you will have a direct impact on the fundamental knowledge and capabilities of the models we ship. You will also contribute to or lead the writing of technical reports for internal and external readers, and present at and contribute to technical meetings and conferences as needed.

Your Responsibilities

Innovation in Data-Centric AI: Stay at the bleeding edge of foundation model research. You will identify, implement, and iterate on novel approaches to estimating data quality, synthetic data generation, curriculum learning, and advanced curation techniques.
Data-to-Performance Science: Design and lead rigorous ablation studies across various scales. You will systematically analyse how changes in data composition, deduplication strategies, heuristic and model-based curation, and scaling laws affect training dynamics and target model and system capabilities.
Develop Novel Quality Signals: Move beyond basic perplexity filtering. Research and build advanced algorithms to score and select data, such as influence functions, gradient-based matching, or using smaller models to curate data for larger ones.
Cross-Functional Collaboration: Partner closely with a diverse team to scale your research from prototypes to pipelines at the scale of trillions of tokens, and work with the post-training team to ensure pre-training distributions effectively support targeted fine-tuning and customer alignment.

Your Profile

Basic Qualifications

A deep understanding of machine learning theory, specifically regarding foundation model training dynamics, scaling laws, and data-centric AI.
Experience designing and evaluating complex ML experiments related to data composition, curriculum learning, or data quality in language model training.
Familiarity with statistical methods for evaluation and experiment design.
Ability to reason about the information-theoretic properties of a dataset and its predictive power for evaluated tasks: not just processing data, but understanding its signal.
Strong Python skills and comfort with ML tooling and deep learning frameworks (especially PyTorch).
Willingness to relocate to Heidelberg or travel at least fortnightly.

Preferred Qualifications

PhD in machine learning, NLP, or equivalent research experience focusing on large-scale language modeling or data curation.
A history of contributions to top-tier venues (NeurIPS, ICML, ICLR, ACL, etc.), specifically regarding data curation, scaling laws, synthetic data, or LLM pre-training.
Experience training foundation models from scratch and diagnosing data-induced training pathologies.
Bonus, but not required: German language proficiency, which can be helpful for curating and assessing German-language data.

Compensation and Benefits

Become part of an AI revolution!
30 days of paid vacation
Access to a variety of fitness & wellness offerings via Wellhub
Mental health support through
Substantially subsidized company pension plan for your future security
Subsidized Germany-wide transportation ticket
Budget for additional technical equipment
Flexible working hours for better work-life balance and hybrid working model
Virtual Stock Option Plan
JobRad Bike Lease

Required Experience:

Senior IC
