About You
Do you get joy from turning messy legal texts into clean, structured, high-quality datasets that actually improve model behavior? Do you like building pipelines where every step is measurable: extraction quality, citation correctness, dedup rate, cost per item, throughput, and regression stability? Are you comfortable shipping pragmatic tooling (CLIs, validators, tests) around LLMs without hand-waving away edge cases? If so, we'd love to hear from you.
About Omnilex
Omnilex is a young, dynamic AI legal tech startup with roots at ETH Zurich. Our passionate, interdisciplinary team of 10 is dedicated to empowering legal professionals in law firms and in-house legal teams by leveraging AI for legal research and for answering complex legal questions. We stand out by tackling unique challenges, including our combination of external data, customer-internal data, and our own innovative AI-first legal commentaries.
Your Mission: Synthetic Data for Legal AI
As an AI Engineer, Synthetic Data Generation, you will build and own pipelines that generate retrieval-ready, evaluation-grade synthetic datasets from real legal sources (court decisions, statutes, commentaries) across languages and jurisdictions, while keeping quality high and costs under control.
Tasks
Your Responsibilities
- Build multi-step generation pipelines (10 steps): from DB selection and pseudonymization through extraction, translation, normalization, deduplication, validation, classification, and rating, to export.
- Production-grade LLM integration: Design robust prompt suites for extraction, translation, classification, and rating; enforce structured JSON outputs; handle retries, partial failures, and weird model behavior.
- Quality assurance & filtering: Implement scoring systems (multi-criteria, consistent rubrics), dedup/near-dup suppression, and deterministic validators (especially for citations).
- Citation processing at legal-grade precision: Extract, normalize, and validate citations across languages and formats (e.g. "Art. 336c Abs. 1 OR", "BGE 137 III 266 E. 3.2"), including abbreviation mapping and normalization rules.
- Cost & throughput optimization: Use batch APIs where appropriate, tune reasoning effort, control concurrency, count tokens, and keep runs cost-efficient (without sacrificing quality).
- Developer tooling & CLI workflows: Build CLIs with progress tracking, configurable concurrency, and solid ergonomics for long-running jobs.
- Testing across levels: Write unit/smoke/integration tests for pipelines and validators (including mocked LLMs where sensible, and real API runs where needed).
- Cross-team collaboration: Work closely with legal experts to define what "good" looks like for exam questions/commentaries, and translate that into measurable QA checks.
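To give a flavor of the deterministic citation validation mentioned above, here is a minimal TypeScript sketch covering just the two citation shapes quoted in the responsibilities. The patterns are illustrative only, not Omnilex's actual code, and nowhere near full coverage of Swiss citation formats:

```typescript
// Illustrative deterministic validator for two Swiss citation shapes.
// Real coverage needs many more abbreviations, languages, and rules.

// Statute citations like "Art. 336c Abs. 1 OR"
const STATUTE = /^Art\.\s\d+[a-z]?(\sAbs\.\s\d+)?\s[A-Z]{2,5}$/;
// Federal court decisions like "BGE 137 III 266 E. 3.2"
const DECISION = /^BGE\s\d+\s(I{1,3}|IV|V)\s\d+(\sE\.\s\d+(\.\d+)*)?$/;

function isValidCitation(raw: string): boolean {
  // Normalize whitespace first, so "Art.  336c ..." still passes.
  const citation = raw.trim().replace(/\s+/g, " ");
  return STATUTE.test(citation) || DECISION.test(citation);
}
```

In practice, a validator like this runs as one deterministic pipeline step after LLM-based extraction, so citation correctness can be measured per run rather than spot-checked.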
Requirements
Minimum qualifications
- Experience building backend/data tooling with TypeScript (strict typing, generics, async patterns).
- Hands-on experience integrating LLM APIs (OpenAI/Anthropic or similar), including structured outputs (JSON), prompt iteration, and failure handling.
- Strong data pipeline mindset: ETL workflows, transformation steps, validation, and reproducibility.
- Solid SQL/PostgreSQL skills and experience with an ORM (bonus if Drizzle).
- Experience writing reliable tests (e.g. Jest) and maintaining CI-friendly pipelines.
- Fluent English; willing to work hybrid in Zurich (on-site at least two days/week), full-time.
Preferred qualifications
- Familiarity with the Swiss legal system (court structure, citation norms, multilingual legal terminology).
- Working proficiency in German; French and/or Italian on top is a strong advantage.
- Experience with batch processing and cost-aware LLM operations (token budgeting, batching strategy, caching, early exit).
- Practical text processing skills: regex-heavy parsing, dedup/near-dup detection, similarity search (e.g. BM25 / MiniSearch).
- Familiarity with our environment: Yarn workspaces/monorepos, NestJS, and pragmatic CLI tooling.
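To make the near-dup detection point concrete, here is a tiny dependency-free TypeScript sketch using word-shingle Jaccard similarity. The shingle size and threshold are arbitrary illustrative choices, not a statement about our pipeline:

```typescript
// Near-duplicate detection via Jaccard similarity on word 3-gram shingles.
// Shingle size and threshold are illustrative defaults.

function shingles(text: string, n = 3): Set<string> {
  const words = text.toLowerCase().split(/\s+/).filter(Boolean);
  const out = new Set<string>();
  for (let i = 0; i + n <= words.length; i++) {
    out.add(words.slice(i, i + n).join(" "));
  }
  return out;
}

function jaccard(a: Set<string>, b: Set<string>): number {
  if (a.size === 0 && b.size === 0) return 1;
  let inter = 0;
  for (const s of a) if (b.has(s)) inter++;
  return inter / (a.size + b.size - inter);
}

function isNearDuplicate(a: string, b: string, threshold = 0.8): boolean {
  return jaccard(shingles(a), shingles(b)) >= threshold;
}
```

At dataset scale one would pair this with an inverted index or MinHash-style candidate generation rather than comparing all pairs; that is exactly where tools like BM25 or MiniSearch come in.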
Benefits
- Direct impact: Your datasets will directly shape model quality and evaluation reliability in legal research and reasoning.
- Autonomy & ownership: Own the synthetic data pipeline end-to-end: prompts, validators, QA, exports, and cost controls.
- Team: Work with a sharp interdisciplinary group at the intersection of AI engineering and law.
- Compensation: CHF per month plus ESOP, depending on experience and skills.
We're excited to hear from candidates who love building robust, cost-aware LLM pipelines and care about precision (especially when citations and multilingual legal nuance matter). Apply today by pressing the Apply button.