AI Engineer for LLM Ops & Evaluation (m/f/d)

Auxilius.ai


Job Location:

Munich - Germany

Monthly Salary: Not Disclosed
Posted on: 3 hours ago
Vacancies: 1 Vacancy

Job Summary

You'll join an early-stage, AI-native startup with a product that has already proven market fit. We build cutting-edge AI solutions for Governance, Risk and Compliance (GRC) for enterprises around the world.

Our customers are auditors, risk managers, and compliance teams, which means evaluation rigor, auditability, and EU AI Act readiness aren't afterthoughts for us. They're product requirements.

Tasks

As our AI Engineer for LLMOps & Evaluation, you'll own the LLMOps pipeline end-to-end and work directly alongside our founding team.

You will:

  • Own the LLMOps pipeline: Evaluation infrastructure, the prompt optimization loop, and the production integration that turns experiments into reliable customer-facing features
  • Design evaluation strategy per output type: Decide when to use deterministic evals (exact match, schema validation, embeddings) vs. LLM-as-judge, and build the rubrics, test datasets, and human-review loops that make the system trustworthy
  • Drive prompt engineering and optimization across all LLM operations in the product: Moving from hand-tuned prompts to a measurable, iterative process
  • Pick the right tool for each problem: Some things are LLM problems, some are embedding or classical NLP problems, and some are deterministic logic
  • Run the production side of AI features: Observability (Langfuse/LangSmith/similar), cost and latency engineering, and incident response when an LLM feature degrades
  • Build human-in-the-loop workflows: Review queues, feedback ingestion, and labeling, so production signal feeds back into evals and prompt iteration
  • Mentor our AI & Analytics Intern and contribute to how we build the AI team over time
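To make the deterministic-vs-judge split above concrete, here is a minimal sketch (illustrative only, not Auxilius code; the helper names and the GRC-flavored rubric are invented for this example):

```python
import json

# Deterministic eval: exact-match and schema checks need no model call,
# so they are cheap, repeatable, and free of judge variance.
def deterministic_eval(output: str, expected: str, required_keys: set) -> dict:
    """Score a structured LLM output with deterministic checks."""
    try:
        parsed = json.loads(output)
        schema_ok = required_keys.issubset(parsed)
    except json.JSONDecodeError:
        schema_ok = False
    return {
        "exact_match": output.strip() == expected.strip(),
        "schema_valid": schema_ok,
    }

# LLM-as-judge: a rubric rendered into a judge prompt. The judge call
# itself is left abstract; any chat-completion client would slot in here.
RUBRIC = """Score the answer 1-5 on each criterion:
1. Faithfulness to the cited control text
2. Completeness of the risk description
Answer: {answer}"""

def build_judge_prompt(answer: str) -> str:
    """Render the rubric into a prompt for an LLM judge."""
    return RUBRIC.format(answer=answer)
```

The practical rule of thumb this role revolves around: reach for deterministic checks whenever the output is structured enough to permit them, and reserve LLM-as-judge (with a fixed rubric and variance controls) for genuinely open-ended outputs.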

Requirements

  • 3 years of hands-on experience building and shipping ML/AI systems in production (we care more about what you've shipped than years on a CV)
  • Have shipped an LLM evaluation or prompt optimization pipeline: not just used LLMs in a project, but owned the loop
  • Strong hands-on experience with LLM-as-judge, including its variance problems and concrete techniques for controlling them
  • Solid foundation in classical NLP and MLOps: Embeddings, semantic similarity, entity matching, classification, fuzzy matching
  • Informed opinions, built from experience, on deterministic vs. LLM-based evals
  • Production judgment: You've owned cost and latency tradeoffs, observability, and incident response for an LLM-powered feature. You're familiar with prompt regression and have strategies for managing it
  • Strong Python
  • Excellent English communication, written and verbal: We discuss nuanced technical tradeoffs daily with the founding team and customers
  • Comfort with ambiguity: You can run experiments on real data, build intuition for this domain, and know when to stop iterating

Nice to have

  • Hands-on experience with LLM observability and eval tooling (Langfuse, LangSmith, Phoenix/Arize, Helicone, Braintrust, W&B)
  • Experience with DSPy or similar prompt optimization frameworks, and opinions on where they do and don't work
  • Experience with Azure OpenAI in EU regions, or with EU-sovereign providers (Mistral, Aleph Alpha)
  • Exposure to guardrails, content safety, or AI governance
  • Exposure to enterprise software, ideally GRC, compliance, audit, or regulated industries
  • Familiarity with Java/Spring Boot or Kubernetes on Azure, enough to integrate cleanly
  • German

Benefits

  • Hands-on ownership of a real AI product used by enterprise customers
  • Work directly alongside the founding team from day one
  • Hybrid work model: Munich North, minimum one day per week in the office, otherwise flexible (open to strong candidates elsewhere in the EU for the right fit); onboarding will take place in the office
  • A steep learning curve at the intersection of LLM engineering, enterprise GRC, and startup operations
  • The chance to shape the AI team as we grow

Auxilius.ai is building AI-powered GRC solutions for enterprises. We're early-stage, fast-growing, and backed by real customers. Our tech stack includes Java & Spring Boot, Angular, Kubernetes on Azure, and OpenAI & Anthropic LLMs.


About Company


Make Governance, Risk and Compliance invisible.
