The role
You'll design, build, and productionize AI features end-to-end (model selection/fine-tuning, retrieval pipelines, evaluations, and the serving layer), working closely with product and platform teams. If you enjoy turning ambiguous problems into shipping systems, you'll feel at home here.
What you'll do
Own AI feature delivery from prototype to production (design, implement, evaluate, iterate).
Build RAG pipelines (chunking, embeddings, vector stores), prompt/program orchestration, and guardrails.
Fine-tune and/or distill models (open or closed source) for classification, generation, and tool use.
Implement robust offline and online evals (unit evals, golden sets, regression tests, user-feedback loops).
Ship reliable services: APIs, workers, model servers, and monitoring/observability (latency, cost, quality).
Partner with product/design to shape problem statements, success metrics, and experiment plans.
Champion engineering best practices (reviews, testing, docs, incident learnings).
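To make the RAG bullet above concrete, here is a minimal sketch of the chunk/embed/retrieve loop. The `embed` function is a toy hashing-trick stand-in, not a production encoder, and all names (`chunk`, `embed`, `retrieve`) are illustrative, not part of any internal library; in practice you would swap in a real embedding model and a vector store.

```python
import hashlib
import math

def embed(text: str, dim: int = 64) -> list[float]:
    """Toy hashing-trick embedding; replace with a real model in production."""
    vec = [0.0] * dim
    for tok in text.lower().split():
        idx = int(hashlib.md5(tok.encode()).hexdigest(), 16) % dim
        vec[idx] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def chunk(doc: str, size: int = 40) -> list[str]:
    """Naive fixed-size word chunking; real pipelines often use overlap."""
    words = doc.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the top-k chunks by cosine similarity to the query."""
    q = embed(query)
    scored = sorted(chunks, key=lambda c: -sum(a * b for a, b in zip(q, embed(c))))
    return scored[:k]
```

The same three-stage shape (chunk, embed, retrieve) carries over when the toy pieces are replaced by a managed embedding API and pgvector/Milvus.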
Requirements
Tech you might use here
Languages: Python, TypeScript/Node.
AI/ML: PyTorch, Hugging Face, OpenAI/Anthropic/other LLM APIs, vLLM/TensorRT-LLM, LangChain/LlamaIndex (used pragmatically).
Data & retrieval: Postgres, Redis, Milvus/pgvector/Weaviate, Kafka.
Infra: Docker, Kubernetes, CI/CD, Grafana/Prometheus, cloud (AWS/GCP).
Quality: prompt/unit tests, offline eval harnesses, canary analysis, A/B testing.
We're looking for
3–7 years of software engineering experience, with 1–3 of those in applied ML/LLM or search/retrieval.
Strong Python engineering (typing, testing, packaging) and service design (APIs, queues, retries, idempotency).
Hands-on with at least two of: RAG in production, fine-tuning (LoRA/QLoRA), embeddings/ANN search (Annoy/HNSW), function/tool calling, or model serving at scale.
A practical evaluation mindset: creating golden datasets and designing metrics (accuracy, faithfulness, toxicity, latency, cost).
Product sense and ownership: you measure impact, not just model scores.
Clear communication and collaborative habits (PRs, design docs, incident notes).
Nice to have
Experience with multi-tenant architectures, RBAC/ABAC, and data governance.
Safety & reliability work (red-teaming, jailbreak defenses, PII handling).
Frontend familiarity (React) to iterate quickly on UX for AI features.
Prior startup experience or 0-to-1 product building.
What success looks like (first 90 days)
Ship a scoped AI feature into customer hands, with an eval harness and dashboards.
Reduce either the latency or the cost of an existing pipeline by 20–30% without quality loss.
Add at least one reusable internal component (chunker, ranker, guardrail, eval set).
Interview process
Intro chat (30 min): role fit & expectations.
Technical deep-dive (60 min): systems and ML/LLM problem solving.
Practical exercise (take-home or pairing, 3–4 hrs): build a small RAG/eval pipeline.
Final loop (60–90 min): product & culture, past work, offer Q&A.
Example exercise (high-level brief)
Build a minimal retrieval-augmented QA service over a small doc set. Include: a chunking strategy, an embedding store, answer generation, and an eval set (10–20 Q/A pairs) with simple metrics (EM/F1/faithfulness). Provide a short README with trade-offs and cost/latency numbers.
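For orientation, the EM and token-level F1 metrics the exercise asks for can be sketched roughly as below. This is a simplified version (lowercasing and whitespace splitting only); standard QA evals also strip articles and punctuation before comparing, and the function names here are illustrative.

```python
from collections import Counter

def exact_match(pred: str, gold: str) -> int:
    """1 if the normalized prediction equals the gold answer, else 0."""
    return int(pred.strip().lower() == gold.strip().lower())

def token_f1(pred: str, gold: str) -> float:
    """Token-overlap F1 between a predicted and a gold answer."""
    pred_tokens = pred.lower().split()
    gold_tokens = gold.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Averaging these two scores over the 10–20 Q/A pairs gives the headline numbers for the README; faithfulness typically needs a separate judge (human or LLM) rather than string overlap.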