About Craft & Chaps
At Craft, we rethink productivity from first principles. Our products disappear into the background so people can do their life's work fast, joyfully, and without friction.
Chaps is our new AI-first product focused on turning a constellation of large-language-model agents into a seamless personal productivity assistant.
About the role
Our AI Product team is looking for an engineer who obsesses over making multi-agent systems robust, observable, and continuously improving. You'll build the test harnesses, evaluation pipelines, and monitoring layers that keep dozens of collaborating agents on-task, on-budget, and on-time.
In practice, that means:
Designing automated evals that exercise complete agent workflows, catching regressions before they reach users (a minimal sketch follows this list).
Instrumenting every prompt, tool call, and model hop with rich telemetry so we can trace root causes in minutes, not days.
Creating feedback loops that turn logs, user ratings, and synthetic tests into better prompts and safer behaviors.
Future-proofing agentic systems so their quality evolves alongside LLM intelligence.
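To give a flavor of that first bullet, here's a minimal sketch of an eval gate in Python. The `run_agent` stub, the inline golden cases, and the substring scorer are illustrative stand-ins, not our actual harness:

```python
# Minimal sketch of an automated eval gate (illustrative only):
# run_agent and the golden cases are hypothetical stand-ins.
import statistics

GOLDEN_CASES = [
    {"task": "What is 2 + 2?", "expected": "4"},
    {"task": "Name the capital of France.", "expected": "paris"},
]

def run_agent(task: str) -> str:
    # Placeholder for the real multi-agent workflow under test.
    return "4 is the answer" if "2 + 2" in task else "Paris"

def score(answer: str, expected: str) -> float:
    # Crude substring match; production evals would use graded rubrics.
    return 1.0 if expected.lower() in answer.lower() else 0.0

def eval_gate(threshold: float = 0.9) -> bool:
    # Fail the run when the pass rate drops below the threshold.
    scores = [score(run_agent(c["task"]), c["expected"]) for c in GOLDEN_CASES]
    rate = statistics.mean(scores)
    print(f"eval pass rate: {rate:.2%} over {len(scores)} cases")
    return rate >= threshold

if __name__ == "__main__":
    raise SystemExit(0 if eval_gate() else 1)
```

Wired into CI, a non-zero exit from a script like this blocks a release until the regression is understood.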
You will partner with product, research, and infra to ship an AI assistant users can trust: no surprises, no downtime.
What we're looking for
You must have:
Hands-on experience with LLM evaluation frameworks (e.g., OpenAI Evals, LangSmith, LLM-Harness) and a track record of turning eval results into product-ready gating.
Observability chops: you've wired up tracing/metrics for distributed systems (OpenTelemetry, Prometheus, Grafana) and know how to set SLOs that actually matter (a tracing sketch follows this list).
Prompt-engineering fluency (few-shot prompting, function calling, RAG orchestration) and an instinct for spotting ambiguity or jailbreak vectors.
Production-grade Python/TypeScript skills and comfort shipping through CI/CD (GitHub Actions, Terraform, Docker/K8s).
A bias for experimentation: you automate A/B tests, cost/latency trade-off studies, and rollback safeguards as part of the dev cycle.
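As a hedged illustration of that observability bar, the sketch below traces a single agent step with OpenTelemetry's console exporter; the tracer name, span attributes, and stubbed `call_tool` are assumptions for the example, not our production setup:

```python
# Illustrative OpenTelemetry tracing for one agent step.
# Assumes `pip install opentelemetry-sdk`; names below are examples only.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Print finished spans to stdout; production would ship them to a collector.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("chaps.agents")  # hypothetical tracer name

def call_tool(name: str, payload: dict) -> dict:
    # Stub standing in for a real tool or model call.
    return {"ok": True, "tool": name, "payload": payload}

def agent_step(task: str) -> dict:
    # One span per step, tagged so a bad hop can be root-caused quickly.
    with tracer.start_as_current_span("agent.step") as span:
        span.set_attribute("agent.task", task)
        result = call_tool("search", {"q": task})
        span.set_attribute("tool.name", result["tool"])
        span.set_attribute("tool.ok", result["ok"])
        return result

if __name__ == "__main__":
    agent_step("summarize today's inbox")
```

The same span attributes become the dimensions behind the Prometheus metrics and Grafana dashboards where SLOs get enforced.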
It would be great if you have:
Experience scaling multi-agent planners or tool-using agents in real products.
Familiarity with vector databases, semantic diff tooling, or RLHF/RLAIF pipelines.
A knack for weaving human feedback (support tickets, thumbs-downs) into automated regression tests (see the sketch after this list).
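For that last point, here is a minimal sketch of promoting a thumbs-down into a regression case; the feedback dict shape and the regressions.json path are illustrative assumptions:

```python
# Sketch: promote a thumbs-down into the regression suite (illustrative).
# The feedback dict shape and the regressions.json path are assumptions.
import json
from pathlib import Path

SUITE = Path("regressions.json")

def add_regression_case(feedback: dict) -> None:
    # Store the failed interaction so every future eval run re-checks it.
    case = {
        "task": feedback["prompt"],
        "bad_answer": feedback["answer"],
        "source": "thumbs_down",
    }
    suite = json.loads(SUITE.read_text()) if SUITE.exists() else []
    suite.append(case)
    SUITE.write_text(json.dumps(suite, indent=2))

if __name__ == "__main__":
    add_regression_case({"prompt": "Summarize my inbox", "answer": "(bad reply)"})
```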
Our Culture
Think differently. We value novel ideas over legacy playbooks, and we give you room to explore.
People first. You instrument systems so users never feel the bumps; you collaborate so teammates never feel stuck.
Pragmatic craftsmanship. We ship fast, but we measure twice: data accuracy, latency budgets, and reliability all matter.
Clear communicators. You translate metrics into stories that product managers and designers understand, sparking better decisions.
Join us if you want to make AI that works: every request, every time.
Full-Time