
AI Agent Reliability Engineer - Chaps

Job Location

London - UK

Monthly Salary

Not Disclosed

Vacancy

1 Vacancy

Job Description

About Craft & Chaps

At Craft, we rethink productivity from first principles. Our products disappear into the background so people can do their life's work: fast, joyfully, and without friction.

Chaps is our new AI-first product focused on turning a constellation of large-language-model agents into a seamless personal productivity assistant.


About the role

Our AI Product team is looking for an engineer who obsesses over making multi-agent systems robust, observable, and continuously improving. You'll build the test harnesses, evaluation pipelines, and monitoring layers that keep dozens of collaborating agents on-task, on-budget, and on-time.

In practice that means:

  • Designing automated evals that exercise complete agent workflows, catching regressions before they reach users.

  • Instrumenting every prompt, tool call, and model hop with rich telemetry so we can trace root causes in minutes, not days.

  • Creating feedback loops that turn logs, user ratings, and synthetic tests into better prompts and safer behaviors.

  • Future-proofing agentic systems by allowing quality to evolve with LLM intelligence.

You will partner with product, research, and infra to ship an AI assistant users can trust: no surprises, no downtime.
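The eval-and-gating workflow described above can be sketched in miniature. Everything here is illustrative: `run_agent`, the canned tasks, and the predicate-based checks are hypothetical stand-ins, not part of any real Chaps codebase.

```python
# Hypothetical stand-in for the deployed agent; a real harness would
# invoke the live multi-agent stack instead of returning canned output.
def run_agent(task: str) -> str:
    canned = {
        "summarize meeting notes": "summary: 3 action items",
        "schedule a follow-up": "event created for tomorrow 10:00",
    }
    return canned.get(task, "unknown task")

# Each eval case pairs a task with a predicate over the agent's output,
# so checks can be semantic rather than brittle exact-match comparisons.
EVAL_CASES = [
    ("summarize meeting notes", lambda out: "action items" in out),
    ("schedule a follow-up", lambda out: "event created" in out),
]

def run_evals() -> dict:
    """Exercise each workflow end-to-end and collect pass/fail results."""
    results = {"passed": 0, "failed": []}
    for task, check in EVAL_CASES:
        output = run_agent(task)
        if check(output):
            results["passed"] += 1
        else:
            results["failed"].append(task)
    return results

report = run_evals()
# Gate the release: any failing case blocks the deploy.
release_ok = not report["failed"]
```

In a CI pipeline, `release_ok` would be the signal that turns eval results into "product-ready gating": the deploy step runs only when every workflow-level check passes.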

What we're looking for

You must have:

  • Hands-on experience with LLM evaluation frameworks (e.g. OpenAI Evals, LangSmith, LLM-Harness) and a track record of turning eval results into product-ready gating.

  • Observability chops: you've wired up tracing/metrics for distributed systems (OpenTelemetry, Prometheus, Grafana) and know how to set SLOs that actually matter.

  • Prompt-engineering fluency (few-shot prompting, function calling, RAG orchestration) and an instinct for spotting ambiguity or jailbreak vectors.

  • Production-grade Python/TypeScript skills and comfort shipping through CI/CD (GitHub Actions, Terraform, Docker/K8s).

  • A bias for experimentation: you automate A/B tests, cost/latency trade-off studies, and rollback safeguards as part of the dev cycle.
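The instrumentation side of the role, tracing every prompt and tool call, can be sketched with a minimal decorator. This is a toy illustration using only the standard library; a production system would emit OpenTelemetry spans to a collector rather than append to an in-process list, and `search_tool`/`plan_step` are invented example steps.

```python
import functools
import time

# In-process stand-in for a telemetry backend (illustrative only).
TRACE_LOG: list[dict] = []

def traced(step_name: str):
    """Record latency and outcome for each agent step.

    A real deployment would create an OpenTelemetry span here instead
    of appending to TRACE_LOG.
    """
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                result = fn(*args, **kwargs)
                status = "ok"
                return result
            except Exception:
                status = "error"
                raise
            finally:
                # Runs on both success and failure, so errors are traced too.
                TRACE_LOG.append({
                    "step": step_name,
                    "latency_ms": (time.perf_counter() - start) * 1000,
                    "status": status,
                })
        return wrapper
    return decorator

@traced("tool:search")
def search_tool(query: str) -> str:
    return f"results for {query}"

@traced("model:plan")
def plan_step(goal: str) -> str:
    # Nested call: the tool's trace entry is recorded inside the plan step.
    return f"plan for {goal} using {search_tool(goal)}"

plan_step("book a meeting room")
```

Because the inner tool call finishes first, its entry lands in the log before the enclosing plan step, which is exactly the parent/child ordering a tracing backend uses to reconstruct a root-cause timeline.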

It would be great if you have:

  • Experience scaling multi-agent planners or tool-using agents in real products.

  • Familiarity with vector databases, semantic diff tooling, or RLHF/RLAIF pipelines.

  • A knack for weaving human feedback (support tickets, thumbs-downs) into automated regression tests.

Our Culture

  • Think differently. We value novel ideas over legacy playbooks, and we give you room to explore.

  • People first. You instrument systems so users never feel the bumps; you collaborate so teammates never feel stuck.

  • Pragmatic craftsmanship. We ship fast but we measure twice: data accuracy, latency budgets, and reliability all matter.

  • Clear communicators. You translate metrics into stories that product managers and designers understand, sparking better decisions.

Join us if you want to make AI that works: every request, every time.


Required Experience:

Unclear Seniority

Employment Type

Full-Time
