About Us
Fieldguide is establishing a new state of trust for global commerce and capital markets by automating and streamlining the work of assurance and audit practitioners, specifically within cybersecurity, privacy, and financial audit. Put simply, we build software for the people who enable trust between businesses.
We're based in San Francisco, CA, but built as a remote-first company that enables you to do your best work from anywhere. We're backed by top investors including Growth Equity at Goldman Sachs Alternatives, Bessemer Venture Partners, 8VC, Floodgate, Y Combinator, DNX Ventures, Global Founders Capital, Justin Kan, Elad Gil, and more.
We value diversity in backgrounds and experiences. We need people from all backgrounds and walks of life to help build the future of audit and advisory. Fieldguide's team is inclusive, driven, humble, and supportive. We are deliberate and self-reflective about the kind of team and culture we are building, seeking teammates who are not only strong in their own aptitudes but also care deeply about supporting each other's growth.
As an early-stage startup employee, you'll have the opportunity to build the future of business trust. We make audit practitioners' lives easier by streamlining up to 50% of their work, giving them better work-life balance. If you share our values and enthusiasm for building a great culture and product, you will find a home at Fieldguide.
About the Role
Fieldguide is building AI agents for the most complex audit and advisory workflows. We're a San Francisco-based Vertical AI company building in a $100B market undergoing rapid transformation. Over 50 of the top 100 accounting and consulting firms trust us to power their most mission-critical work. We're backed by Bessemer Venture Partners, 8VC, Floodgate, Y Combinator, Elad Gil, and other top-tier investors.
As an AI Engineer (Quality), you will own the evaluation infrastructure that ensures our AI agents perform reliably at enterprise scale. This role is 100% focused on making evaluations a first-class engineering capability: building the unified platform, automated pipelines, and production feedback loops that let us evaluate any new model against all critical workflows within hours. You'll work at the intersection of ML engineering, observability, and quality assurance to ensure our agents meet the rigorous standards our customers demand.
We're hiring across all levels. We'll calibrate seniority during interviews based on your background and what you're looking to own. This role is for engineers who value in-person collaboration at our San Francisco, CA office.
What You'll Own
Measurable AI Agents
Design and build a unified evaluation platform that serves as the single source of truth for all of our agentic systems and audit workflows
Build observability systems that surface agent behavior, trace execution, and failure modes in production, along with feedback loops that turn production failures into first-class evaluation cases
Own the evaluation infrastructure stack, including integration with LangSmith and LangGraph (see the sketch after this list)
Translate customer problems into concrete agent behaviors and workflows
Integrate and orchestrate LLMs, tools, retrieval systems, and logic into cohesive, reliable agent experiences
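To make the LangSmith integration concrete, here is a minimal, illustrative sketch of scoring an agent workflow against a curated dataset. The dataset name, the run_agent stub, and the citation check are assumptions for illustration, not our actual implementation.

    # A minimal sketch, assuming the langsmith Python SDK's evaluate() helper;
    # the dataset name, agent stub, and scoring check are illustrative only.
    from langsmith.evaluation import evaluate

    def run_agent(inputs: dict) -> dict:
        # Placeholder for the real LangGraph agent invocation.
        return {"answer": "..."}

    def citation_present(run, example) -> dict:
        # Toy check: did the agent surface the evidence the SME expects?
        expected = (example.outputs or {}).get("expected_citation", "")
        answer = (run.outputs or {}).get("answer", "")
        return {"key": "citation_present", "score": float(bool(expected) and expected in answer)}

    evaluate(
        run_agent,
        data="audit-walkthrough-cases",       # hypothetical dataset name
        evaluators=[citation_present],
        experiment_prefix="walkthrough-eval",
    )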
Rapid Model Evaluation
Build automated pipelines that evaluate new models against all critical workflows within hours of release
Design evaluation harnesses for our most complex agentic systems and workflows
Implement comparison frameworks that measure effectiveness, consistency, latency, and cost across model versions (see the sketch after this list)
Design guardrails and monitoring systems that catch quality regressions before they reach customers
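As a rough illustration of the comparison framework above, the sketch below runs the same evaluation cases against several candidate models and reports pass rate, run-to-run consistency, latency, and cost. run_workflow() is a hypothetical stub standing in for the real agent invocation and grading logic.

    # Illustrative harness: compare candidate models on effectiveness,
    # consistency, latency, and cost. run_workflow() is a hypothetical stub.
    import statistics
    import time
    from dataclasses import dataclass

    @dataclass
    class CaseResult:
        passed: bool
        latency_s: float
        cost_usd: float

    def run_workflow(model: str, case: dict) -> CaseResult:
        start = time.perf_counter()
        output = f"stub output for {case['id']}"   # real agent call goes here
        return CaseResult(
            passed=case["expected"] in output,     # real grading goes here
            latency_s=time.perf_counter() - start,
            cost_usd=0.0,                          # fill in from token usage
        )

    def compare(models: list[str], cases: list[dict], runs_per_case: int = 3) -> dict:
        report = {}
        for model in models:
            per_case = {c["id"]: [run_workflow(model, c) for _ in range(runs_per_case)]
                        for c in cases}
            results = [r for runs in per_case.values() for r in runs]
            report[model] = {
                "pass_rate": sum(r.passed for r in results) / len(results),
                # Consistency: share of cases whose repeated runs agree on pass/fail.
                "consistency": sum(len({r.passed for r in runs}) == 1
                                   for runs in per_case.values()) / len(per_case),
                "p50_latency_s": statistics.median(r.latency_s for r in results),
                "total_cost_usd": sum(r.cost_usd for r in results),
            }
        return report

In practice the stub would call the production agent, and pass/fail would come from the evaluation harnesses described above rather than a string match.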
AI-Native Engineering Execution
Use AI as core leverage in how you design, build, test, and iterate
Prototype quickly to resolve uncertainty, then harden systems for enterprise-grade reliability
Build evaluations, feedback mechanisms, and guardrails so agents improve over time
Work with SMEs and ML engineers to create evaluation datasets by curating production traces (see the sketch after this list)
Design prompts, retrieval pipelines, and agent orchestration systems that perform reliably at scale
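One possible shape for the trace-curation step mentioned above: filter production traces down to likely failures and export just the fields an SME needs in order to label expected outputs. The JSONL format, field names, and failure heuristic are illustrative assumptions.

    # Illustrative sketch: promote failed production traces into draft eval cases.
    import json
    from pathlib import Path

    def curate_eval_cases(trace_file: Path, out_file: Path) -> int:
        """Read JSONL traces, keep likely failures, and write SME-reviewable cases."""
        cases = []
        for line in trace_file.read_text().splitlines():
            trace = json.loads(line)
            # Failure heuristic (assumed fields): hard errors or low user ratings.
            if trace.get("error") or trace.get("user_rating", 5) <= 2:
                cases.append({
                    "trace_id": trace["id"],
                    "inputs": trace["inputs"],
                    "agent_output": trace["output"],
                    "expected_output": None,   # to be filled in by an SME
                })
        out_file.write_text("\n".join(json.dumps(c) for c in cases))
        return len(cases)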
Ownership of Quality and Large Product Areas
Define and document evaluation standards, best practices, and processes for the engineering organization
Advocate for evaluation-driven development and make it easy for the team to write and run evals
Partner with product and ML engineers to integrate evaluation requirements into agent development from day one
Take full ownership of large product areas rather than executing on narrow tasks
Who You Are
You are an engineer who believes that evaluations are foundational to building reliable AI systems, not a nice-to-have. The following operating principles should resonate with you:
Evaluation-first mindset: You understand that, for an AI company, not being able to evaluate a new model quickly is unacceptable
AI-native instincts: You treat LLMs, agents, and automation as fundamental building blocks and part of the craft of engineering
Data-driven rigor: You make decisions based on metrics and are obsessed with measuring what matters
Production-oriented: You understand that evaluations must work on real production behavior, not just offline datasets
Strong product judgment: You can decide what matters and why, not just how to implement it, without waiting for guidance
Bias to building: You move fast and build working systems rather than perfect specifications
Experience
We care more about capability and trajectory than years on a resume, but most strong candidates will have:
Multiple years of experience shipping production software in complex real-world systems
Experience with TypeScript, React, Python, and Postgres
Built and deployed LLM-powered features serving production traffic
Implemented evaluation frameworks for model outputs and agent behaviors
Designed observability or tracing infrastructure for AI/ML systems
Worked with vector databases, embedding models, and RAG architectures
Experience with evaluation platforms (LangSmith, Langfuse, or similar)
Comfort operating in ambiguity and taking responsibility for outcomes
Deep empathy for professional-grade, mission-critical software (experience with audit and accounting workflows is not required)
What Should Excite You
Agent reliability at enterprise scale: Building systems that professionals depend on
Balancing automation with human oversight: Knowing when to automate and when to surface decisions to experts
Production feedback loops: Turning real-world agent failures into systematic improvements
Explaining AI decisions: Making all forms of AI outputs and agent reasoning transparent and trustworthy
Evaluation for nuanced domains: Structuring data and feedback for workflows where ground truth requires expert judgment
High-impact visibility: Your work directly enables leadership to confidently communicate AI quality to the board and customers
Required Experience: IC