AI Evaluation Scientist
McLean, MD - USA
Job Summary
Overview
We are looking for an AI Evaluation Scientist to design and execute evaluation processes that ensure our predictive and generative AI systems are accurate, reliable, safe, and aligned with mission requirements. This role is essential for establishing trust in AI solutions and supporting continuous improvement across the AI lifecycle. The AI Evaluation Scientist will work closely with engineers, data scientists, governance analysts, and product teams to develop evaluation metrics, build test harnesses, analyze model behavior, and support responsible deployment.
Contributions
- Implement evaluation frameworks for AI models, including accuracy, robustness, relevance, bias, hallucination rate, and safety metrics.
- Build and maintain automated evaluation scripts, tests, and pipelines that assess AI model outputs and detect performance drift over time.
- Develop benchmark datasets, challenge sets, and scenario-based test cases tailored to mission and user needs.
- Perform structured error analysis and behavioral audits of LLMs, retrieval-augmented generation (RAG) systems, and predictive models, documenting findings and improvement recommendations.
- Collaborate with AI Developers, LLMOps Engineers, and Data Scientists to support iterative experimentation, model hardening, and quality improvements.
- Contribute to the design of human-in-the-loop evaluation workflows, integrating qualitative and quantitative insight into evaluation reports.
- Assist in mapping evaluation outcomes to responsible AI principles such as fairness, transparency, reliability, and safety.
- Partner with AI Governance Analysts to ensure evaluation outputs support compliance documentation and risk assessments.
- Stay current with emerging evaluation tools, frameworks, metrics, and research related to LLM assessment and generative AI reliability.
- Document evaluation processes, criteria, and results for both technical and non-technical audiences.
- You will contribute to the growth of our AI & Data Exploitation Practice!
Qualifications
- Ability to hold a position of public trust with the U.S. government.
- Bachelor's or Master's degree in Computer Science, Statistics, Machine Learning, Cognitive Science, Human-Computer Interaction, Data Science, or a related field.
- 2 years of experience evaluating machine learning models, NLP systems, or generative AI models (LLMs preferred).
- Familiarity with evaluation metrics, statistical testing, dataset creation, and experimental design for AI systems.
- Proficiency in Python and relevant libraries such as PyTorch, Hugging Face, scikit-learn, and LangChain.
- Proficiency in AI evaluation frameworks such as Ragas.
- Experience analyzing structured and unstructured data including text documents and embeddings.
- Understanding of LLM behavior, prompt evaluation, retrieval pipelines, or RAG architectures.
- Exposure to responsible AI concepts and governance-aligned evaluation criteria (e.g., fairness, transparency, reliability).
- Strong analytical skills, with the ability to interpret model weaknesses, extract insights, and recommend actionable improvements.
- Excellent written and verbal communication skills, with the ability to present evaluation findings clearly to technical and non-technical stakeholders.
- Experience working in agile or iterative development environments is a plus.
- Familiarity with OWASP LLM Top 10 Risks.
- NIH experience.
- Relevant certifications (helpful but not required):
  - NIST AI RMF (AISIC)
  - INFORMS CAP
  - AWS/Azure/Google ML Certifications.
- Local to Washington DC metro area preferred.
About Steampunk
Steampunk relies on several factors to determine salary, including but not limited to geographic location, contractual requirements, education, knowledge, skills, competencies, and experience. The projected compensation range for this position is $105,000 to $145,000. The estimate displayed represents a typical annual salary range for this position. Annual salary is just one aspect of Steampunk's total compensation package for employees. Learn more about additional Steampunk benefits here.
Identity Statement
As part of the application process, you are expected to be on camera during interviews and assessments. We reserve the right to take your picture to verify your identity and prevent fraud.
Steampunk is a Change Agent in the Federal contracting industry, bringing new thinking to clients in the Homeland, Federal Civilian, Health, and DoD sectors. Through our Human-Centered delivery methodology, we are fundamentally changing the expectations our Federal clients have for true shared accountability in solving their toughest mission challenges. As an employee-owned company, we focus on investing in our employees to enable them to do the greatest work of their careers and rewarding them for outstanding contributions to our growth. If you want to learn more about our story visit .
Required Experience:
IC
About Company
Federal government clients at the center of everything we design, develop, and deliver to drive game-changing mission impacts.