In this role you will:
- Design and analyze human evaluations of AI systems to create reliable annotation frameworks and ensure the validity and reliability of measurements of latent constructs
- Develop and refine benchmarks and evaluation protocols using statistical modeling, test theory, and task design to capture model performance across diverse contexts and user needs
- Conduct statistical analysis of evaluation data to extract meaningful insights, identify systematic issues, and inform improvements to both models and evaluation processes
- Analyze model behavior, identify weaknesses, and drive design decisions through failure analysis. Examples include, but are not limited to: model experimentation, adversarial testing, counterfactual analysis, and creating tools to assess model behavior and user impact
- Collaborate with engineers to translate evaluation methods and analysis techniques into scalable, adaptable, and reliable solutions that can be reused across different features, use cases, and evaluation workflows
- Work cross-functionally with designers, clinical experts, and engineering teams across Hardware and Software to apply methods to real-world applications
- Independently run and analyze experiments to drive real improvements
- BS and a minimum of 10 years of relevant industry experience in an empirical field with an emphasis on quantitative methodologies of human behavior, including HCI, Psychometrics, Quantitative or Experimental Psychology, Educational Measurement, Language Assessment, or a related field
- Proficiency in Python and the ability to write clean, performant code and collaborate using standard software development practices (e.g., Git)
- Strong statistical analysis skills and experience in crafting experiments and validating data quality and model performance
- Experience in building and extending data and inference pipelines to process large-scale datasets
- MS or PhD or equivalent experience in relevant fields
- Real-world experience with LLM-based evaluation systems and with human annotation and evaluation methodologies
- Experience in rigorous, evidence-based approaches to test development, e.g., quantitative and qualitative test design, and reliability and validity analysis
- Customer-focused mindset with experience or strong interest in building consumer digital health and wellness products
- Strong communication skills and the ability to work cross-functionally with technical and non-technical stakeholders