About Centific
Centific is a frontier AI data foundry that curates diverse, high-quality data using our purpose-built technology platforms to empower the Magnificent Seven and our enterprise clients with safe, scalable AI deployment. Our team includes more than 150 PhDs and data scientists, along with more than 4,000 AI practitioners and engineers. We harness the power of an integrated solution ecosystem, comprising industry-leading partnerships and 1.8 million vertical domain experts in more than 230 markets, to create contextual, multilingual pre-trained datasets; fine-tuned, industry-specific LLMs; and RAG pipelines supported by vector databases. Our zero-distance innovation solutions for GenAI can reduce costs by up to 80% and bring solutions to market 50% faster.
Our mission is to bridge the gap between AI creators and industry leaders by bringing best practices in GenAI to unicorn innovators and enterprise customers. We aim to help these organizations unlock significant business value by deploying GenAI at scale, ensuring they stay at the forefront of technological advancement and maintain a competitive edge in their respective markets.
About the Job
Key Responsibilities
- Evaluation Framework Development: Design and validate comprehensive evaluation frameworks for LLM and multimodal systems, including benchmark and task design, automated scoring methods, model-assisted evaluation, human annotation protocols, and robustness testing across document types and modalities.
- Multimodal Benchmark Research: Lead research into multimodal evaluation covering document understanding, table QA, image-text reasoning, and OCR-grounded extraction tasks. Develop benchmarks that measure model performance on structured and unstructured data sources representative of real enterprise workloads.
- Fine-Tuning & Post-Training Experiments: Design and execute supervised fine-tuning (SFT) and preference optimization experiments to improve model performance on targeted tasks. Analyze how training objectives, dataset composition, and evaluation design interact to drive measurable model improvement.
- RAG & Agentic System Evaluation: Develop evaluation protocols for retrieval-augmented generation (RAG) systems and agentic LLM pipelines, assessing retrieval quality, answer relevance, citation grounding, and multi-step reasoning fidelity across production-grade workflows.
- Data Pipeline & Annotation Design: Architect data collection, annotation schema design, and quality control workflows for training and evaluation corpora. Define annotation guidelines, inter-rater agreement criteria, and adjudication procedures; build tooling to support annotator interfaces and real-time metric monitoring.
- Model Behavior Analysis: Analyze model failure patterns across tasks and domains; generate actionable recommendations for evaluation redesign and fine-tuning strategy. Translate findings into practical improvements for customer solutions and Centific's internal platforms.
- Cloud-Native Evaluation Infrastructure: Collaborate with ML engineers to build scalable, containerized evaluation and fine-tuning pipelines on cloud platforms (AWS, GCP, or Azure). Integrate monitoring, logging, and experiment tracking to support reproducible research workflows.
- Cross-Functional Collaboration: Partner with Language Data Scientists, ML engineers, and product teams to integrate human-in-the-loop evaluation, synthetic data strategies, and automated benchmarking into platform-level pipelines.
- Customer Engagement: Engage with technical stakeholders at leading AI organizations to understand evaluation goals, review methodologies, and provide expert scientific recommendations. Serve as a credible technical peer to research and engineering leaders.
- Knowledge & IP Creation: Contribute to internal benchmark datasets, reusable evaluation frameworks, and research assets. Produce technical documentation, research reports, and client-facing materials explaining methods, results, assumptions, and limitations.
- Thought Leadership: Advance Centific's position in LLM evaluation and multimodal AI through publications, conference presentations, and open-source benchmark contributions.
Core Technical Competencies
You will provide technical depth and leadership across the following domains:
Evaluation Science & Benchmarking
- Expert-level benchmark dataset and test suite design for language and multimodal models
- Deep understanding of metric design, scoring reliability, and measurement validity
- Experience with human evaluation and quality assurance (rubric design, inter-rater reliability, adjudication)
- Familiarity with precision-recall analysis, threshold tuning, and annotation-driven quality loops
Multimodal & Document AI
- Experience with multimodal model evaluation across text, image, table, and document modalities
- Familiarity with document understanding tasks: classification, extraction, structured QA, and OCR-based pipelines
- Hands-on experience with vision-language models (VLMs), CLIP-style architectures, or transformer-based multimodal systems
LLM Systems & Post-Training
- Strong understanding of post-training techniques (SFT, DPO, preference optimization) and how they interact with evaluation outcomes
- Experience with LLM orchestration, RAG pipeline design, retrieval strategies (hybrid, vector, BM25), and guardrail validation
- Familiarity with agentic frameworks (e.g., LangChain, LangGraph) and multi-step reasoning evaluation
ML Engineering & Infrastructure
- Strong Python skills for research experimentation, data processing, evaluation pipelines, and statistical analysis
- Hands-on experience with ML frameworks (PyTorch, TensorFlow, Hugging Face) and cloud platforms (AWS, GCP, or Azure)
- Comfort with containerized deployment (Docker, Kubernetes), experiment tracking, and CI/CD for research pipelines
Quantitative Analysis & Scientific Rigor
- Strong statistical analysis skills: sampling, uncertainty quantification, significance testing, error analysis, and metric interpretation
- Ability to synthesize complex experimental findings into concise, actionable recommendations for engineering and business stakeholders
Required Qualifications
- Education: MS or PhD in Computer Science, Machine Learning, Data Science, Statistics, Applied Mathematics, AI, or a related quantitative field (PhD or strong MS research track preferred).
- Research Experience: 3 years of relevant experience in applied ML research or research science, with substantial work in LLMs, foundation models, or multimodal systems (graduate research counts).
- LLM Evaluation Expertise: Demonstrated experience with LLM evaluation, benchmarking, post-training, or model quality research.
- Multimodal Experience: Hands-on work with multimodal models or document AI systems, including tasks such as table QA, image-text reasoning, or OCR-based extraction.
- Experimental Design: Strong foundation in experimental design, statistical analysis, and scientific reasoning applied to ML systems.
- Technical Proficiency: Strong Python coding skills; experience with PyTorch, Hugging Face, or similar ML frameworks. Exposure to cloud infrastructure (AWS, GCP, or Azure) is a plus.
- Communication: Strong written and verbal communication skills; able to present nuanced technical conclusions clearly to both research and non-technical audiences.
Preferred Qualifications
- Post-Training Practice: Hands-on experience running SFT or preference optimization experiments with measurable evaluation outcomes.
- RAG & Agentic Systems: Experience building or evaluating RAG pipelines, agentic LLM orchestration layers, or multi-turn interactive systems.
- Document AI: Experience with document classification, information extraction, or structured QA over enterprise-scale document corpora.
- Cloud & Deployment: Familiarity with containerized ML deployment (Docker/ECS), experiment logging (CloudWatch, MLflow), and scalable inference infrastructure.
- Data Annotation Tooling: Experience designing annotation interfaces, real-time monitoring dashboards, or quality control tooling for ML data pipelines.
- Scientific Contribution: Publications and/or open-source benchmark contributions in LLM evaluation, multimodal AI, post-training, or related areas at top venues (NeurIPS, ICML, ICLR, ACL, EMNLP, CVPR, etc.).
- Applied Research Consulting: Experience in customer-facing applied research, technical consulting, or cross-functional product/research collaboration.
- Safety & Governance: Familiarity with safety, trustworthiness, and governance considerations in GenAI evaluation.
How to Apply
Please send your CV, a summary of key research contributions (publications, benchmarks, or open-source work), and a brief statement on your evaluation or post-training philosophy to:
Subject Line: Research Scientist - LLM Evaluation & Post-Training
Salary: $150k-$180k
Centific is an equal-opportunity employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, national origin, ancestry, citizenship status, age, mental or physical disability, medical condition, sex (including pregnancy), gender identity or expression, sexual orientation, marital status, familial status, veteran status, or any other characteristic protected by applicable law. We consider qualified applicants regardless of criminal histories, consistent with legal requirements.
Required Experience:
IC