Research Scientist – Science of Evaluation

Job Location: London, UK

Annual Salary: £65,000 - £145,000
Posted on: 15 hours ago
Vacancies: 1 Vacancy

Job Summary

About the AI Security Institute

The AI Security Institute is the world's largest and best-funded team dedicated to understanding advanced AI risks and translating that knowledge into action. We're in the heart of the UK government with direct lines to No. 10 (the Prime Minister's office), and we work with frontier developers and governments globally.

We're here because governments are critical for advanced AI going well, and UK AISI is uniquely positioned to mobilise them. With our resources, unique agility and international influence, this is the best place to shape both AI development and government action.

The deadline for applying to this role is February 22, 2026, end of day, anywhere on Earth.

About the Team

AISI's Science of Evaluation team develops rigorous techniques for measuring and forecasting AI capabilities, ensuring evaluation results are robust, meaningful and useful for governance.

Evaluations underpin both scientific understanding and policy decisions about frontier AI. Yet current methodologies are poorly equipped to surface what matters most: underlying capabilities, dangerous failure modes, forecasts of future performance and robustness across settings. We address this gap by stress-testing the claims and methods in AISI's testing reports, improving evaluation methods and building new analytical tools. Our research is problem-driven, methodologically grounded and focused on impact. We aim to improve epistemic rigour and increase confidence in the claims drawn from evaluation data.

Our approach involves:

(1) Methodological red teaming: Independently auditing evidence and claims in evaluation reports shared with model developers.

(2) Consulting partnerships: Collaborating with AISI evaluation teams to improve methodologies and practices.

(3) Targeted research bets: Pursuing foundational work that enables new insights into model capabilities.

New research agenda focus (in addition to core team responsibilities):

Frontier agents increasingly use massive inference budgets on complex long-horizon tasks. This makes measuring model horizons, estimating performance ceilings and maintaining research velocity harder and more [...] evaluation methods that remain informative as task budgets exceed 10M tokens per attempt and model horizons surpass the longest available tasks.

Role Summary

This research scientist role focuses on evaluation methods for frontier AI, with emphasis on long-horizon agents and inference-compute scaling.

You'll design and conduct experiments that extract deeper signal from evaluation data, uncovering underlying [...] with engineers and domain experts across AISI and with external partners. Researchers on this team have substantial autonomy to shape independent agendas and push the frontier of what evaluations can reveal.

Example Projects

  • Develop methods to forecast long-horizon performance under increasing inference budgets, including predictive models based on task and model characteristics
  • Design approaches that preserve observability when agents exceed available task lengths (e.g. proxy measurements, task decomposition, data acquisition strategies)
  • Support evaluation suite design for improved coverage, predictive validity and robustness
  • Engineer tools for quantitative transcript analysis to identify failure modes and capability signals

Responsibilities

  • Applied research on evaluation methodology, including new techniques and tools
  • Run and analyze evaluation results to stress-test claims, characterize model capabilities and inform policy-relevant reports
  • Track the state of the art in frontier AI evaluation research across AISI and externally, and contribute to AISI's presence at ML conferences
  • Long-horizon / inference scaling focus:
    • Design and run experiments that are more informative than end-to-end pass/fail metrics
    • Develop and engineer approaches to long-horizon task design, including automation and internal structure (checkpoints, bottlenecks, progress metrics)
    • Estimate capability upper bounds by identifying measurable bottleneck skills relevant to long-horizon performance

Person Specification

We're flexible on exact background and expect successful candidates to meet many (but not necessarily all) criteria below. Depending on experience, we'll consider candidates at Research Scientist or Senior Research Scientist level. We also welcome applications from earlier-career researchers (2-3 years of hands-on LLM experience) who demonstrate creative and rigorous empirical instincts.

Essential

  • Strong track record in applied ML, evaluation science or experimental fields with significant methodological challenges (e.g. PhD in a technical field, publications at top-tier venues or substantial real-world deployments)
  • Significant hands-on experience with LLMs and agents
  • Strong motivation for impactful work at the intersection of science, safety and governance
  • Self-directed and adaptable; comfortable with ambiguity in a growing team

Nice to Have

  • Task design and validation experience (checkpoints, verifiers, progress metrics)
  • Transcript analysis or behavioural measurement
  • Experimental design or measurement tooling from other disciplines (psychometrics, behavioural economics)

Core Logistical Requirements

  • You should be able to spend at least 4 days per week working with us
  • You should be able to join us for at least 18 months
  • You should be able to work from our office in London for parts of the week, but we provide flexibility for remote work

What We Offer

Impact you couldn't have anywhere else

  • Incredibly talented, mission-driven and supportive colleagues.
  • Direct influence on how frontier AI is governed and deployed globally.
  • Work with the Prime Minister's AI Advisor and leading AI companies.
  • Opportunity to shape the first & best-resourced public-interest research team focused on AI security.

Resources & access

  • Pre-release access to multiple frontier models and ample compute.
  • Extensive operational support so you can focus on research and ship quickly.
  • Work with experts across national security, policy, AI research and adjacent sciences.

Growth & autonomy

  • If you're talented and driven, you'll own important problems early.
  • 5 days off for learning and development, annual stipends for learning and development, and funding for conferences and external collaborations.
  • Freedom to pursue research bets without product pressure.
  • Opportunities to publish and collaborate externally.

Life & family*

  • Modern central London office (cafes, food court, gym) or, where applicable, option to work in similar government offices in Birmingham, Cardiff, Darlington, Edinburgh, Salford or Bristol.
  • Hybrid working, flexibility for occasional remote work abroad and stipends for work-from-home equipment.
  • At least 25 days annual leave, 8 public holidays, extra team-wide breaks and 3 days off for volunteering.
  • Generous paid parental leave (36 weeks of UK statutory leave shared between parents, 3 extra paid weeks, option for additional unpaid time).
  • On top of your salary, we contribute 28.97% of your base salary to your pension.
  • Discounts and benefits for cycling to work, donations and retail/gyms.

*These benefits apply to direct employees. Benefits may differ for individuals joining through other employment arrangements such as secondments.

Salary

Annual salary is benchmarked to role scope and relevant experience. Most offers land between £65,000 and £145,000, made up of a base salary plus a technical allowance (take-home salary = base + technical allowance). An additional 28.97% employer pension contribution is paid on the base salary.

This role sits outside of the DDaT pay framework, given the scope of this role requires in-depth technical expertise in frontier AI safety, robustness and advanced AI architectures.

The full range of salaries is available below:

  • Level 3: (Base £35,720 + Technical Allowance)
  • Level 4: (Base £42,495 + Technical Allowance)
  • Level 5: (Base £55,805 + Technical Allowance)
  • Level 6: (Base £68,770 + Technical Allowance)
  • Level 7: £145,000 (Base £68,770 + Technical Allowance £76,230)
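
As a worked illustration of the salary arithmetic described above, the short Python sketch below uses the Level 7 figures from the list. It assumes that the total is simply base plus technical allowance and that the 28.97% employer pension contribution applies to the base only, as the posting states. The function names and the rounding to the nearest penny are our own assumptions, not part of any official calculation.

```python
from decimal import Decimal, ROUND_HALF_UP

# Level 7 figures quoted in the posting (annual, GBP).
BASE_SALARY = Decimal("68770")
TECHNICAL_ALLOWANCE = Decimal("76230")
# Employer pension rate quoted in the posting, applied to base salary only.
PENSION_RATE = Decimal("0.2897")


def total_salary(base: Decimal, allowance: Decimal) -> Decimal:
    """Hypothetical helper: salary = base + technical allowance."""
    return base + allowance


def employer_pension(base: Decimal, rate: Decimal = PENSION_RATE) -> Decimal:
    """Hypothetical helper: employer contribution on base salary.

    Rounding to the nearest penny is an assumption made for display.
    """
    return (base * rate).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


if __name__ == "__main__":
    print(total_salary(BASE_SALARY, TECHNICAL_ALLOWANCE))  # 145000
    print(employer_pension(BASE_SALARY))  # 19922.67
```

On these assumptions, a Level 7 offer of £145,000 would carry an employer pension contribution of roughly £19,923 per year on top of salary.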

Selection Process

In accordance with the Civil Service Commission rules, the following list contains all selection criteria for the interview process.

The interview process may vary candidate to candidate; however, you should expect a typical process to include some technical proficiency tests, discussions with a cross-section of our team at AISI (including non-technical staff) and conversations with your workstream lead. The process will culminate in a conversation with members of the senior team here at AISI.

Candidates should expect to go through some or all of the following stages once an application has been submitted:

  • Initial interview
  • Technical take-home test
  • Second interview and review of take-home test
  • Final interview with members of the senior leadership team

Additional Information

Use of AI in Applications

Artificial Intelligence can be a useful tool to support your application; however, all examples and statements provided must be truthful, factually accurate and taken directly from your own experience. Where plagiarism has been identified (presenting the ideas and experiences of others, or generated by artificial intelligence, as your own), applications may be withdrawn and internal candidates may be subject to disciplinary action. Please see our candidate guidance for more information on appropriate and inappropriate use.

Internal Fraud Database

The Internal Fraud function of the Fraud, Error, Debt and Grants Function at the Cabinet Office processes details of civil servants who have been dismissed for committing internal fraud, or who would have been dismissed had they not resigned. The Cabinet Office receives the details from participating government organisations of civil servants who have been dismissed, or who would have been dismissed had they not resigned, for internal fraud. In instances such as this, civil servants are then banned for 5 years from further employment in the civil service. The Cabinet Office then processes this data and discloses a limited dataset back to DLUHC as a participating government organisation. DLUHC then carries out the pre-employment checks so as to detect instances where known fraudsters are attempting to reapply for roles in the civil service. In this way, the policy is ensured and the repetition of internal fraud is prevented. For more information, please see the Internal Fraud Register.

Security

Successful candidates must undergo a criminal record check and get baseline personnel security standard (BPSS) clearance before they can be appointed. Additionally, there is a strong preference for eligibility for counter-terrorist check (CTC) clearance. Some roles may require higher levels of clearance, and we will state this by exception in the job advertisement. See our vetting charter.

Nationality requirements

We may be able to offer roles to applicants of any nationality or background. As such, we encourage you to apply even if you do not meet the standard nationality requirements.

Diversity and Inclusion

The Civil Service is committed to attracting, retaining and investing in talent wherever it is found. To learn more, please see the Civil Service People Plan and the Civil Service Diversity and Inclusion Strategy.

Required Experience:

IC

Key Skills

  • Laboratory Experience
  • Machine Learning
  • Python
  • AI
  • Bioinformatics
  • C/C++
  • R
  • Biochemistry
  • Research Experience
  • Natural Language Processing
  • Deep Learning
  • Molecular Biology