SWE Expert

Mercor

Job Location:

Berlin - Germany

Monthly Salary: Not Disclosed

Posted on: 30+ days ago

Vacancies: 1 Vacancy

Job Summary

Role Overview

Mercor is seeking SWE Experts to support the design of evaluation-ready workflows for advanced AI systems. This engagement focuses on translating ambiguous requirements into structured, repeatable artifacts that can be tested automatically. You'll produce clearly specified deliverables (documentation, scripts) that enable consistent assessment of agent performance across scenarios. Work is contract-based, outcome-oriented, and optimized for reproducibility and clear acceptance criteria.

Key Responsibilities

Convert high-level objectives into tightly scoped, testable deliverables with clear inputs/outputs and measurable success criteria.
Create structured documentation that defines expected behavior, constraints, and edge cases in a way other evaluators can reuse.
Build lightweight automation scripts to support evaluation flows (e.g., generating required artifacts, validating outputs, enforcing format rules).
Write deterministic Python verifier scripts that check completion via final state or output validation (files, directories, content assertions).
Design prompts/tasks that reliably elicit the target workflow behavior while avoiding leakage of internal instructions or implementation details.
Implement robust error handling and actionable failure messages in verification tooling.
Develop plausible but ineffective baseline or distractor approaches to confirm evaluation discrimination (i.e., the solution must use the intended approach).
Maintain clean artifact hygiene: versionable structure, consistent naming, minimal ambiguity, and reproducible execution.
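
To make the verifier responsibility concrete, here is a minimal sketch of a deterministic final-state check with actionable failure messages. It is an illustration only, not Mercor's tooling: the artifact path `output/report.json`, the required keys, and the function name `verify_final_state` are all hypothetical.

```python
import json
from pathlib import Path

REQUIRED_KEYS = ("status", "items")  # hypothetical schema, for illustration only


def verify_final_state(workdir):
    """Return failure messages in a fixed order; an empty list means pass."""
    report = Path(workdir) / "output" / "report.json"  # hypothetical artifact path
    if not report.is_file():
        return [f"missing file: {report.name} not found under {report.parent.name}/"]
    try:
        data = json.loads(report.read_text(encoding="utf-8"))
    except ValueError as exc:  # covers both invalid JSON and invalid UTF-8
        return [f"unreadable {report.name}: {exc}"]
    if not isinstance(data, dict):
        return [f"{report.name} must contain a JSON object, got {type(data).__name__}"]
    # Check keys in a fixed order so repeated runs produce identical messages.
    failures = [f"missing key: {key!r}" for key in REQUIRED_KEYS if key not in data]
    if "status" in data and data["status"] != "done":
        failures.append(f"status must be 'done', got {data['status']!r}")
    return failures
```

The check inspects only the final state on disk, uses no clock or randomness, and reports every failure as a message that names what was expected, so two runs over the same directory return the same result.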

Ideal Qualifications

Strong Python skills (file system operations, parsing, validation, test-style assertions, deterministic execution).
Experience with evaluation harnesses, automated grading, or QA-style verification (unit/integration test mindset).
Familiarity with prompt design and LLM evaluation methodologies (closed-ended tasks, leakage avoidance, reliability testing).
Comfort with structured specs and documentation conventions (Markdown, YAML frontmatter patterns, well-scoped requirements).
Working knowledge of common developer tooling: Git, CLI workflows, virtual environments, dependency management.
Bonus: embeddings/similarity concepts (e.g., cosine similarity) for "looks relevant but fails" negative-control design.
Ability to communicate clearly and keep scope controlled without relying on domain-specific context.
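
As a rough illustration of the cosine-similarity idea mentioned above, the sketch below scores how closely a distractor's embedding matches the task's embedding. The function names and the 0.8 threshold are assumptions made for this example; real vectors would come from whatever embedding model a project uses.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here: a zero vector matches nothing
    return dot / (norm_a * norm_b)


def looks_relevant(task_vec, distractor_vec, threshold=0.8):
    """True if the distractor is similar enough to appear credible (hypothetical threshold)."""
    return cosine_similarity(task_vec, distractor_vec) >= threshold
```

A useful negative control would pass `looks_relevant` yet still fail the task's verifier, which shows the evaluation separates surface similarity from actually using the intended approach.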

More About the Opportunity

Deliverables are primarily documentation and scripts intended to support automated evaluation and consistent replay.
Emphasis on determinism, reproducibility, closed-ended outcomes, and strong verifier reliability.
Tasks and validators should be resilient to superficial shortcuts and confirm the intended workflow is actually used.
Work can include designing negative controls (distractors) that appear credible while failing for principled reasons.
Time-sensitive elements should be explicitly date-bounded where applicable.
