The Role:
We are seeking an exceptional AI Evaluation Engineer to design, implement, and scale frameworks for assessing the performance, reliability, and trustworthiness of advanced AI systems. This individual will be responsible for developing methodologies and tools to measure model quality across diverse dimensions such as accuracy, robustness, reasoning, safety, and efficiency.
Key Responsibilities:
- Design and Develop Evaluation Frameworks: Create scalable, reproducible evaluation pipelines for large-scale AI systems, including LLMs and multi-agent architectures, covering both automated and human-in-the-loop testing strategies.
- Metric Innovation: Define and implement novel evaluation metrics that capture model capabilities beyond traditional benchmarks.
- Benchmarking & Performance Analysis: Conduct benchmarking of AI models across domains, tasks, and modalities, analyzing their skills and behavior under different setups.
- Safety, Reliability & Alignment Testing: Develop tools and experiments to probe model safety, robustness, interpretability, and bias.
- Cross-functional Collaboration: Work closely with model fine-tuning and optimization teams to evaluate end-to-end system effectiveness and efficiency. Identify trade-offs between model performance, latency, and energy footprint.
- Continuous Improvement & Reporting: Monitor model performance over time, automate regression detection, and contribute to the continuous evaluation infrastructure that supports Openchip's AI research and product roadmap.
Qualifications:
- MSc or PhD in Computer Science, Artificial Intelligence, Machine Learning, Statistics, or a related field. A publication record in ML evaluation, benchmarking, or interpretability is a plus.
- 3 years of experience developing, evaluating, or optimizing AI systems.
- Strong programming skills in Python, with experience using PyTorch, TensorFlow, or JAX.
- Experience in designing evaluation protocols for LLMs, multi-agent systems, or reinforcement learning environments.
- Deep understanding of ML metrics, evaluation methodologies, and statistical analysis.
- Experience with data quality, annotation workflows, and benchmark dataset creation is a plus.
- Fluent in English; proficiency in additional European languages (German, Dutch, Spanish, French, or Italian) is a plus.
Soft Skills:
- Analytical Rigor: An evidence-driven mindset and enthusiasm for designing robust experiments that quantify and uncover complex AI behaviors, translating empirical insights into new research directions.
- Collaboration & Communication: Excellent communication and collaboration skills in a multidisciplinary environment.
- Integrity & Responsibility: Committed to building AI systems that are not only powerful but also safe, reliable, and aligned with human values.
What We Offer:
- The opportunity to build a cloud AI deployment platform that will power next-generation AI systems.
- A collaborative, innovation-driven environment with significant autonomy and ownership.
- Hybrid work model with flexible scheduling.
- A chance to join one of Europe's most ambitious companies at the intersection of AI and silicon engineering.
- Position based in Barcelona.
We're looking for exceptional engineers ready to shape the future of AI infrastructure. If building scalable, cloud-native AI deployment platforms excites you, we'd love to meet you.
At Openchip & Software Technologies S.L., we believe a diverse and inclusive team is the key to groundbreaking ideas. We foster a work environment where everyone feels valued, respected, and empowered to reach their full potential, regardless of race, gender, ethnicity, sexual orientation, or gender identity.