Mostly AI safety and cooperative AI: how multiple agents, human or artificial, recognise each other, coordinate, and decide whether to cooperate. Below are the papers, with a short note on what each one is actually about.
Selected research
Google Scholar →
-
DarkBench: Benchmarking Dark Patterns in Large Language Models
660 adversarial prompts across six categories of manipulative behaviour, evaluated against 14 open and proprietary models, uncovering widespread dark patterns and ethical gaps in current systems.
-
Do LLMs Take Care of Their Own? Similarity Signals Can Induce Cooperation
Evidence that similarity signals between agents, alone, can be enough to induce cooperation, with consequences for how multi-agent systems are built and governed.
-
MMTEB: Massive Multilingual Text Embedding Benchmark
An expansion of MTEB to 500+ evaluation tasks across 1,000+ languages. My co-authorship came through open-source contributions that reduced computational demand and improved benchmarking efficiency.
-
Evaluating Generalization Capabilities of LLM-Based Agents in Mixed-Motive Scenarios Using Concordia
A benchmark for multi-agent alignment under conflicting objectives. It grew out of the Google DeepMind × Cooperative AI Foundation hackathon that preceded the NeurIPS Concordia Contest; our team ranked among the top teams there, which led to co-authorship.
-
Reality Check: A New Evaluation Ecosystem Is Necessary to Understand AI's Real World Effects
A new evaluation ecosystem for the second-order, real-world effects of AI systems: the failures and societal impacts that only surface once a model leaves the lab. I led red-teaming and sandbox human evaluations for the paper.
-
Red Teaming for Trust: Evaluating Multicultural and Multilingual AI Systems in Asia-Pacific
The first multicultural and multilingual AI safety red-teaming challenge in the Asia-Pacific region: a large-scale study with 54 participants from 9 countries, evaluating LLMs across diverse cultural and linguistic contexts.