Chloe Li
I spend my time thinking about how to make powerful AI systems honest and aligned with human values. I’m currently a member of technical staff at Anthropic, where I work on alignment training.
Previously, I was an Anthropic Fellow, where I worked on model spec midtraining as an alignment technique. I’ve also worked on honesty training, chain-of-thought monitoring, and control evaluations. Before that, I spent time on AI safety field-building: I led ARENA (an ML engineering program for upskilling people in technical AI safety) and directed the Cambridge AI Safety Hub, where I helped found research programs like MARS. I have an MSc in machine learning from UCL and a BA Hons in psychology & neuroscience from the University of Cambridge.
Publications & other work
- Model Spec Midtraining: Improving How Alignment Training Generalizes
  Chloe Li, Nevan Wichers, Sara Price, Samuel Marks, Jon Kutasov (2026)
- Spilling the Beans: Teaching LLMs to Self-Report Their Hidden Objectives
  Chloe Li, Mary Phuong, Daniel Tan (2025)
  ICLR 2026
- LLMs Can Covertly Sandbag on Capability Evaluations Against Chain-of-Thought Monitoring
  Chloe Li, Mary Phuong, Noah Y. Siegel (2025)
  IJCNLP-AACL 2025, Oral Presentation