About

Linda Petrini

AI safety researcher and technical writer. Seven years in machine learning research, from Google Brain to frontier AI safety.

I’ve spent seven years in AI research, from interpretability and representation learning at Google Brain and Mila to frontier AI safety work at Anthropic and Palisade Research. My focus is on understanding and evaluating loss-of-control risks from advanced AI systems: alignment faking, deception, sabotage, and self-exfiltration.

Research background

My background is in fundamental ML research. I studied zero-shot learning and mutual information-based interpretability at Mila (MSc thesis published at ICLR 2020), then completed an AI Residency and a Research Associate position at Google Brain (2019–2023), where I led an interdisciplinary team applying deep learning interpretability to epigenomics. My PhD work at Mila with Marc Bellemare and Aaron Courville focused on interpretability for scientific discovery in epigenomics.

Current work

I currently contribute to AI safety research at Anthropic (contributing author on alignment faking, SHADE-Arena, constitutional classifiers, pretraining data filtering, and inverse scaling in test-time compute) and report on frontier AI risks at Palisade Research, where I also work on misalignment evaluation and self-replication assessment projects. I produce technical reports and policy analysis for the Foresight Institute (Secure AI Tech Tree, hyper-entities research) and the Bezos Earth Fund (AI and Climate report).

Beyond research

Outside of AI, I live in rural Italy, teach acroyoga, and am training as a Rolfing practitioner. I write essays on AI and technology on my Substack.

Open to research collaborations and new opportunities in AI safety evaluation and alignment.