REAL: Benchmarking Autonomous Agents on Deterministic Simulations of Real Websites

Divyansh Garg, Shaun VanWeelden, Diego Caples, Andis Draguns, Nikil Ravi, Pranav Putta, Naman Garg, Tomas Abraham, Michael Lara, Federico Lopez, James Liu, Atharva Gundawar, Prannay Hebbar, Youngchul Joo, Jindong Gu, Charles London, Christian Schroeder de Witt, Sumeet Motwani

🏛 Institutions: The AGI Company, Stanford, Oxford, Mercor, Contramont Research, Plato, Independent
📅 Date: April 15, 2025
📑 Publisher: arXiv
💻 Env: Web
🔑 Keywords: benchmark, deterministic website replicas, automatic evaluation, evaluation harness, reproducibility, REAL

TLDR

REAL benchmarks autonomous web agents on deterministic replicas of 11 real websites so evaluation stays realistic while remaining safe and reproducible. It pairs 112 practical multi-turn tasks with an evaluation harness that mixes programmatic state checks and rubric-guided LLM judgments, and reports frontier agents reaching only about 41% success.
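
To make the hybrid evaluation idea concrete, here is a minimal sketch of how programmatic state checks and rubric-guided judgments might be combined into one success rate. It is not REAL's actual harness: the `Task` fields, the split into "action" and "retrieval" task kinds, and the `judge` callable (standing in for an LLM call) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Task:
    # Hypothetical task record; field names are illustrative, not REAL's schema.
    kind: str                              # "action" or "retrieval" (assumed split)
    expected_state: Optional[dict] = None  # key/value pairs that must hold in the final site state
    rubric: Optional[str] = None           # grading instructions passed to the judge


def check_state(final_state: dict, expected: dict) -> bool:
    """Programmatic check: every expected key must be present with the expected value."""
    return all(k in final_state and final_state[k] == v for k, v in expected.items())


def evaluate(task: Task, final_state: dict, answer: str,
             judge: Callable[[str, str], bool]) -> bool:
    """Route a task to the state check or to the rubric-guided judge."""
    if task.kind == "action":
        return check_state(final_state, task.expected_state or {})
    if task.kind == "retrieval":
        # `judge` is a stand-in for an LLM call that applies the rubric to the answer.
        return judge(task.rubric or "", answer)
    raise ValueError(f"unknown task kind: {task.kind}")


def success_rate(results: list[bool]) -> float:
    """Fraction of tasks that passed; 0.0 when there are no results."""
    return sum(results) / len(results) if results else 0.0
```

Because the simulated sites are deterministic, a check like `check_state` would give the same verdict on every run for the same agent trajectory, which is what makes the programmatic half of such a harness reproducible. In this sketch the judge is injected as a plain callable so it can be swapped for a stub during testing.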
