GUI Agents Papers

REAL: Benchmarking Autonomous Agents on Deterministic Simulations of Real Websites

Divyansh Garg, Shaun VanWeelden, Diego Caples, Andis Draguns, Nikil Ravi, Pranav Putta, Naman Garg, Tomas Abraham, Michael Lara, Federico Lopez, James Liu, Atharva Gundawar, Prannay Hebbar, Youngchul Joo, Jindong Gu, Charles London, Christian Schroeder de Witt, Sumeet Motwani

🏛 Institutions
The AGI Company, Stanford, Oxford, Mercor, Contramont Research, Plato, Independent
📅 Date
April 15, 2025
📑 Publisher
arXiv
💻 Env
Web
🔑 Keywords
TLDR

REAL benchmarks autonomous web agents on deterministic replicas of 11 real websites so evaluation stays realistic while remaining safe and reproducible. It pairs 112 practical multi-turn tasks with an evaluation harness that mixes programmatic state checks and rubric-guided LLM judgments, and reports frontier agents reaching only about 41% success.
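The mixed evaluation described in the TLDR can be illustrated with a minimal sketch. It is not REAL's harness or API: `Task`, `evaluate`, `state_matches`, `success_rate`, and the `judge` callable are hypothetical names, and the sketch assumes a task may define an expected final site state, a rubric for the agent's answer, or both.

```python
from dataclasses import dataclass
from typing import Callable, Mapping, Optional, Sequence

# Hypothetical types for illustration only; none of these names come from REAL.
Judge = Callable[[str, str], bool]  # (rubric, answer) -> verdict, e.g. an LLM call


@dataclass
class Task:
    goal: str
    # Key/value pairs the site state must contain after the episode (assumed).
    expected_state: Optional[Mapping[str, object]] = None
    # Natural-language rubric for grading the agent's final answer (assumed).
    rubric: Optional[str] = None


def state_matches(expected: Mapping[str, object], final_state: Mapping[str, object]) -> bool:
    """Programmatic check: every expected key is present with the expected value."""
    return all(k in final_state and final_state[k] == v for k, v in expected.items())


def evaluate(task: Task, final_state: Mapping[str, object], answer: str, judge: Judge) -> bool:
    """A task succeeds only if every check it defines passes."""
    if task.expected_state is not None and not state_matches(task.expected_state, final_state):
        return False
    if task.rubric is not None and not judge(task.rubric, answer):
        return False
    return True


def success_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of tasks that succeeded; 0.0 for an empty run."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


def keyword_judge(rubric: str, answer: str) -> bool:
    """Stand-in for a rubric-guided LLM judge: a plain substring match."""
    return rubric.lower() in answer.lower()
```

Because the simulated sites are deterministic, the state check can be an exact comparison, while the judge is passed in as a callable so a real LLM-backed grader could replace the substring stub without changing `evaluate`.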
