GUI Agents Papers

REAL: Benchmarking Autonomous Agents on Deterministic Simulations of Real Websites

Divyansh Garg, Shaun VanWeelden, Diego Caples, Andis Draguns, Nikil Ravi, Pranav Putta, Naman Garg, Tomas Abraham, Michael Lara, Federico Lopez, James Liu, Atharva Gundawar, Prannay Hebbar, Youngchul Joo, Jindong Gu, Charles London, Christian Schroeder de Witt, Sumeet Motwani

🏛 Institutions
The AGI Company, Stanford, Oxford, Mercor, Contramont Research, Plato, Independent
📅 Date
April 15, 2025
📑 Publisher
arXiv
💻 Env
Web
TLDR

REAL benchmarks autonomous web agents on deterministic replicas of 11 real websites so evaluation stays realistic while remaining safe and reproducible. It pairs 112 practical multi-turn tasks with an evaluation harness that mixes programmatic state checks and rubric-guided LLM judgments, and reports frontier agents reaching only about 41% success.
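
As a rough illustration of how a harness might combine the two kinds of checks mentioned above, here is a minimal Python sketch. It is not REAL's actual API: the `Task` fields, `state_check`, `evaluate`, and the `keyword_judge` stand-in (which replaces a real LLM judge with a simple substring test) are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional


@dataclass
class Task:
    """Hypothetical task record; field names are illustrative, not from REAL."""
    goal: str
    expected_state: Optional[Dict[str, Any]] = None  # checked programmatically
    rubric: Optional[str] = None                     # checked by a judge


def state_check(final_state: Dict[str, Any], expected: Dict[str, Any]) -> bool:
    """Programmatic check: every expected key must be present with the expected value."""
    return all(k in final_state and final_state[k] == v for k, v in expected.items())


def keyword_judge(rubric: str, answer: str) -> bool:
    """Stand-in for an LLM judge (assumption): passes if the rubric text appears in the answer."""
    return rubric.lower() in answer.lower()


def evaluate(
    task: Task,
    final_state: Dict[str, Any],
    answer: str,
    judge: Callable[[str, str], bool],
) -> bool:
    """A task succeeds only if every check that applies to it passes."""
    checks = []
    if task.expected_state is not None:
        checks.append(state_check(final_state, task.expected_state))
    if task.rubric is not None:
        checks.append(judge(task.rubric, answer))
    # A task with no checks at all is treated as a failure rather than a vacuous pass.
    return bool(checks) and all(checks)


if __name__ == "__main__":
    cart_task = Task(goal="Add one item to the cart", expected_state={"cart_count": 1})
    print(evaluate(cart_task, {"cart_count": 1}, "", keyword_judge))
```

Because the simulated sites are deterministic, a state comparison like `state_check` can be repeated across runs with the same expected values; the judge is passed in as a callable so the substring stand-in can be swapped for a real model call.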
