REAL: Benchmarking Autonomous Agents on Deterministic Simulations of Real Websites
Divyansh Garg, Shaun VanWeelden, Diego Caples, Andis Draguns, Nikil Ravi, Pranav Putta, Naman Garg, Tomas Abraham, Michael Lara, Federico Lopez, James Liu, Atharva Gundawar, Prannay Hebbar, Youngchul Joo, Jindong Gu, Charles London, Christian Schroeder de Witt, Sumeet Motwani
- 🏛 Institutions: The AGI Company, Stanford, Oxford, Mercor, Contramont Research, Plato, Independent
- 📅 Date: April 15, 2025
- 📑 Publisher: arXiv
- 💻 Env: Web
TLDR
REAL benchmarks autonomous web agents on deterministic replicas of 11 real websites, so evaluation stays realistic while remaining safe and reproducible. It pairs 112 practical multi-turn tasks with an evaluation harness that combines programmatic checks of website state with rubric-guided LLM judgments, and reports that frontier agents reach only about 41% success.
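To make the hybrid evaluation idea concrete, here is a minimal sketch of how a harness might route each task to either a programmatic state check or a rubric-guided judge. It is not the benchmark's actual code. The `Task` fields, the `"action"` and `"retrieval"` labels, the `judge` callable, and the keyword-based `keyword_judge` (a stand-in for a real LLM call) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Task:
    task_id: str
    kind: str                              # "action" or "retrieval" (hypothetical labels)
    expected_state: Optional[dict] = None  # required final site state, for action tasks
    rubric: Optional[str] = None           # grading rubric, for retrieval tasks


def state_matches(final_state: dict, expected: dict) -> bool:
    """Programmatic check: each expected key must be present with the expected value."""
    return all(k in final_state and final_state[k] == v for k, v in expected.items())


def evaluate(task: Task, final_state: dict, answer: str,
             judge: Callable[[str, str], bool]) -> bool:
    """Route a task to the state check or to the rubric judge."""
    if task.kind == "action":
        if task.expected_state is None:
            raise ValueError(f"{task.task_id}: action task needs expected_state")
        return state_matches(final_state, task.expected_state)
    if task.kind == "retrieval":
        if task.rubric is None:
            raise ValueError(f"{task.task_id}: retrieval task needs a rubric")
        return judge(task.rubric, answer)
    raise ValueError(f"{task.task_id}: unknown task kind {task.kind!r}")


def keyword_judge(rubric: str, answer: str) -> bool:
    """Stand-in for an LLM judge: passes if the rubric text appears in the answer."""
    return rubric.lower() in answer.lower()


def success_rate(outcomes: list[bool]) -> float:
    """Fraction of tasks that passed; 0.0 for an empty run."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


# Hypothetical two-task run: one state-changing task, one lookup task.
tasks = [
    Task("cart-1", "action", expected_state={"cart_items": 2}),
    Task("price-1", "retrieval", rubric="$19.99"),
]
outcomes = [
    evaluate(tasks[0], {"cart_items": 2, "user": "demo"}, "", keyword_judge),
    evaluate(tasks[1], {}, "The listed price is $25.00.", keyword_judge),
]
print(success_rate(outcomes))
```

Because the simulated sites are deterministic, a state check like `state_matches` can compare against a fixed expected outcome on every run. Free-text answers have no such fixed target, which is why this sketch delegates them to a judge.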
Related papers
- WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models · January 25, 2024 · ACL 2024
- SPA-Bench: A Comprehensive Benchmark for SmartPhone Agent Evaluation · October 19, 2024 · ICLR 2025 (Spotlight)
- Odysseys: Benchmarking Web Agents on Realistic Long Horizon Tasks · April 27, 2026 · arXiv
- WebForge: Breaking the Realism-Reproducibility-Scalability Trilemma in Browser Agent Benchmark · April 13, 2026 · arXiv
- The Amazing Agent Race: Strong Tool Users, Weak Navigators · April 11, 2026 · arXiv
- ClawBench: Can AI Agents Complete Everyday Online Tasks? · April 9, 2026 · arXiv