An Illusion of Progress? Assessing the Current State of Web Agents
Tianci Xue, Weijian Qi, Tianneng Shi, Chan Hee Song, Boyu Gou, Dawn Song, Huan Sun, Yu Su
- 🏛 Institutions
- OSU, UC Berkeley
- 📅 Date
- April 2, 2025
- 📑 Publisher
- COLM 2025
- 💻 Env
- Web
- 🔑 Keywords
TLDR
This paper argues that reported web-agent progress is overstated once agents are evaluated on more realistic online tasks. It introduces Online-Mind2Web, a benchmark of 300 tasks across 136 live websites, pairs it with the WebJudge automatic evaluation method, and uses this setup to show that current web-agent capability is substantially weaker than prior benchmarks suggest.
Related papers
- ClawBench: Can AI Agents Complete Everyday Online Tasks? · April 9, 2026 · arXiv
- Ego2Web: A Web Agent Benchmark Grounded in Egocentric Videos · March 23, 2026 · CVPR 2026
- AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks? · October 21, 2024 · EMNLP 2024 (Poster)
- GUIDE: Interpretable GUI Agent Evaluation via Hierarchical Diagnosis · April 6, 2026 · arXiv
- MemGUI-Bench: Benchmarking Memory of Mobile GUI Agents in Dynamic Environments · February 3, 2026 · arXiv
- Odysseys: Benchmarking Web Agents on Realistic Long Horizon Tasks · April 27, 2026 · arXiv