An Illusion of Progress? Assessing the Current State of Web Agents

Tianci Xue, Weijian Qi, Tianneng Shi, Chan Hee Song, Boyu Gou, Dawn Song, Huan Sun, Yu Su

🏛 Institutions: OSU, UC Berkeley
📅 Date: April 2, 2025
📑 Publisher: COLM 2025
💻 Env: Web
🔑 Keywords: benchmark realistic website evaluation Online-Mind2Web WebJudge LLM-as-a-judge

TLDR

This paper argues that reported web-agent progress is overstated once agents are evaluated on more realistic online tasks. It introduces Online-Mind2Web with 300 tasks across 136 live websites, pairs it with the WebJudge automatic evaluation method, and uses that setup to show a much weaker picture of current web-agent capability than prior benchmarks suggest.
