GUI Agents Papers

Mind2Web 2: Evaluating Agentic Search with Agent-as-a-Judge

Boyu Gou, Zanming Huang, Yuting Ning, Yu Gu, Michael Lin, Weijian Qi, Andrei Kopanev, Botao Yu, Bernal Jiménez Gutiérrez, Yiheng Shu, Chan Hee Song, Jiaman Wu, Shijie Chen, Hanane Nour Moussa, Tianshu Zhang, Jian Xie, Yifei Li, Tianci Xue, Zeyi Liao, Kai Zhang, Boyuan Zheng, Zhaowei Cai, Viktor Rozgic, Morteza Ziyadi, Huan Sun, Yu Su

🏛 Institutions
OSU, Amazon AGI
📅 Date
September 18, 2025
📑 Publisher
NeurIPS 2025 Datasets & Benchmarks Track (Poster)
💻 Env
Web
🔑 Keywords
TLDR

Mind2Web 2 benchmarks long-horizon agentic search with 130 human-crafted tasks that require real-time browsing and citation-backed synthesis. It evaluates systems with task-specific judge agents built from tree-structured rubrics that score both answer correctness and source attribution, and compares ten frontier systems against human performance.
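
As a rough illustration of how a tree-structured rubric could roll leaf-level judgments up into one task score, here is a minimal sketch. It is not the paper's implementation: the `RubricNode` class, the `aggregate` function, the example node names, and the choice of plain averaging at internal nodes are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RubricNode:
    """Hypothetical rubric node: a leaf carries a judge verdict, a parent groups children."""
    name: str
    score: Optional[float] = None  # leaf verdict in [0, 1]; None for internal nodes
    children: List["RubricNode"] = field(default_factory=list)


def aggregate(node: RubricNode) -> float:
    """Return the node's score; internal nodes average their children (an assumed rule)."""
    if not node.children:
        if node.score is None:
            raise ValueError(f"leaf '{node.name}' has no score")
        return node.score
    return sum(aggregate(child) for child in node.children) / len(node.children)


# Example rubric with one branch for answer correctness and one for source attribution.
rubric = RubricNode("task", children=[
    RubricNode("correctness", children=[
        RubricNode("item_found", score=1.0),
        RubricNode("price_matches", score=0.0),
    ]),
    RubricNode("attribution", children=[
        RubricNode("url_cited", score=1.0),
        RubricNode("url_supports_claim", score=1.0),
    ]),
])

task_score = aggregate(rubric)  # correctness 0.5, attribution 1.0, averaged at the root
```

Averaging is only one possible aggregation rule; a real rubric might instead weight branches or zero out a parent when a required child fails.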
