Mind2Web 2: Evaluating Agentic Search with Agent-as-a-Judge

Boyu Gou, Zanming Huang, Yuting Ning, Yu Gu, Michael Lin, Weijian Qi, Andrei Kopanev, Botao Yu, Bernal Jiménez Gutiérrez, Yiheng Shu, Chan Hee Song, Jiaman Wu, Shijie Chen, Hanane Nour Moussa, Tianshu Zhang, Jian Xie, Yifei Li, Tianci Xue, Zeyi Liao, Kai Zhang, Boyuan Zheng, Zhaowei Cai, Viktor Rozgic, Morteza Ziyadi, Huan Sun, Yu Su

🏛 Institutions: OSU, Amazon AGI
📅 Date: September 18, 2025
📑 Publisher: NeurIPS 2025 Datasets & Benchmarks Track (Poster)
💻 Env: Web
🔑 Keywords: agentic search, agent-as-a-judge, tree-structured rubric, source attribution, human evaluation, Mind2Web 2

TLDR

Mind2Web 2 benchmarks long-horizon agentic search with 130 human-crafted tasks that require real-time browsing and citation-backed synthesis. It evaluates systems with task-specific judge agents built from tree-structured rubrics that score both answer correctness and source attribution, and compares ten frontier systems against human performance.
