Name | Horizon | # of Tasks | Time-Varying | Evaluation |
---|---|---|---|---|
Mind2Web 2 | Long | 130 | | Agent-as-a-Judge |
Online-Mind2Web | Short | 300 | | LLM-as-a-Judge |
WebVoyager | Short | 643 | | LLM-as-a-Judge |
Mind2Web-Live | Short | 542 | | Rule |
BEARCUBS | Short | 111 | | Manual Evaluation |
WebWalkerQA | Short | 680 | | Answer Match |
GAIA | Medium | 466 | | Answer Match |
AssistantBench | Medium | 214 | | Answer Match |
BrowseComp | Long | 1,266 | | Answer Match |
@misc{gou2025mind2web2,
title = {Mind2Web 2: Evaluating Agentic Search with Agent-as-a-Judge},
author = {Boyu Gou and Zanming Huang and Yuting Ning and Yu Gu and Michael Lin and Weijian Qi and Andrei Kopanev and Botao Yu and Bernal Jiménez Gutiérrez and Yiheng Shu and Chan Hee Song and Jiaman Wu and Shijie Chen and Hanane Nour Moussa and Tianshu Zhang and Jian Xie and Yifei Li and Tianci Xue and Zeyi Liao and Kai Zhang and Boyuan Zheng and Zhaowei Cai and Viktor Rozgic and Morteza Ziyadi and Huan Sun and Yu Su},
year = {2025},
eprint = {2506.21506},
archivePrefix = {arXiv},
primaryClass = {cs.AI}
}