Name | Horizon | # of Tasks | Time-Varying | Complex Answer | Attribution Verification | Evaluation |
---|---|---|---|---|---|---|
Mind2Web 2 | Long | 130 | | | | Agent-as-a-Judge |
Online-Mind2Web | Short | 300 | | | | LLM-as-a-Judge |
WebVoyager | Short | 643 | | | | LLM-as-a-Judge |
Mind2Web-Live | Short | 542 | | | | Rule |
BEARCUBS | Short | 111 | | | | Answer Match |
WebWalkerQA | Short | 680 | | | | Answer Match |
GAIA | Medium | 466 | | | | Answer Match |
AssistantBench | Medium | 214 | | | | Answer Match |
BrowseComp | Long | 1,266 | | | | Answer Match |
@inproceedings{
gou2025mindweb,
title={Mind2Web 2: Evaluating Agentic Search with Agent-as-a-Judge},
author={Boyu Gou and Zanming Huang and Yuting Ning and Yu Gu and Michael Lin and Botao Yu and Andrei Kopanev and Weijian Qi and Yiheng Shu and Jiaman Wu and Chan Hee Song and Bernal Jimenez Gutierrez and Yifei Li and Zeyi Liao and Hanane Nour Moussa and Tianshu Zhang and Jian Xie and Tianci Xue and Shijie Chen and Boyuan Zheng and Kai Zhang and Zhaowei Cai and Viktor Rozgic and Morteza Ziyadi and Huan Sun and Yu Su},
booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
year={2025},
url={https://openreview.net/forum?id=AUaW6DS9si}
}