Mind2Web 2

Evaluating Agentic Search with Agent-as-a-Judge

Boyu Gou^1*, Zanming Huang^1*, Yuting Ning^1*, Yu Gu¹, Michael Lin¹, Weijian Qi¹, Andrei Kopanev¹, Botao Yu¹, Bernal Jiménez Gutiérrez¹, Yiheng Shu¹, Chan Hee Song¹, Jiaman Wu¹, Shijie Chen¹, Hanane Nour Moussa¹, Tianshu Zhang¹, Jian Xie¹, Yifei Li¹, Zeyi Liao¹, Tianci Xue¹, Kai Zhang¹, Boyuan Zheng¹, Zhaowei Cai², Viktor Rozgic², Morteza Ziyadi², Huan Sun¹, Yu Su¹

¹The Ohio State University ²Amazon AGI

*Equal Contribution.

arXiv Code

Dataset Leaderboard 𝕏 Post

Mind2Web 2 features realistic and diverse long-horizon web search tasks and a novel Agent-as-a-Judge framework to evaluate complex, time-varying, and citation-backed answers.

Overview

We introduce **Mind2Web 2**, a benchmark of **130 realistic, high-quality, long-horizon tasks** that require real-time web browsing and **extensive information synthesis**, constructed with **over 1,000 hours of human labor**. Each task has undergone hours of expert labor for polishing and validation to ensure the validity, diversity, clarity, and verifiability of our benchmark. To tackle the significant evaluation challenge, we propose a novel [Agent-as-a-Judge framework](https://osu-nlp-group.github.io/Mind2Web-2/#evaluation) that evaluates an answer’s: - *correctness* (i.e., whether the answer satisfies all the requirements of the task) - *attribution* (i.e., whether each claim in the answer can be attributed to the cited source) We conduct a comprehensive evaluation of ten frontier agentic search systems and human performance, along with a detailed error analysis to draw insights for future development. The best-performing system, OpenAI Deep Research, can already achieve 50-70\% of human performance while spending half the time, showing a great potential. Mind2Web 2 provides a rigorous foundation for developing and benchmarking the next generation of agentic search systems.

Leaderboard

We show the evaluation results of agentic search systems on the private test set (120 tasks), with the following metrics: - `Partial Completion`: The average root node score across all tasks, reflecting the partial satisfaction based on the fine-grained evaluation. - `Success Rate`: The percentage of tasks achieving a perfect root node score. - `Pass@3`: The percentage of tasks that achieve a perfect root node score within three attempts. We also include the the following statistics to further contextualize the system performance: - `Time (min)`: The average time taken by the system to complete the tasks, measured in minutes. - `Answer Length`: The average length of the answers generated by the system, measured in word count.

📥 **Submission:** Submitting to our benchmark couldn't be easier! Just collect your agent's answers and send them to us. We'll take care of all the evaluation steps for you, including the cost of running our Agent-as-a-Judge evaluation. Check out our submission guideline for more details. We welcome all submissions and look forward to your participation!

Comparisons with Existing Benchmarks

To better distinguish the difference between Mind2Web 2 and existing benchmarks, we compare it with other related benchmarks in the table below. Mind2Web 2 is the only benchmark to date focusing on long-horizon, time-varying agentic search tasks. It is worth noting that even though there are only 130 tasks, each task contains dozens to hundreds of fine-grained evaluation nodes, thus still providing sufficient differentiation power.

Name	Horizon	# of Tasks	Evaluation
Mind2Web 2	Long	130	Agent-as-a-Judge
Online-Mind2Web	Short	300	LLM-as-a-Judge
WebVoyager	Short	643	LLM-as-a-Judge
Mind2Web-Live	Short	542	Rule
BEARCUBS	Short	111	Answer Match
WebWalkerQA	Short	680	Answer Match
GAIA	Medium	466	Answer Match
AssistantBench	Medium	214	Answer Match
BrowseComp	Long	1,266	Answer Match

Comparison with existing benchmarks for web browsing or search on live websites. **Horizon**: average number of required actions per task. Short (< 10), Medium (10-50), Long (> 50). **Time-Varying**: whether the answer can change over time.

Tasks

The tasks in Mind2Web 2 have the following characteristics: (1) *Realistic and diverse*. Tasks reflect practical user needs across diverse domains, providing substantial real-world value when solved. (2) *Long-horizon and laborious*. Tasks require substantial human effort due to the extended length and breadth of the required searches. (3) *Objective and verifiable*. Each task has clearly defined evaluation criteria that are verifiable by checking the answer text in addition to the cited source webpages. (4) *Time-varying*. Many tasks have answers that could change over time, reflecting the dynamic nature of web information. The resulting benchmark contains 130 diverse tasks covering 6 broad domains and 24 sub-domains. Below we show an overview of the tasks. - Left: Figure shows the domain distribution of the tasks, you can click to expand the domains. - Right: We provide tasks as examples, you can select a task from the dropdown menu to view its description.

Domain Distribution

Example Tasks

Select a task from the dropdown to view its description.

Agent-as-a-Judge Evaluation

Agentic search systems typically produce long, time-varying answers ranging from hundreds to thousands of words on these tasks. The complexity is far beyond what conventional LLM-as-a-Judge methods are used for. Our proposed Agent-as-a-Judge framework can automatically yet reliably evaluate such complex answers. The key insight behind our evaluation methodology lies in the **generation-verification asymmetry**: *while the generated answers can vary substantially across agents, search strategies, or query times, we know **a priori** what each task is looking for and can design a **task-specific rubric** to specify the evaluation logic.* We propose a **tree-structured rubric** for evaluation. At a high level, a rubric evaluates two main aspects of an answer: *correctness* (i.e., whether the answer satisfies all the requirements of the task) and *attribution* (i.e., whether each fact in the answer can be attributed to the cited source). At the operational level, a rubric tree breaks down the evaluation into hierarchical evaluation nodes, where each leaf node conforms to a binary judgment and the internal nodes aggregate and propagate the results toward the root node.

Rubric Examples

Drag, zoom in/out, or hover over nodes for details.

Select a task from the dropdown to view its description and evaluation rubric.

Select a task to view its evaluation rubric.

Acknowledgements

The authors would like to thank colleagues from the OSU NLP group and Amazon AGI for constructive discussions and generous help, Zishuo Zheng for his exploration of developing long-horizon agentic search agents, Akshay Anand and Scott Salisbury for their help on benchmark construction, the Hugging Face team (Amir Mahla, Aymeric Roucher, Aksel Joonas Reedi, and Thomas Wolf) for their assistance with the evaluation of Hugging Face Open Deep Research as well as covering the inference costs, the Grok team (Piaoyang Cui, Hexiang Hu) for their assistance with the evaluation of Grok DeepResearch and DeeperResearch, and the Amazon AGI team for their valuable feedback and contribution to task collection. This research is sponsored in part by a gift from Amazon, ARL W911NF2220144, NSF CAREER 1942980, and NSF OAC 2112606.

BibTeX


        @misc{gou2025mind2web2,
            title = {Mind2Web 2: Evaluating Agentic Search with Agent-as-a-Judge}, 
            author = {Boyu Gou and Zanming Huang and Yuting Ning and Yu Gu and Michael Lin and Weijian Qi and Andrei Kopanev and Botao Yu and Bernal Jiménez Gutiérrez and Yiheng Shu and Chan Hee Song and Jiaman Wu and Shijie Chen and Hanane Nour Moussa and Tianshu Zhang and Jian Xie and Yifei Li and Tianci Xue and Zeyi Liao and Kai Zhang and Boyuan Zheng and Zhaowei Cai and Viktor Rozgic and Morteza Ziyadi and Huan Sun and Yu Su},
            year = {2025},
            eprint = {2506.21506},
            archivePrefix = {arXiv},
            primaryClass = {cs.AI}
        }