Mind2Web 2 Overview

Mind2Web 2 features realistic and diverse long-horizon web search tasks and a novel Agent-as-a-Judge framework to evaluate complex, time-varying, and citation-backed answers.

Overview

We introduce **Mind2Web 2**, a benchmark of **130 realistic, high-quality, long-horizon tasks** that require real-time web browsing and extensive information synthesis, constructed with **over 1,000 hours of human labor**. There are two main challenges in constructing such a benchmark:

- How to collect sufficiently complex yet realistic tasks?
- How to automatically and reliably evaluate the complex answers generated by different agentic search systems?

To collect complex and realistic tasks, we adopt a [three-stage process](https://osu-nlp-group.github.io/Mind2Web-2/#task_collection) to propose, refine, and validate tasks. Each task has undergone hours of expert labor for polishing and validation to ensure the validity, diversity, clarity, and verifiability of our benchmark. To tackle the significant evaluation challenge, we propose a novel [Agent-as-a-Judge framework](https://osu-nlp-group.github.io/Mind2Web-2/#evaluation) that evaluates an answer's:

- *correctness* (i.e., whether the answer satisfies all the requirements of the task)
- *attribution* (i.e., whether each claim in the answer can be attributed to the cited source)

We conduct a comprehensive evaluation of nine frontier agentic search systems and human performance, along with a detailed error analysis to draw insights for future development. The best-performing system, OpenAI Deep Research, already achieves 50-70% of human performance while spending half the time, showing great potential. Mind2Web 2 provides a rigorous foundation for developing and benchmarking the next generation of agentic search systems.

Leaderboard

The leaderboard shows the performance of agentic systems on Mind2Web 2 tasks. The evaluation is based on our Agent-as-a-Judge framework, which automatically evaluates the correctness and attribution of answers. The leaderboard is updated periodically as new results are submitted. The metrics are defined as follows (a minimal computation sketch is shown below):

- `Partial Completion`: The average root node score across all tasks, representing the percentage of passed fine-grained evaluation nodes.
- `Success Rate`: The percentage of tasks completed with all criteria met.
- `Pass@3`: The percentage of tasks successfully completed within three attempts.

We also include the following statistics to further contextualize system performance:

- `Time (min)`: The average time taken by the system to complete the tasks, measured in minutes.
- `Answer Length`: The average length of the answers generated by the system, measured in word count.

The leaderboard results are evaluated on a *Private Test Set* of 120 tasks. We also provide a separate *Public Development Set*, consisting of 10 tasks with evaluation rubrics made publicly available. [[Code](https://github.com/osu-nlp-group/mind2web2)]
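To make the metric definitions concrete, here is a minimal sketch of how they could be computed from per-task attempt results. The data layout and field names (`root_score`, `success`), and the use of the first attempt for `Partial Completion` and `Success Rate`, are illustrative assumptions rather than the schema of the official evaluation code.

```python
from statistics import mean

# Hypothetical per-task results: each task maps to a list of attempts, and each
# attempt records the root node score (fraction of passed rubric nodes) and
# whether all criteria were met. Field names are illustrative.
results = {
    "task_001": [{"root_score": 0.8, "success": False},
                 {"root_score": 1.0, "success": True}],
    "task_002": [{"root_score": 0.6, "success": False}],
}

def partial_completion(results):
    """Average root node score across tasks (first attempt, as an assumption)."""
    return mean(attempts[0]["root_score"] for attempts in results.values())

def success_rate(results):
    """Fraction of tasks whose first attempt meets all criteria."""
    return mean(attempts[0]["success"] for attempts in results.values())

def pass_at_3(results):
    """Fraction of tasks solved in at least one of up to three attempts."""
    return mean(any(a["success"] for a in attempts[:3]) for attempts in results.values())

print(f"Partial Completion: {partial_completion(results):.2f}")
print(f"Success Rate:       {success_rate(results):.2f}")
print(f"Pass@3:             {pass_at_3(results):.2f}")
```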
Please follow the codebase provided in our [repo](https://github.com/osu-nlp-group/mind2web2) to test your agent on Mind2Web 2. For leaderboard submission, [contact us](mailto:m2w2-leaderboard@googlegroups.com) with your organization name and your agent's name. Submit both the agent's responses on the *Private Test Set* and the corresponding cached web content to ensure evaluation consistency. We also encourage including the time taken to complete each task. **Note**: The cached web content can be large, so we recommend storing it with a cloud storage service and providing a link in your submission.
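The exact cache format expected for submissions is defined in the repo; purely as an illustration, the sketch below shows one simple way an agent harness could snapshot every page it fetches so the judge can later re-verify citations against the same content. The directory layout and file naming here are hypothetical.

```python
import hashlib
import json
import pathlib
import urllib.request

# Hypothetical cache layout: one HTML snapshot plus a small metadata file per URL.
CACHE_DIR = pathlib.Path("web_cache")
CACHE_DIR.mkdir(exist_ok=True)

def cache_page(url: str) -> pathlib.Path:
    """Fetch a URL once and store its raw content and metadata on disk."""
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    html_path = CACHE_DIR / f"{key}.html"
    if not html_path.exists():
        with urllib.request.urlopen(url, timeout=30) as resp:
            html_path.write_bytes(resp.read())
        (CACHE_DIR / f"{key}.json").write_text(json.dumps({"url": url}))
    return html_path
```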


Comparisons with Existing Benchmarks

To better distinguish Mind2Web 2 from existing benchmarks, we compare it with related benchmarks in the table below. Our benchmark is the only agentic search benchmark to date focusing on long-horizon, time-varying tasks, and it is made possible by our Agent-as-a-Judge evaluation methodology. It is worth noting that even though there are only 130 tasks, each task contains dozens to hundreds of fine-grained evaluation nodes, and thus the benchmark still provides sufficient differentiation power.
| Name | Horizon | # of Tasks | Time-Varying | Evaluation |
| --- | --- | --- | --- | --- |
| **Mind2Web 2** | Long | 130 | ✓ | Agent-as-a-Judge |
| Online-Mind2Web | Short | 300 | | LLM-as-a-Judge |
| WebVoyager | Short | 643 | | LLM-as-a-Judge |
| Mind2Web-Live | Short | 542 | | Rule |
| BEARCUBS | Short | 111 | | Manual Evaluation |
| WebWalkerQA | Short | 680 | | Answer Match |
| GAIA | Medium | 466 | | Answer Match |
| AssistantBench | Medium | 214 | | Answer Match |
| BrowseComp | Long | 1,266 | | Answer Match |
Comparison with existing benchmarks for web browsing or search on live websites. **Horizon**: average number of required actions per task. Short (< 10), Medium (10-50), Long (> 50). **Time-Varying**: whether the answer can change over time.


Task Collection

Our task collection consists of three stages:

- **Stage 1, Task Proposal**: Task proposers independently generate task ideas based on their authentic search needs or inspiration from our provided domain guidelines, ensuring initial alignment with the realism and laboriousness desiderata.
- **Stage 2, Task Refinement**: Trained refinement experts iteratively revise or filter tasks to enforce strict verifiability while collaborating closely with the original task proposers to maintain task relevance.
- **Stage 3, Task Validation**: Experienced validation experts manually attempt and verify each refined task, ensuring feasibility, determinism, and clarity of all evaluation criteria.

Only tasks independently validated by at least two validation experts are included in Mind2Web 2. The resulting benchmark contains 130 diverse tasks covering 6 broad domains and 24 sub-domains. Below we show an overview of the tasks:

- Left: the domain distribution of the tasks (click a domain to expand it).
- Right: example tasks (select a task from the dropdown menu to view its description).

Domain Distribution

Example Tasks




Agent-as-a-Judge Evaluation

Agentic search systems typically produce long, time-varying answers ranging from hundreds to thousands of words on these tasks, a complexity far beyond what conventional LLM-as-a-Judge methods are designed to handle. Our proposed Agent-as-a-Judge framework can automatically yet reliably evaluate such complex answers.

The key insight behind our evaluation methodology lies in the **generation-verification asymmetry**: *while the generated answers can vary substantially across agents, search strategies, or query times, we know **a priori** what each task is looking for and can design a **task-specific rubric** to specify the evaluation logic.*

We propose a **tree-structured rubric** for evaluation. At a high level, a rubric evaluates two main aspects of an answer: *correctness* (i.e., whether the answer satisfies all the requirements of the task) and *attribution* (i.e., whether each fact in the answer can be attributed to the cited source). At the operational level, a rubric tree breaks down the evaluation into hierarchical evaluation nodes, where each leaf node corresponds to a binary judgment and the internal nodes aggregate and propagate the results toward the root.
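To make the rubric structure concrete, below is a minimal sketch of a rubric tree with binary leaf judgments and score aggregation toward the root. The node fields, the `critical` gating rule, and the toy task are simplified assumptions for illustration, not the exact aggregation logic or rubric format released with Mind2Web 2.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class RubricNode:
    """One node in a (simplified) tree-structured rubric.

    Leaf nodes hold a binary judge function over the answer text; internal
    nodes aggregate their children's scores. `critical=True` gates the
    subtree's credit (an illustrative assumption, not the benchmark's exact rule).
    """
    name: str
    judge: Optional[Callable[[str], bool]] = None      # set only on leaf nodes
    children: List["RubricNode"] = field(default_factory=list)
    critical: bool = False

    def score(self, answer: str) -> float:
        if self.judge is not None:                      # leaf: binary judgment
            return float(self.judge(answer))
        child_scores = [c.score(answer) for c in self.children]
        # If any critical child fails, the whole subtree receives no credit.
        if any(c.critical and s == 0.0 for c, s in zip(self.children, child_scores)):
            return 0.0
        return sum(child_scores) / len(child_scores)    # partial credit otherwise

# Toy rubric: the answer must name the required product (critical) and should
# cite a source URL for its claims.
rubric = RubricNode(
    name="root",
    children=[
        RubricNode("mentions product", judge=lambda a: "ThinkPad" in a, critical=True),
        RubricNode("cites a source", judge=lambda a: "http" in a),
    ],
)

print(rubric.score("The ThinkPad X1 costs $1,200 (https://example.com)."))  # 1.0
print(rubric.score("Some laptop costs $1,200."))                            # 0.0 (critical fail)
```

The root score of such a tree is what `Partial Completion` averages over tasks, while a task counts as a success only when every node passes.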

Rubric Examples


Acknowledgements

The authors would like to thank colleagues from the OSU NLP group and Amazon AGI for constructive discussions and generous help, Zishuo Zheng for his exploration of developing long-horizon agentic search agents, Akshay Anand and Scott Salisbury for their help on benchmark construction, the Hugging Face team (Amir Mahla, Aymeric Roucher, Aksel Joonas Reedi, and Thomas Wolf) for their assistance with the evaluation of Hugging Face Open Deep Research as well as covering the inference costs, the Grok team (Piaoyang Cui, Hexiang Hu) for their assistance with the evaluation of Grok DeepResearch and DeeperResearch, and the Amazon AGI team for their valuable feedback and contribution to task collection. This research is sponsored in part by a gift from Amazon, ARL W911NF2220144, NSF CAREER 1942980, and NSF OAC 2112606.

BibTeX


        @misc{gou2025mind2web2,
            title = {Mind2Web 2: Evaluating Agentic Search with Agent-as-a-Judge}, 
            author = {Boyu Gou and Zanming Huang and Yuting Ning and Yu Gu and Michael Lin and Weijian Qi and Andrei Kopanev and Botao Yu and Bernal Jiménez Gutiérrez and Yiheng Shu and Chan Hee Song and Jiaman Wu and Shijie Chen and Hanane Nour Moussa and Tianshu Zhang and Jian Xie and Yifei Li and Tianci Xue and Zeyi Liao and Kai Zhang and Boyuan Zheng and Zhaowei Cai and Viktor Rozgic and Morteza Ziyadi and Huan Sun and Yu Su},
            year = {2025},
            eprint = {2506.21506},
            archivePrefix = {arXiv},
            primaryClass = {cs.AI}
        }