BrowseComp-Plus: A More Fair and Transparent Evaluation Benchmark of Deep-Research Agent

Zijian Chen, Xueguang Ma, Shengyao Zhuang, Ping Nie, Kai Zou, Andrew Liu, Joshua Green, Kshama Patel, Ruoxi Meng, Mingyi Su, Sahel Sharifymoghaddam, Yanxi Li, Haoran Hong, Xinyu Shi, Xuye Liu, Nandan Thakur, Crystina Zhang, Luyu Gao, Wenhu Chen, Jimmy Lin

🏛 Institutions: University of Waterloo, CSIRO, Independent, Carnegie Mellon University, The University of Queensland
📅 Date: August 8, 2025
📑 Publisher: arXiv
🔑 Keywords: benchmark, dataset, agentic search, deep research, BrowseComp-Plus

TLDR

Introduces **BrowseComp-Plus**, a fixed-corpus benchmark for evaluating deep-research agents. Each query comes with human-verified supporting documents and challenging negative documents, which enables controlled, fair, and transparent comparisons. Results vary widely across systems: the open-source Search-R1 paired with a BM25 retriever achieves only 3.86% accuracy, GPT-5 reaches 55.9%, and GPT-5 with the Qwen3-Embedding-8B retriever reaches 70.1% with fewer search queries. These gaps highlight the importance of retrieval quality, and the fixed corpus allows retrieval and reasoning to be analyzed separately.
