VeriWeb: Verifiable Long-Chain Web Benchmark for Agentic Information-Seeking

Shunyu Liu, Minghao Liu, Huichi Zhou, Zhenyu Cui, Yang Zhou, Yuhao Zhou, Jialiang Gao, Heng Zhou, Yunhao Yang, Wendong Fan, Puzhen Zhang, Ge Zhang, Jiajun Shi, Weihao Xuan, Jiaxing Huang, Shuang Luo, Fang Wu, Heli Qi, Qingcheng Zeng, Junjie Wang, Aosong Feng, Jindi Lv, Sicong Jiang, Ziqi Ren, Wangchunshu Zhou, Zhenfei Yin, Wenlong Zhang, Guohao Li, Wenhao Yu, Lei Ma, Lei Bai, Qunshu Lin, Mingli Song, Dacheng Tao

🏛 Institutions: NTU, ZJU, University of Tokyo, Shanghai AI Laboratory, Google DeepMind, University of Alberta
📅 Date: August 6, 2025
📑 Publisher: arXiv
💻 Env: Web
🔑 Keywords: long-chain web benchmark, subtask-level verifiability, breadth-and-depth search, human demonstrations, VeriWeb

TLDR

VeriWeb is a web benchmark for long-chain information-seeking tasks that decomposes each problem into interdependent, verifiable subtasks instead of relying only on final-answer checks. It contains 302 human-annotated tasks across five domains and is designed to stress both coverage-oriented search and multi-hop context tracking in realistic web environments.
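The contrast between subtask-level verification and a final-answer-only check can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the `Subtask` fields, the `score_task` helper, the normalized exact-match comparison, and the rule that a subtask only counts when the subtasks it depends on also pass.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Subtask:
    """Hypothetical record for one verifiable step in a long-chain task."""
    name: str
    expected: str
    depends_on: List[str] = field(default_factory=list)


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace before comparing answers."""
    return " ".join(text.lower().split())


def score_task(subtasks: List[Subtask], answers: Dict[str, str]) -> Tuple[float, bool]:
    """Return (fraction of subtasks passed, whether the last subtask passed).

    Assumes subtasks are listed in dependency order and that the last one
    stands in for the final answer.
    """
    if not subtasks:
        raise ValueError("a task needs at least one subtask")
    passed: Dict[str, bool] = {}
    for st in subtasks:
        # Assumed rule: a step counts only if all of its prerequisites passed.
        deps_ok = all(passed.get(dep, False) for dep in st.depends_on)
        own_ok = normalize(answers.get(st.name, "")) == normalize(st.expected)
        passed[st.name] = deps_ok and own_ok
    accuracy = sum(passed.values()) / len(subtasks)
    return accuracy, passed[subtasks[-1].name]


# Toy three-step chain with made-up content.
chain = [
    Subtask("city", "Paris"),
    Subtask("year", "1889", depends_on=["city"]),
    Subtask("engineer", "Gustave Eiffel", depends_on=["year"]),
]
agent_answers = {"city": "paris", "year": "1887", "engineer": "Gustave Eiffel"}
accuracy, final_ok = score_task(chain, agent_answers)
```

In this toy run the agent gets the first step right and the second wrong, so the third step is not credited even though its text matches. A final-answer-only check would only report pass or fail, while the per-subtask score also shows where in the chain the agent went off track.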
