WebWalker: Benchmarking LLMs in Web Traversal

Jialong Wu, Wenbiao Yin, Yong Jiang, Zhenglin Wang, Zekun Xi, Runnan Fang, Linhai Zhang, Yulan He, Deyu Zhou, Pengjun Xie, Fei Huang

🏛 Institutions: Tongyi Lab, Alibaba Group
📅 Date: January 13, 2025
📑 Publisher: arXiv
💻 Env: Web
🔑 Keywords: benchmark, framework, web traversal, explore-critic, WebWalkerQA, WebWalker

TLDR

WebWalker studies web traversal for multi-layered information retrieval rather than shallow page lookup. It introduces the WebWalkerQA benchmark and an explore-critic multi-agent framework that improves traversal-based RAG in real-world website hierarchies.

Related papers (24)

WorldGUI: An Interactive Benchmark for Desktop GUI Automation from Any Starting Point

February 12, 2025 · arXiv
The BrowserGym Ecosystem for Web Agent Research

December 6, 2024 · TMLR
Grounding Open-Domain Instructions to Automate Web Support Tasks

March 30, 2021 · NAACL 2021
MobileGym: A Verifiable and Highly Parallel Simulation Platform for Mobile GUI Agent Research

May 25, 2026 · arXiv
LongHorizonUI: A Unified Framework for Robust long-horizon Task Automation of GUI Agent

January 26, 2026 · ICLR 2026 (Poster)
GUITester: Enabling GUI Agents for Exploratory Defect Discovery

January 8, 2026 · arXiv
GUI-explorer: Autonomous Exploration and Mining of Transition-aware Knowledge for GUI Agent

May 22, 2025 · ACL 2025
LearnAct: Few-Shot Mobile GUI Agent with a Unified Demonstration Benchmark

April 18, 2025 · arXiv
You Only Look at Screens: Multimodal Chain-of-Action Agents

September 20, 2023 · Findings of ACL 2024
AutoDroid: LLM-powered Task Automation in Android

August 29, 2023 · MobiCom 2024
SheetCopilot: Bringing Software Productivity to the Next Level through Large Language Models

May 30, 2023 · NeurIPS 2023
Odysseys: Benchmarking Web Agents on Realistic Long Horizon Tasks

April 27, 2026 · arXiv
WebForge: Breaking the Realism-Reproducibility-Scalability Trilemma in Browser Agent Benchmark

April 13, 2026 · arXiv
The Amazing Agent Race: Strong Tool Users, Weak Navigators

April 11, 2026 · arXiv
ClawBench: Can AI Agents Complete Everyday Online Tasks?

April 9, 2026 · arXiv
GameWorld: Towards Standardized and Verifiable Evaluation of Multimodal Game Agents

April 8, 2026 · arXiv
WebSP-Eval: Evaluating Web Agents on Website Security and Privacy Tasks

April 7, 2026 · arXiv
The Art of Building Verifiers for Computer Use Agents

April 5, 2026 · arXiv
When Users Change Their Mind: Evaluating Interruptible Agents in Long-Horizon Web Navigation

April 1, 2026 · arXiv
WebArena-Infinity: Generating Browser Environments with Verifiable Tasks at Scale

March 2026 · Blog Post
Vision2Web: A Hierarchical Benchmark for Visual Website Development with Agent Verification

March 27, 2026 · arXiv
WebTestBench: Evaluating Computer-Use Agents towards End-to-End Automated Web Testing

March 26, 2026 · arXiv
Ego2Web: A Web Agent Benchmark Grounded in Egocentric Videos

March 23, 2026 · CVPR 2026
WebPII: Benchmarking Visual PII Detection for Computer-Use Agents

March 18, 2026 · arXiv