WebForge: Breaking the Realism-Reproducibility-Scalability Trilemma in Browser Agent Benchmark
Peng Yuan, Yuyang Yin, Yuxuan Cai, Zheng Wei
- 🏛 Institutions: Unknown
- 📅 Date: April 13, 2026
- 📑 Publisher: arXiv
- 💻 Env: Web
- 🔑 Keywords
TLDR
WebForge automates browser agent benchmark construction via a four-agent Plan-Generate-Refine-Validate pipeline that produces interactive, self-contained web environments without human annotation. It releases WebForge-Bench (934 tasks across 7 domains and 3 difficulty levels) with seven-dimensional difficulty control that enables systematic capability profiling beyond aggregate scores.
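To illustrate what "capability profiling beyond aggregate scores" can mean in practice, here is a minimal sketch that breaks success rate down by difficulty dimension and level. It is not the paper's evaluation code. The record layout, the function name `profile_by_dimension`, and the dimension names `dim_a` and `dim_b` are hypothetical placeholders, and the sketch assumes each task carries one difficulty level per dimension.

```python
from collections import defaultdict


def profile_by_dimension(results):
    """Return {dimension: {level: success_rate}} from per-task results.

    Each result is assumed (hypothetically) to look like:
    {"levels": {"dim_a": 1, ...}, "success": True}
    """
    # counts[dimension][level] = [successes, total]
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for r in results:
        for dim, level in r["levels"].items():
            cell = counts[dim][level]
            cell[0] += int(r["success"])
            cell[1] += 1
    return {
        dim: {lvl: s / n for lvl, (s, n) in levels.items()}
        for dim, levels in counts.items()
    }


# Toy data with placeholder dimensions; levels 1 (easy) and 3 (hard).
results = [
    {"levels": {"dim_a": 1, "dim_b": 3}, "success": True},
    {"levels": {"dim_a": 3, "dim_b": 3}, "success": False},
    {"levels": {"dim_a": 1, "dim_b": 1}, "success": True},
    {"levels": {"dim_a": 3, "dim_b": 1}, "success": True},
]

aggregate = sum(r["success"] for r in results) / len(results)
profile = profile_by_dimension(results)
```

In this toy data the aggregate score is a single number, while the per-dimension profile shows that failures are confined to the harder level of each placeholder dimension, which is the kind of signal an aggregate score hides.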
Related papers
- WebArena-Infinity: Generating Browser Environments with Verifiable Tasks at Scale (March 2026 · Blog Post)
- WebChain: A Large-Scale Human-Annotated Dataset of Real-World Web Interaction Traces (March 5, 2026 · arXiv)
- Web-CogReasoner: Towards Knowledge-Induced Cognitive Reasoning for Web Agents (August 3, 2025 · ICLR 2026, Poster)
- Web-Shepherd: Advancing PRMs for Reinforcing Web Agents (May 21, 2025 · NeurIPS 2025, Spotlight)
- RealWebAssist: A Benchmark for Long-Horizon Web Assistance with Real-World Users (April 14, 2025 · AAAI 2026)
- AgentRewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories (April 11, 2025 · COLM 2025)