GUI Agents Papers

WebForge: Breaking the Realism-Reproducibility-Scalability Trilemma in Browser Agent Benchmark

Peng Yuan, Yuyang Yin, Yuxuan Cai, Zheng Wei

🏛 Institutions
Unknown
📅 Date
April 13, 2026
📑 Publisher
arXiv
💻 Env
Web
🔑 Keywords
TLDR

WebForge automates browser agent benchmark construction via a four-agent Plan-Generate-Refine-Validate pipeline that produces interactive, self-contained web environments without human annotation. It releases WebForge-Bench (934 tasks across 7 domains and 3 difficulty levels) with seven-dimensional difficulty control that enables systematic capability profiling beyond aggregate scores.
