WebForge: Breaking the Realism-Reproducibility-Scalability Trilemma in Browser Agent Benchmark

Peng Yuan, Yuyang Yin, Yuxuan Cai, Zheng Wei

🏛 Institutions: Unknown
📅 Date: April 13, 2026
📑 Publisher: arXiv
💻 Env: Web
🔑 Keywords: benchmark, dataset, automated environment generation, WebForge, WebForge-Bench

TLDR

WebForge automates browser agent benchmark construction via a four-agent Plan-Generate-Refine-Validate pipeline that produces interactive, self-contained web environments without human annotation. It releases WebForge-Bench (934 tasks across 7 domains and 3 difficulty levels) with seven-dimensional difficulty control that enables systematic capability profiling beyond aggregate scores.
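The Plan-Generate-Refine-Validate flow described above can be pictured with a minimal sketch. Everything in it is an illustrative assumption rather than the authors' implementation: the `Environment` structure, the stub stage functions, the `build_task` helper, the retry loop around Refine and Validate, and the "shopping" domain name are all hypothetical. In the actual system each stage is an agent; here each is a plain function so that the control flow is easy to follow.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Environment:
    """Hypothetical container for a candidate task environment."""
    domain: str
    difficulty: int  # assumed encoding: 1..3 for the three difficulty levels
    html: str = ""
    notes: List[str] = field(default_factory=list)


def plan(domain: str, difficulty: int) -> Environment:
    # Plan stage (stub): record what kind of task was requested.
    return Environment(domain=domain, difficulty=difficulty, notes=["planned"])


def generate(env: Environment) -> Environment:
    # Generate stage (stub): emit a placeholder self-contained page.
    env.html = f"<html><body data-domain='{env.domain}'></body></html>"
    env.notes.append("generated")
    return env


def refine(env: Environment) -> Environment:
    # Refine stage (stub): a real stage would repair or enrich the page.
    env.notes.append("refined")
    return env


def validate(env: Environment) -> bool:
    # Validate stage (stub): accept any environment with a non-empty page.
    return bool(env.html)


def build_task(domain: str, difficulty: int, max_rounds: int = 3) -> Optional[Environment]:
    """Run the four stages; the bounded refine/validate retry is an assumption."""
    env = generate(plan(domain, difficulty))
    for _ in range(max_rounds):
        env = refine(env)
        if validate(env):
            return env
    return None  # no valid environment within the round budget


if __name__ == "__main__":
    task = build_task("shopping", difficulty=2)
    print(task.notes if task else "rejected")
```

Keeping each stage behind a small function boundary means a stub can be swapped for an agent call without changing `build_task`, and a rejected environment surfaces as `None` instead of entering the benchmark.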
