WebCanvas: Benchmarking Web Agents in Online Environments

Yichen Pan, Dehan Kong, Sida Zhou, Cheng Cui, Yifei Leng, Bing Jiang, Hangyu Liu, Yanyi Shang, Shuyan Zhou, Tongshuang Wu, Zhengyang Wu

🏛 Institutions: iMean AI, CMU
📅 Date: June 18, 2024
📑 Publisher: arXiv
💻 Env: Web
🔑 Keywords: benchmark, dataset, Mind2Web-Live, key-node evaluation, WebCanvas

TLDR

WebCanvas is an online web-agent benchmark built to evaluate agents against live websites rather than static snapshots. It introduces key-node evaluation for progress-aware scoring, releases Mind2Web-Live with 542 tasks and 2,439 intermediate evaluation states, and provides tooling to annotate and maintain those tasks as the web changes.
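
To make the idea of progress-aware scoring concrete, here is a minimal Python sketch of how a trajectory might be scored against a list of key nodes. It is an illustration under stated assumptions, not the WebCanvas implementation: the names `KeyNode` and `score_trajectory`, the URL-predicate form of each check, the in-order matching rule, and the example URLs are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class KeyNode:
    """One intermediate state the agent is expected to reach (hypothetical type)."""
    name: str
    matches: Callable[[str], bool]  # assumed check: a predicate over an observed URL


def score_trajectory(visited_urls: List[str], key_nodes: List[KeyNode]) -> Tuple[int, int, bool]:
    """Return (key nodes reached, total key nodes, whether all were reached).

    Assumption of this sketch: key nodes must be reached in order, and each
    observed step can satisfy at most one key node.
    """
    next_idx = 0
    for url in visited_urls:
        if next_idx < len(key_nodes) and key_nodes[next_idx].matches(url):
            next_idx += 1
    return next_idx, len(key_nodes), next_idx == len(key_nodes)


# Hypothetical task: search for a product, open it, reach checkout.
key_nodes = [
    KeyNode("search", lambda u: "/search" in u),
    KeyNode("product", lambda u: "/item/" in u),
    KeyNode("checkout", lambda u: u.endswith("/checkout")),
]
trajectory = [
    "https://shop.example/",
    "https://shop.example/search?q=lamp",
    "https://shop.example/item/42",
]
completed, total, success = score_trajectory(trajectory, key_nodes)
print(f"{completed}/{total} key nodes reached, success={success}")
```

In this example the agent reaches two of three key nodes, so it earns partial credit even though the task as a whole is not completed. That partial credit is what separates progress-aware scoring from a pass/fail check on the final state alone.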
