GUI Agents Papers

ClawBench: Can AI Agents Complete Everyday Online Tasks?

Yuxuan Zhang, Yubo Wang, Yipeng Zhu, Penghui Du, Junwen Miao, Xuan Lu, Wendong Xu, Yunzhuo Hao, Songcheng Cai, Xiaochen Wang, Huaisong Zhang, Xian Wu, Yi Lu, Minyi Lei, Kai Zou, Huifeng Yin, Ping Nie, Liang Chen, Dongfu Jiang, Wenhu Chen, Kelsey R. Allen

🏛 Institutions
UBC, Vector Institute, CMU, UWaterloo, SJTU, ZJU, HKUST, Tsinghua
📅 Date
April 9, 2026
📑 Publisher
arXiv
💻 Env
Web
🔑 Keywords
TLDR

ClawBench evaluates AI agents on 153 everyday online tasks across 144 live production websites, spanning purchases, bookings, and job applications. A lightweight interception layer blocks final submissions so that agents can be evaluated safely on live sites. The best model (Claude Sonnet 4.6) achieves a success rate of only 33.3%, exposing a large gap between current agents and reliable real-world web automation.
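As a rough illustration of how an interception layer could stop final submissions, here is a minimal sketch. It is not the ClawBench implementation: the `Request` type, the `should_block` and `intercept` helpers, and the keyword list are all hypothetical, chosen only to show the idea of letting browsing traffic through while holding back state-changing requests that look like a final step.

```python
from dataclasses import dataclass
from urllib.parse import urlparse

# Hypothetical rule set: none of these names or patterns come from the paper.
BLOCKED_METHODS = {"POST", "PUT", "PATCH", "DELETE"}
FINAL_STEP_HINTS = ("checkout", "order", "booking", "apply", "submit")


@dataclass(frozen=True)
class Request:
    method: str
    url: str


def should_block(request: Request) -> bool:
    """Return True when a request looks like an irreversible final submission."""
    # Read-only traffic (GET, HEAD, ...) is always allowed through.
    if request.method.upper() not in BLOCKED_METHODS:
        return False
    # A state-changing request is blocked only if its path hints at a final step.
    path = urlparse(request.url).path.lower()
    return any(hint in path for hint in FINAL_STEP_HINTS)


def intercept(request: Request) -> str:
    """Hold back the request instead of sending it when it would finalize a task."""
    return "blocked" if should_block(request) else "forwarded"
```

Under these assumptions, `intercept(Request("POST", "https://shop.example/checkout/confirm"))` is held back, while a `GET` to the same site is forwarded. A real layer would likely need per-site rules rather than a keyword match on the path.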
