GUI Agents Papers

ClawBench: Can AI Agents Complete Everyday Online Tasks?

Yuxuan Zhang, Yubo Wang, Yipeng Zhu, Penghui Du, Junwen Miao, Xuan Lu, Wendong Xu, Yunzhuo Hao, Songcheng Cai, Xiaochen Wang, Huaisong Zhang, Xian Wu, Yi Lu, Minyi Lei, Kai Zou, Huifeng Yin, Ping Nie, Liang Chen, Dongfu Jiang, Wenhu Chen, Kelsey R. Allen

🏛 Institutions
UBC, Vector Institute, CMU, UWaterloo, SJTU, ZJU, HKUST, Tsinghua
📅 Date
April 9, 2026
📑 Publisher
arXiv
💻 Env
Web
🔑 Keywords
TLDR

ClawBench evaluates AI agents on 153 everyday online tasks, spanning purchases, bookings, and job applications, across 144 live production websites. A lightweight interception layer blocks final submissions so that agents can be evaluated safely on real sites. The best model (Claude Sonnet 4.6) completes only 33.3% of tasks, exposing a large gap in real-world web automation.
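
To make the idea of blocking final submissions concrete, here is a minimal sketch of how such a check might look. It is not ClawBench's implementation, which this summary does not describe: the `Request` type, the `should_block` function, and the path markers are all hypothetical, and a real layer would hook into the browser's network stack rather than inspect plain objects.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit

# Hypothetical path prefixes that mark an irreversible final submission.
# These are illustrative assumptions, not rules taken from the paper.
FINAL_SUBMIT_MARKERS = ("/checkout/confirm", "/order/submit", "/apply/submit")


@dataclass(frozen=True)
class Request:
    """A simplified outgoing HTTP request (hypothetical type)."""
    method: str
    url: str


def should_block(req: Request, markers=FINAL_SUBMIT_MARKERS) -> bool:
    """Return True when the request looks like a final submission."""
    # Only state-changing methods can place an order or send an application.
    if req.method.upper() not in {"POST", "PUT"}:
        return False
    path = urlsplit(req.url).path
    return any(path.startswith(marker) for marker in markers)
```

Under these assumptions, an agent can browse, fill in forms, and add items to a cart as usual, while the one request that would actually place the order or send the application is stopped before it reaches the site.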
