ClawBench: Can AI Agents Complete Everyday Online Tasks?

Yuxuan Zhang, Yubo Wang, Yipeng Zhu, Penghui Du, Junwen Miao, Xuan Lu, Wendong Xu, Yunzhuo Hao, Songcheng Cai, Xiaochen Wang, Huaisong Zhang, Xian Wu, Yi Lu, Minyi Lei, Kai Zou, Huifeng Yin, Ping Nie, Liang Chen, Dongfu Jiang, Wenhu Chen, Kelsey R. Allen

🏛 Institutions: UBC, Vector Institute, CMU, UWaterloo, SJTU, ZJU, HKUST, Tsinghua
📅 Date: April 9, 2026
📑 Publisher: arXiv
💻 Env: Web
🔑 Keywords: benchmark, realistic website, long-horizon tasks, ClawBench

TLDR

ClawBench evaluates AI agents on 153 everyday online tasks across 144 live production websites, spanning purchases, bookings, and job applications. A lightweight interception layer blocks final submissions so that agents can be evaluated safely on live sites. The best model (Claude Sonnet 4.6) achieves a success rate of only 33.3%, exposing a large gap in real-world web automation.
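
The interception idea can be illustrated with a minimal sketch. This is not ClawBench's implementation: the `Request` type, the blocked HTTP methods, and the URL markers below are all hypothetical choices made for illustration. The sketch lets ordinary browsing requests through and blocks (while logging) any request that looks like a final, side-effecting submission.

```python
from dataclasses import dataclass
from urllib.parse import urlparse

# Hypothetical rules (assumptions, not taken from the paper):
# state-changing HTTP methods and path fragments that suggest a final step.
BLOCKED_METHODS = {"POST", "PUT", "PATCH", "DELETE"}
FINAL_STEP_MARKERS = ("checkout", "submit", "confirm", "apply", "book")


@dataclass(frozen=True)
class Request:
    """Simplified stand-in for an outgoing browser request."""
    method: str
    url: str


def is_final_submission(req: Request) -> bool:
    """True if the request is state-changing and its path matches a marker."""
    path = urlparse(req.url).path.lower()
    return (
        req.method.upper() in BLOCKED_METHODS
        and any(marker in path for marker in FINAL_STEP_MARKERS)
    )


def intercept(req: Request, blocked_log: list) -> bool:
    """Return True if the request may proceed; otherwise log it and block."""
    if is_final_submission(req):
        blocked_log.append(req)  # kept so an evaluator can inspect the attempt
        return False
    return True
```

Under these assumed rules, a `GET` to a cart page passes, while a `POST` to a path such as `/checkout/confirm` is blocked and recorded, so the attempted submission can be inspected without any real order being placed.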
