BEARCUBS: A benchmark for computer-using web agents

Yixiao Song, Katherine Thai, Chau Minh Pham, Yapei Chang, Mazin Nadaf, Mohit Iyyer

🏛 Institutions: UMass Amherst, UMD
📅 Date: March 10, 2025
📑 Publisher: COLM 2025
💻 Env: Web
🔑 Keywords: benchmark, information seeking, live web content, multimodal interactions, BEARCUBS

TLDR

BEARCUBS is a benchmark of 111 information-seeking questions that require web agents to operate on live websites instead of static replicas. Its tasks require multimodal interactions such as video understanding and 3D navigation, and each question comes with a short answer and a human-validated browsing trajectory for transparent evaluation.
