AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks?

Ori Yoran, Samuel Joseph Amouyal, Chaitanya Malaviya, Ben Bogin, Ofir Press, Jonathan Berant

🏛 Institutions: Tel Aviv University, University of Pennsylvania, Allen Institute for AI, University of Washington, Princeton
📅 Date: October 21, 2024
📑 Publisher: EMNLP 2024 (Poster)
💻 Env: Web
🔑 Keywords: benchmark, realistic website, AssistantBench, long-horizon tasks, SPA

TLDR

Introduces AssistantBench, a benchmark of 214 realistic and time-consuming web tasks that require sustained planning, retrieval, and synthesis rather than short web interactions. The paper also proposes SeePlanAct (SPA), a new web agent, and shows that even strong models still struggle on these open-web tasks.

Related papers (24)

ClawBench: Can AI Agents Complete Everyday Online Tasks?

April 9, 2026 · arXiv
An Illusion of Progress? Assessing the Current State of Web Agents

April 2, 2025 · COLM 2025
Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional Fields

June 9, 2026 · arXiv
SimuWoB: Simulating Real-World Mobile Apps for Fast and Faithful GUI Agent Benchmarking

May 24, 2026 · arXiv
CocoaBench: Evaluating Unified Digital Agents in the Wild

April 13, 2026 · arXiv
HealthAdminBench: Evaluating Computer-Use Agents on Healthcare Administration Tasks

April 10, 2026 · arXiv
Gym-Anything: Turn any Software into an Agent Environment

April 7, 2026 · arXiv
AndroTMem: From Interaction Trajectories to Anchored Memory in Long-Horizon GUI Agents

March 19, 2026 · arXiv
OS-Marathon: Benchmarking Computer-Use Agents on Long-Horizon Repetitive Tasks

January 28, 2026 · arXiv
MobileBench-OL: A Comprehensive Chinese Benchmark for Evaluating Mobile GUI Agents in Real-World Environment

January 28, 2026 · arXiv
LongHorizonUI: A Unified Framework for Robust long-horizon Task Automation of GUI Agent

January 26, 2026 · ICLR 2026 (Poster)
MobileWorld: Benchmarking Autonomous Mobile Agents in Agent-User Interactive and MCP-Augmented Environments

December 22, 2025 · arXiv
Odysseys: Benchmarking Web Agents on Realistic Long Horizon Tasks

April 27, 2026 · arXiv
WebForge: Breaking the Realism-Reproducibility-Scalability Trilemma in Browser Agent Benchmark

April 13, 2026 · arXiv
The Amazing Agent Race: Strong Tool Users, Weak Navigators

April 11, 2026 · arXiv
GameWorld: Towards Standardized and Verifiable Evaluation of Multimodal Game Agents

April 8, 2026 · arXiv
WebSP-Eval: Evaluating Web Agents on Website Security and Privacy Tasks

April 7, 2026 · arXiv
The Art of Building Verifiers for Computer Use Agents

April 5, 2026 · arXiv
When Users Change Their Mind: Evaluating Interruptible Agents in Long-Horizon Web Navigation

April 1, 2026 · arXiv
WebArena-Infinity: Generating Browser Environments with Verifiable Tasks at Scale

March 2026 · Blog Post
Vision2Web: A Hierarchical Benchmark for Visual Website Development with Agent Verification

March 27, 2026 · arXiv
WebTestBench: Evaluating Computer-Use Agents towards End-to-End Automated Web Testing

March 26, 2026 · arXiv
Ego2Web: A Web Agent Benchmark Grounded in Egocentric Videos

March 23, 2026 · CVPR 2026
WebPII: Benchmarking Visual PII Detection for Computer-Use Agents

March 18, 2026 · arXiv