AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks?
Ori Yoran, Samuel Joseph Amouyal, Chaitanya Malaviya, Ben Bogin, Ofir Press, Jonathan Berant
- 🏛 Institutions
- Tel Aviv University, University of Pennsylvania, Allen Institute for AI, University of Washington, Princeton
- 📅 Date
- October 21, 2024
- 📑 Publisher
- EMNLP 2024 (Poster)
- 💻 Env
- Web
- 🔑 Keywords
TLDR
Introduces AssistantBench, a benchmark of 214 realistic, time-consuming web tasks that require sustained planning, retrieval, and synthesis rather than short web interactions. The paper also proposes SPA (SeePlanAct), a web agent, and shows that even strong models still struggle on these open-web tasks.
Related papers
- ClawBench: Can AI Agents Complete Everyday Online Tasks? · April 9, 2026 · arXiv
- An Illusion of Progress? Assessing the Current State of Web Agents · April 2, 2025 · COLM 2025
- CocoaBench: Evaluating Unified Digital Agents in the Wild · April 13, 2026 · arXiv
- HealthAdminBench: Evaluating Computer-Use Agents on Healthcare Administration Tasks · April 10, 2026 · arXiv
- Gym-Anything: Turn any Software into an Agent Environment · April 7, 2026 · arXiv
- AndroTMem: From Interaction Trajectories to Anchored Memory in Long-Horizon GUI Agents · March 19, 2026 · arXiv