WebSuite: Systematically Evaluating Why Web Agents Fail
Eric Li, Jim Waldo
- 🏛 Institutions: Harvard
- 📅 Date: June 1, 2024
- 📑 Publisher: arXiv
- 💻 Env: Web
TLDR
Introduces WebSuite, a diagnostic benchmark for understanding why web agents fail rather than only whether they fail. It organizes web behavior into a taxonomy of actions and builds both atomic and end-to-end tasks so failures can be traced back to specific action categories.
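The idea of tracing failures back to action categories can be illustrated with a minimal sketch. It assumes each task result carries one category label; the `TaskResult` type, the function name, and the category names (`click`, `type`, `search`) are hypothetical placeholders, not the paper's actual taxonomy or code.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    task_id: str
    category: str  # hypothetical action-category label, not from the paper
    passed: bool


def failure_rate_by_category(results):
    """Return the fraction of failed tasks for each action category."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for r in results:
        totals[r.category] += 1
        if not r.passed:
            failures[r.category] += 1
    return {category: failures[category] / totals[category] for category in totals}


# Illustrative, made-up results for four atomic tasks.
results = [
    TaskResult("t1", "click", True),
    TaskResult("t2", "click", False),
    TaskResult("t3", "type", True),
    TaskResult("t4", "search", False),
]
rates = failure_rate_by_category(results)
```

Grouping by category rather than reporting a single success rate is what lets a failure on an end-to-end task be attributed to the kind of action that broke.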
Related papers
- SecureWebArena: A Holistic Security Evaluation Benchmark for LVLM-based Web Agents · October 11, 2025 · arXiv
- Odysseys: Benchmarking Web Agents on Realistic Long Horizon Tasks · April 27, 2026 · arXiv
- WebForge: Breaking the Realism-Reproducibility-Scalability Trilemma in Browser Agent Benchmark · April 13, 2026 · arXiv
- The Amazing Agent Race: Strong Tool Users, Weak Navigators · April 11, 2026 · arXiv
- ClawBench: Can AI Agents Complete Everyday Online Tasks? · April 9, 2026 · arXiv
- GameWorld: Towards Standardized and Verifiable Evaluation of Multimodal Game Agents · April 8, 2026 · arXiv