The Amazing Agent Race: Strong Tool Users, Weak Navigators
Zae Myung Kim, Dongseok Lee, Jaehyung Kim, Vipul Raheja, Dongyeop Kang
- 🏛 Institutions
- University of Minnesota
- 📅 Date
- April 11, 2026
- 📑 Publisher
- arXiv
- 💻 Env
- Web
- 🔑 Keywords
TLDR
The Amazing Agent Race introduces 1,400 DAG-puzzle legs that require fork-merge tool chains over Wikipedia, distinguishing navigation from tool-use ability. The best agent reaches only 37.2%, with navigation errors dominating (27-52% of trials) while tool-use errors stay below 17%, revealing a navigation blind spot invisible to linear benchmarks.
Related papers
- Odysseys: Benchmarking Web Agents on Realistic Long Horizon TasksApril 27, 2026 · arXiv
- WebForge: Breaking the Realism-Reproducibility-Scalability Trilemma in Browser Agent BenchmarkApril 13, 2026 · arXiv
- ClawBench: Can AI Agents Complete Everyday Online Tasks?April 9, 2026 · arXiv
- GameWorld: Towards Standardized and Verifiable Evaluation of Multimodal Game AgentsApril 8, 2026 · arXiv
- WebSP-Eval: Evaluating Web Agents on Website Security and Privacy TasksApril 7, 2026 · arXiv
- The Art of Building Verifiers for Computer Use AgentsApril 5, 2026 · arXiv