Holistic Agent Leaderboard: The Missing Infrastructure for AI Agent Evaluation
Sayash Kapoor, Benedikt Stroebl, Peter Kirgis, Nitya Nadgir, Zachary S. Siegel, Boyi Wei, Tianci Xue, Ziru Chen, Felix Chen, Saiteja Utpala, Franck Ndzomga, Dheeraj Oruganty, Sophie Luskin, Kangheng Liu, Botao Yu, Amit Arora, Dongyoon Hahm, Harsh Trivedi, Huan Sun, Juyong Lee, Tengjun Jin, Yifan Mai, Yifei Zhou, Yuxuan Zhu, Rishi Bommasani, Daniel Kang, Dawn Song, Peter Henderson, Yu Su, Percy Liang, Arvind Narayanan
- 🏛 Institutions
- Princeton University, Independent Researcher, The Ohio State University, Microsoft Research, Amazon, Georgetown University, KAIST, Stony Brook University, University of Illinois Urbana-Champaign, Stanford University, xAI, University of California, Berkeley
- 📅 Date
- October 13, 2025
- 📑 Publisher
- ICLR 2026 (Poster)
- 💻 Env
- 🔑 Keywords
HAL provides standardized infrastructure for evaluating agents across models, scaffolds, and benchmarks rather than introducing a new agent. It reports results from 21,730 rollouts across 9 models and 9 benchmarks, tracks costs and full traces, and uses LLM-aided log inspection to surface behaviors such as benchmark gaming and unsafe actions.