Holistic Agent Leaderboard: The Missing Infrastructure for AI Agent Evaluation

Sayash Kapoor, Benedikt Stroebl, Peter Kirgis, Nitya Nadgir, Zachary S. Siegel, Boyi Wei, Tianci Xue, Ziru Chen, Felix Chen, Saiteja Utpala, Franck Ndzomga, Dheeraj Oruganty, Sophie Luskin, Kangheng Liu, Botao Yu, Amit Arora, Dongyoon Hahm, Harsh Trivedi, Huan Sun, Juyong Lee, Tengjun Jin, Yifan Mai, Yifei Zhou, Yuxuan Zhu, Rishi Bommasani, Daniel Kang, Dawn Song, Peter Henderson, Yu Su, Percy Liang, Arvind Narayanan

🏛 Institutions: Princeton University; Independent Researcher; The Ohio State University; Microsoft Research; Amazon; Georgetown University; KAIST; Stony Brook University; University of Illinois Urbana-Champaign; Stanford University; xAI; University of California, Berkeley
📅 Date: October 13, 2025
📑 Publisher: ICLR 2026 (Poster)
💻 Env
🔑 Keywords: evaluation infrastructure, leaderboard, evaluation harness, cost tracking, log inspection, agent traces, HAL

TLDR

HAL provides standardized infrastructure for evaluating agents across models, scaffolds, and benchmarks rather than introducing a new agent. It reports results from 21,730 rollouts across 9 models and 9 benchmarks, tracks costs and full traces, and uses LLM-aided log inspection to surface behaviors such as benchmark gaming and unsafe actions.
