GUI Agents Papers

Holistic Agent Leaderboard: The Missing Infrastructure for AI Agent Evaluation

Sayash Kapoor, Benedikt Stroebl, Peter Kirgis, Nitya Nadgir, Zachary S. Siegel, Boyi Wei, Tianci Xue, Ziru Chen, Felix Chen, Saiteja Utpala, Franck Ndzomga, Dheeraj Oruganty, Sophie Luskin, Kangheng Liu, Botao Yu, Amit Arora, Dongyoon Hahm, Harsh Trivedi, Huan Sun, Juyong Lee, Tengjun Jin, Yifan Mai, Yifei Zhou, Yuxuan Zhu, Rishi Bommasani, Daniel Kang, Dawn Song, Peter Henderson, Yu Su, Percy Liang, Arvind Narayanan

🏛 Institutions
Princeton University, Independent Researcher, The Ohio State University, Microsoft Research, Amazon, Georgetown University, KAIST, Stony Brook University, University of Illinois Urbana-Champaign, Stanford University, xAI, University of California, Berkeley
📅 Date
October 13, 2025
📑 Publisher
ICLR 2026 (Poster)
TLDR

HAL provides standardized infrastructure for evaluating agents across models, scaffolds, and benchmarks rather than introducing a new agent. It reports results from 21,730 rollouts across 9 models and 9 benchmarks, tracks costs and full traces, and uses LLM-aided log inspection to surface behaviors such as benchmark gaming and unsafe actions.
