AgentBench: Evaluating LLMs as Agents
Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, Jie Tang
- 🏛 Institutions
- Tsinghua University, The Ohio State University, ByteDance
- 📅 Date
- January 1, 2024
- 📑 Publisher
- ICLR 2024
- 💻 Env
- 🔑 Keywords
TLDR
Introduces AgentBench, a broad benchmark suite for evaluating LLMs as agents across eight interactive environments, OS interaction among them. It is included here mainly as a general agent-evaluation reference rather than as a GUI-specific benchmark.
Related papers
- GUIDE: Interpretable GUI Agent Evaluation via Hierarchical Diagnosis · April 6, 2026 · arXiv
- MemGUI-Bench: Benchmarking Memory of Mobile GUI Agents in Dynamic Environments · February 3, 2026 · arXiv
- An Illusion of Progress? Assessing the Current State of Web Agents · April 2, 2025 · COLM 2025
- WindowsWorld: A Process-Centric Benchmark of Autonomous GUI Agents in Professional Cross-Application Environments · April 30, 2026 · arXiv
- Odysseys: Benchmarking Web Agents on Realistic Long Horizon Tasks · April 27, 2026 · arXiv
- AutoGUI-v2: A Comprehensive Multi-Modal GUI Functionality Understanding Benchmark · April 27, 2026 · arXiv