AgentBench: Evaluating LLMs as Agents
Xiao Liu , Hao Yu , Hanchen Zhang , Yifan Xu , Xuanyu Lei , Hanyu Lai , Yu Gu , Hangliang Ding , Kaiwen Men , Kejuan Yang , Shudan Zhang , Xiang Deng , Aohan Zeng , Zhengxiao Du , Chenhui Zhang , Sheng Shen , Tianjun Zhang , Yu Su , Huan Sun , Minlie Huang , Yuxiao Dong , Jie Tang
- 🏛 Institutions
- Tsinghua University , The Ohio State University , ByteDance
- 📅 Date
- January 1, 2024
- 📑 Publisher
- ICLR 2024
- 💻 Env
- 🔑 Keywords
TLDR
Introduces AgentBench, a broad benchmark suite for evaluating LLMs as agents across eight environments, including but not limited to OS interaction. It belongs in this repo mainly as a general agent-evaluation reference rather than a pure GUI benchmark.
Related papers (24)
- GUIDE: Interpretable GUI Agent Evaluation via Hierarchical DiagnosisApril 6, 2026 · arXiv
- MemGUI-Bench: Benchmarking Memory of Mobile GUI Agents in Dynamic EnvironmentsFebruary 3, 2026 · arXiv
- An Illusion of Progress? Assessing the Current State of Web AgentsApril 2, 2025 · COLM 2025
- Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional FieldsJune 9, 2026 · arXiv
- Benchmarking Living-Screen-Native GUI Agents on Short-Video PlatformsJune 3, 2026 · arXiv
- AndroidDaily: A Verifiable Benchmark for Mobile GUI Agents on Real-World Closed-Source ApplicationsMay 26, 2026 · arXiv
- MobileGym: A Verifiable and Highly Parallel Simulation Platform for Mobile GUI Agent ResearchMay 25, 2026 · arXiv
- SimuWoB: Simulating Real-World Mobile Apps for Fast and Faithful GUI Agent BenchmarkingMay 24, 2026 · arXiv
- WindowsWorld: A Process-Centric Benchmark of Autonomous GUI Agents in Professional Cross-Application EnvironmentsApril 30, 2026 · arXiv
- Odysseys: Benchmarking Web Agents on Realistic Long Horizon TasksApril 27, 2026 · arXiv
- AutoGUI-v2: A Comprehensive Multi-Modal GUI Functionality Understanding BenchmarkApril 27, 2026 · arXiv
- GUI-Perturbed: Domain Randomization Reveals Systematic Brittleness in GUI Grounding ModelsApril 15, 2026 · arXiv
- WebForge: Breaking the Realism-Reproducibility-Scalability Trilemma in Browser Agent BenchmarkApril 13, 2026 · arXiv
- CocoaBench: Evaluating Unified Digital Agents in the WildApril 13, 2026 · arXiv
- The Blind Spot of Agent Safety: How Benign User Instructions Expose Critical Vulnerabilities in Computer-Use AgentsApril 12, 2026 · arXiv
- The Amazing Agent Race: Strong Tool Users, Weak NavigatorsApril 11, 2026 · arXiv
- HealthAdminBench: Evaluating Computer-Use Agents on Healthcare Administration TasksApril 10, 2026 · arXiv
- CORA: Conformal Risk-Controlled Agents for Safeguarded Mobile GUI AutomationApril 10, 2026 · arXiv
- Same Outcomes, Different Journeys: A Trace-Level Framework for Comparing Human and GUI-Agent Behavior in Production Search SystemsApril 9, 2026 · arXiv
- KnowU-Bench: Towards Interactive, Proactive, and Personalized Mobile Agent EvaluationApril 9, 2026 · arXiv
- ClawBench: Can AI Agents Complete Everyday Online Tasks?April 9, 2026 · arXiv
- What's Missing in Screen-to-Action? Towards a UI-in-the-Loop Paradigm for Multimodal GUI ReasoningApril 8, 2026 · Findings of ACL 2026
- GameWorld: Towards Standardized and Verifiable Evaluation of Multimodal Game AgentsApril 8, 2026 · arXiv
- WebSP-Eval: Evaluating Web Agents on Website Security and Privacy TasksApril 7, 2026 · arXiv