VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents
Xiao Liu , Tianjie Zhang , Yu Gu , Iat Long Iong , Yifan Xu , Xixuan Song , Shudan Zhang , Hanyu Lai , Xinyi Liu , Hanlin Zhao , Jiadai Sun , Xinyue Yang , Yu Yang , Zehan Qi , Shuntian Yao , Xueqiao Sun , Siyi Cheng , Qinkai Zheng , Hao Yu , Hanchen Zhang , Wenyi Hong , Ming Ding , Lihang Pan , Xiaotao Gu , Aohan Zeng , Zhengxiao Du , Chan Hee Song , Yu Su , Yuxiao Dong , Jie Tang
- 🏛 Institutions
- Tsinghua University , Zhejiang University , Peking University , The Ohio State University
- 📅 Date
- August 12, 2024
- 📑 Publisher
- ICLR 2025
- 💻 Env
- 🔑 Keywords
TLDR
VisualAgentBench benchmarks large multimodal models as general visual foundation agents across embodied tasks, GUI tasks, and visual design rather than focusing only on GUI interaction. It also releases trajectory data for behavior cloning, making it relevant to GUI work as a broader visual-agent benchmark rather than a direct GUI paper.
Related papers (24)
- WebForge: Breaking the Realism-Reproducibility-Scalability Trilemma in Browser Agent BenchmarkApril 13, 2026 · arXiv
- Gym-Anything: Turn any Software into an Agent EnvironmentApril 7, 2026 · arXiv
- WebArena-Infinity: Generating Browser Environments with Verifiable Tasks at ScaleMarch 2026 · Blog Post
- PSPA-Bench: A Personalized Benchmark for Smartphone GUI AgentMarch 31, 2026 · arXiv
- SecAgent: Efficient Mobile GUI Agent with Semantic ContextMarch 9, 2026 · arXiv
- WebChain: A Large-Scale Human-Annotated Dataset of Real-World Web Interaction TracesMarch 5, 2026 · arXiv
- Turing Test on Screen: A Benchmark for Mobile GUI Agent HumanizationFebruary 24, 2026 · arXiv
- AmbiBench: Benchmarking Mobile GUI Agents Beyond One-Shot Instructions in the WildFebruary 12, 2026 · arXiv
- When Actions Go Off-Task: Detecting and Correcting Misaligned Actions in Computer-Use AgentsFebruary 9, 2026 · arXiv
- MemGUI-Bench: Benchmarking Memory of Mobile GUI Agents in Dynamic EnvironmentsFebruary 3, 2026 · arXiv
- SwipeGen: Bridging the Execution Gap in GUI Agents via Human-like Swipe SynthesisJanuary 26, 2026 · arXiv
- SMAN-Bench: A Cross-System Benchmark for Mobile Agents under Single- and Multi-path, Ambiguous, and Noisy TasksJanuary 26, 2026 · ICLR 2026 (Poster)
- GUIGuard: Toward a General Framework for Privacy-Preserving GUI AgentsJanuary 26, 2026 · arXiv
- ShowUI-π: Flow-based Generative Models as GUI Dexterous HandsDecember 31, 2025 · arXiv
- MobileWorldBench: Towards Semantic World Modeling For Mobile AgentsDecember 16, 2025 · arXiv
- Beyond Clicking: A Step Towards Generalist GUI Grounding via Text DraggingNovember 7, 2025 · arXiv
- GUI-360: A Comprehensive Dataset and Benchmark for Computer-Using AgentsNovember 6, 2025 · arXiv
- Web-CogReasoner: Towards Knowledge-Induced Cognitive Reasoning for Web AgentsAugust 3, 2025 · ICLR 2026 (Poster)
- NaturalGAIA: Pushing the Frontiers of GUI Agents with a Challenging Benchmark and High-Quality Trajectory DatasetAugust 2, 2025 · arXiv
- FingerTip 20K: A Benchmark for Proactive and Personalized Mobile LLM AgentsJune 9, 2025 · ICLR 2026 (Poster)
- Web-Shepherd: Advancing PRMs for Reinforcing Web AgentsMay 21, 2025 · NeurIPS 2025 (Spotlight)
- Efficient Agent Training for Computer UseMay 20, 2025 · ICLR 2026 (Poster)
- Scaling Computer‑Use Grounding via User Interface Decomposition and SynthesisMay 19, 2025 · NeurIPS 2025 Datasets and Benchmarks Track (Spotlight)
- LearnAct: Few-Shot Mobile GUI Agent with a Unified Demonstration BenchmarkApril 18, 2025 · arXiv