GUI Agents Papers

GUI-World: A Video Benchmark and Dataset for Multimodal GUI-oriented Understanding

Dongping Chen, Yue Huang, Siyuan Wu, Jingyu Tang, Liuyi Chen, Yilin Bai, Zhigang He, Chenlong Wang, Huichi Zhou, Yiqiang Li, Tianshuo Zhou, Yue Yu, Chujie Gao, Qihui Zhang, Yi Gui, Zhen Li, Yao Wan, Pan Zhou, Jianfeng Gao, Lichao Sun

🏛 Institutions
Huazhong University of Science and Technology, University of Notre Dame, MSR, Lehigh University
📅 Date
June 16, 2024
📑 Publisher
ICLR 2025 (Poster)
💻 Env
Desktop, Mobile, Web
🔑 Keywords
TLDR

GUI-World is a benchmark and dataset for GUI-oriented multimodal understanding built around dynamic video content rather than static screenshots. It covers six GUI scenarios and eight question types across desktop, mobile, and web settings, and shows that current image and video MLLMs still struggle without manually selected keyframes or operation history.
