GUI-World: A Video Benchmark and Dataset for Multimodal GUI-oriented Understanding
Dongping Chen, Yue Huang, Siyuan Wu, Jingyu Tang, Liuyi Chen, Yilin Bai, Zhigang He, Chenlong Wang, Huichi Zhou, Yiqiang Li, Tianshuo Zhou, Yue Yu, Chujie Gao, Qihui Zhang, Yi Gui, Zhen Li, Yao Wan, Pan Zhou, Jianfeng Gao, Lichao Sun
- 🏛 Institutions
- Huazhong University of Science and Technology, University of Notre Dame, MSR, Lehigh University
- 📅 Date
- June 16, 2024
- 📑 Publisher
- ICLR 2025 (Poster)
- 💻 Env
- Desktop, Mobile, Web
TLDR
GUI-World is a benchmark and dataset for GUI-oriented multimodal understanding built around dynamic video content rather than static screenshots. It covers six GUI scenarios and eight question types across desktop, mobile, and web settings, and shows that current image and video MLLMs still struggle without manually selected keyframes or operation history.
Related papers (24)
- SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents · January 17, 2024 · ACL 2024
- NaturalGAIA: Pushing the Frontiers of GUI Agents with a Challenging Benchmark and High-Quality Trajectory Dataset · August 2, 2025 · arXiv
- GUI Action Narrator: Where and When Did That Action Take Place? · June 19, 2024 · arXiv
- OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web · February 29, 2024 · ECCV 2024 (Poster)
- WebForge: Breaking the Realism-Reproducibility-Scalability Trilemma in Browser Agent Benchmark · April 13, 2026 · arXiv
- Gym-Anything: Turn any Software into an Agent Environment · April 7, 2026 · arXiv
- WebArena-Infinity: Generating Browser Environments with Verifiable Tasks at Scale · March 2026 · Blog Post
- PSPA-Bench: A Personalized Benchmark for Smartphone GUI Agent · March 31, 2026 · arXiv
- SecAgent: Efficient Mobile GUI Agent with Semantic Context · March 9, 2026 · arXiv
- WebChain: A Large-Scale Human-Annotated Dataset of Real-World Web Interaction Traces · March 5, 2026 · arXiv
- Turing Test on Screen: A Benchmark for Mobile GUI Agent Humanization · February 24, 2026 · arXiv
- AmbiBench: Benchmarking Mobile GUI Agents Beyond One-Shot Instructions in the Wild · February 12, 2026 · arXiv
- When Actions Go Off-Task: Detecting and Correcting Misaligned Actions in Computer-Use Agents · February 9, 2026 · arXiv
- MemGUI-Bench: Benchmarking Memory of Mobile GUI Agents in Dynamic Environments · February 3, 2026 · arXiv
- SwipeGen: Bridging the Execution Gap in GUI Agents via Human-like Swipe Synthesis · January 26, 2026 · arXiv
- SMAN-Bench: A Cross-System Benchmark for Mobile Agents under Single- and Multi-path, Ambiguous, and Noisy Tasks · January 26, 2026 · ICLR 2026 (Poster)
- ShowUI-π: Flow-based Generative Models as GUI Dexterous Hands · December 31, 2025 · arXiv
- VenusBench-GD: A Comprehensive Multi-Platform GUI Benchmark for Diverse Grounding Tasks · December 18, 2025 · arXiv
- OS-Oracle: A Comprehensive Framework for Cross-Platform GUI Critic Models · December 18, 2025 · arXiv
- MobileWorldBench: Towards Semantic World Modeling For Mobile Agents · December 16, 2025 · arXiv
- GUI-360: A Comprehensive Dataset and Benchmark for Computer-Using Agents · November 6, 2025 · arXiv
- Web-CogReasoner: Towards Knowledge-Induced Cognitive Reasoning for Web Agents · August 3, 2025 · ICLR 2026 (Poster)
- FingerTip 20K: A Benchmark for Proactive and Personalized Mobile LLM Agents · June 9, 2025 · ICLR 2026 (Poster)
- Web-Shepherd: Advancing PRMs for Reinforcing Web Agents · May 21, 2025 · NeurIPS 2025 (Spotlight)