AutoGUI-v2: A Comprehensive Multi-Modal GUI Functionality Understanding Benchmark

Hongxin Li, Xiping Wang, Jingran Su, Zheng Ju, Yuntao Chen, Qing Li, Zhaoxiang Zhang

🏛 Institutions: UCAS, CASIA, PolyU, Shanghai AI Laboratory
📅 Date: April 27, 2026
📑 Publisher: arXiv
💻 Env: General GUI
🔑 Keywords: benchmark, GUI functionality understanding, region semantics, state prediction, AutoGUI-v2

TLDR

AutoGUI-v2 unifies region-level semantics, element grounding, and state prediction into 2,753 tasks spanning six operating systems, addressing the split between black-box task-completion benchmarks and shallow grounding benchmarks. Open-source models excel at functional grounding, while commercial models do better at functionality description, but all models struggle with the complex interaction logic of uncommon actions.
