GUI-360: A Comprehensive Dataset and Benchmark for Computer-Using Agents

Jian Mu, Chaoyun Zhang, Chiming Ni, Lu Wang, Bo Qiao, Kartik Mathur, Qianhui Wu, Yuhang Xie, Xiaojun Ma, Mengyu Zhou, Si Qin, Liqun Li, Yu Kang, Minghua Ma, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang

🏛 Institutions: Microsoft, NJU, ZJU-UIUC, PKU
📅 Date: November 6, 2025
📑 Publisher: arXiv
💻 Env: Desktop
🔑 Keywords: dataset, benchmark, Windows, accessibility metadata, reasoning supervision, GUI-360

TLDR

GUI-360 addresses the lack of large-scale real-world data and unified evaluation for computer-using agents (CUAs) by releasing 1.2M+ executed action steps across thousands of trajectories in popular Windows office applications. The release includes full-resolution screenshots, accessibility metadata, intermediate reasoning, and both successful and failed trajectories. It is the first corpus to jointly cover GUI grounding, screen parsing, action prediction, and API-level actions, and it exposes cascading failures of off-the-shelf VLMs on heterogeneous layouts.
