UI-Vision: A Desktop-centric GUI Benchmark for Visual Perception and Interaction

Shravan Nayak, Xiangru Jian, Kevin Qinghong Lin, Juan A. Rodriguez, Montek Kalsi, Rabiul Awal, Nicolas Chapados, M. Tamer Özsu, Aishwarya Agrawal, David Vazquez, Christopher Pal, Perouz Taslakian, Spandana Gella, Sai Rajeswar

🏛 Institutions: Mila, Université de Montréal, ServiceNow, University of Waterloo, NUS, École de Technologie Supérieure, Polytechnique Montréal
📅 Date: March 19, 2025
📑 Publisher: ICML 2025 (Poster)
💻 Env: Desktop
🔑 Keywords: benchmark, dataset, UI-Vision, element grounding, layout grounding, action prediction, drag-and-drop, spatial reasoning

TLDR

UI-Vision is a desktop GUI benchmark with dense human-demonstration annotations over 83 applications, covering element grounding, layout grounding, and action prediction. It exposes persistent weaknesses of current agents on professional software, spatial reasoning, and actions such as drag-and-drop, while providing an open benchmark for desktop-centric GUI evaluation.
