Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents

Boyu Gou, Ruohan Wang, Boyuan Zheng, Yanan Xie, Cheng Chang, Yiheng Shu, Huan Sun, Yu Su

🏛 Institutions: OSU, Orby AI
📅 Date: October 7, 2024
📑 Publisher: ICLR 2025 (Oral)
💻 Env: Desktop, Mobile, Web
🔑 Keywords: dataset, GUI grounding, vision-only agents, cross-platform grounding, UGround, synthetic data

TLDR

This paper introduces UGround, a universal GUI visual grounding model trained on 10M element-expression pairs over 1.3M screenshots from web, mobile, and desktop interfaces. It argues for vision-only GUI agents with pixel-level actions and shows that UGround improves grounding, offline-agent, and online-agent performance across six benchmarks.
