Think Twice, Click Once: Enhancing GUI Grounding via Fast and Slow Systems

Fei Tang, Yongliang Shen, Hang Zhang, Siqi Chen, Guiyang Hou, Wenqi Zhang, Wenqiao Zhang, Kaitao Song, Weiming Lu, Yueting Zhuang

🏛 Institutions: ZJU, MSR Asia
📅 Date: March 9, 2025
📑 Publisher: arXiv
💻 Env: General GUI
🔑 Keywords: model, GUI grounding, dual-system cognition, adaptive system switching, progressive decomposition, Focus

TLDR

Focus is a GUI grounding model that switches between fast prediction and slower analysis depending on task complexity. It decomposes grounding into summarization, focused visual analysis, and coordinate prediction, and reaches strong ScreenSpot and ScreenSpot-Pro performance with a 2B model trained on 300K examples.
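The control flow described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: every name here (`ground`, `GroundingResult`, the `toy_*` helpers) is hypothetical, and because the summary does not say how the fast/slow switch is decided, the sketch simply takes that decision as a caller-supplied predicate. The three slow-path stages mirror the summarization, focused visual analysis, and coordinate prediction steps named in the TLDR.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

Point = Tuple[float, float]


@dataclass(frozen=True)
class GroundingResult:
    point: Point             # predicted click location
    system: str              # "fast" or "slow"
    trace: Tuple[str, ...]   # intermediate text from the slow path; empty on the fast path


def ground(
    instruction: str,
    screenshot: object,
    *,
    is_complex: Callable[[str, object], bool],       # assumed switching rule
    summarize: Callable[[object], str],              # stage 1: interface summary
    analyze: Callable[[str, object, str], str],      # stage 2: focused visual analysis
    predict: Callable[[str, object, str], Point],    # stage 3: coordinate prediction
) -> GroundingResult:
    """Route a grounding request through a fast or a slow path."""
    if not is_complex(instruction, screenshot):
        # Fast path: predict coordinates directly, with no intermediate analysis.
        return GroundingResult(predict(instruction, screenshot, ""), "fast", ())
    # Slow path: summarize, analyze the relevant region, then predict.
    summary = summarize(screenshot)
    analysis = analyze(instruction, screenshot, summary)
    point = predict(instruction, screenshot, analysis)
    return GroundingResult(point, "slow", (summary, analysis))


# Toy stand-ins so the sketch runs; a real system would call a model here.
def toy_is_complex(instruction: str, screenshot: object) -> bool:
    return len(instruction.split()) > 4  # placeholder heuristic, not from the paper


def toy_summarize(screenshot: object) -> str:
    return "summary of the interface"


def toy_analyze(instruction: str, screenshot: object, summary: str) -> str:
    return f"region relevant to: {instruction}"


def toy_predict(instruction: str, screenshot: object, analysis: str) -> Point:
    return (0.5, 0.5)  # fixed placeholder coordinates


def toy_ground(instruction: str) -> GroundingResult:
    return ground(
        instruction,
        None,
        is_complex=toy_is_complex,
        summarize=toy_summarize,
        analyze=toy_analyze,
        predict=toy_predict,
    )
```

With these stand-ins, `toy_ground("click OK")` takes the fast path and returns an empty trace, while a longer instruction such as `toy_ground("open the export dialog from the file menu")` takes the slow path and records the summary and analysis text before predicting.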
