Think Twice, Click Once: Enhancing GUI Grounding via Fast and Slow Systems
Fei Tang, Yongliang Shen, Hang Zhang, Siqi Chen, Guiyang Hou, Wenqi Zhang, Wenqiao Zhang, Kaitao Song, Weiming Lu, Yueting Zhuang
- 🏛 Institutions: ZJU, MSR Asia
- 📅 Date: March 9, 2025
- 📑 Publisher: arXiv
- 💻 Env: General GUI
- 🔑 Keywords
TLDR
Focus is a GUI grounding model that switches between fast prediction and slower analysis depending on task complexity. It decomposes grounding into summarization, focused visual analysis, and coordinate prediction, and reaches strong ScreenSpot and ScreenSpot-Pro performance with a 2B model trained on 300K examples.
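The control flow described in the summary can be sketched roughly as follows. This is only an illustrative outline, not the authors' implementation: the `Stages` container, the `ground` function, and the `is_complex` check are all hypothetical names, and each stage is a placeholder callable standing in for a model call.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

Point = Tuple[float, float]


@dataclass
class Stages:
    # All four callables are hypothetical stand-ins for model calls.
    fast_predict: Callable[[str, object], Point]         # direct coordinate guess
    summarize: Callable[[str, object], str]              # interface summarization
    analyze: Callable[[str, object, str], str]           # focused visual analysis
    predict_coords: Callable[[str, object, str], Point]  # final coordinate prediction


def ground(
    instruction: str,
    screenshot: object,
    stages: Stages,
    is_complex: Callable[[str, object], bool],
) -> Tuple[str, Point]:
    """Return the path taken ("fast" or "slow") and the predicted click point."""
    # Fast path: simple cases go straight to coordinate prediction.
    if not is_complex(instruction, screenshot):
        return "fast", stages.fast_predict(instruction, screenshot)

    # Slow path: summarize, then analyze the relevant region, then predict.
    summary = stages.summarize(instruction, screenshot)
    focus = stages.analyze(instruction, screenshot, summary)
    return "slow", stages.predict_coords(instruction, screenshot, focus)
```

How task complexity is actually judged is not stated in the summary, so the sketch leaves it as an injected function rather than guessing at a rule.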
Related papers
- UI-AGILE: Advancing GUI Agents with Effective Reinforcement Learning and Precise Inference-Time Grounding · July 29, 2025 · CVPR 2026 Findings
- Aria-UI: Visual Grounding for GUI Instructions · December 20, 2024 · Findings of ACL 2025
- AFRAgent : An Adaptive Feature Renormalization Based High Resolution Aware GUI agent · November 30, 2025 · WACV 2026
- UI-TARS: Pioneering Automated GUI Interaction with Native Agents · January 21, 2025 · arXiv
- Iris: Breaking GUI Complexity with Adaptive Focus and Self-Refining · December 13, 2024 · arXiv
- Ponder & Press: Advancing Visual GUI Agent towards General Computer Control · December 2, 2024 · Findings of ACL 2025