Think Twice, Click Once: Enhancing GUI Grounding via Fast and Slow Systems
Fei Tang, Yongliang Shen, Hang Zhang, Siqi Chen, Guiyang Hou, Wenqi Zhang, Wenqiao Zhang, Kaitao Song, Weiming Lu, Yueting Zhuang
- 🏛 Institutions
- ZJU, MSR Asia
- 📅 Date
- March 9, 2025
- 📑 Publisher
- arXiv
- 💻 Env
- General GUI
TLDR
Focus is a GUI grounding model that switches between fast prediction and slower analysis depending on task complexity. It decomposes grounding into summarization, focused visual analysis, and coordinate prediction, and reaches strong ScreenSpot and ScreenSpot-Pro performance with a 2B model trained on 300K examples.
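The control flow described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the summary does not say how Focus decides when to switch modes, so the `estimate_complexity` scorer and the `threshold` value are assumptions, and every callable (`predict_point`, `summarize`, `analyze_region`) is a hypothetical stand-in for a model call.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Point = Tuple[float, float]


@dataclass
class GroundingResult:
    point: Point                 # predicted click coordinates
    mode: str                    # "fast" or "slow"
    trace: List[str] = field(default_factory=list)  # intermediate text (slow mode only)


def ground(
    instruction: str,
    screenshot: object,
    estimate_complexity: Callable[[str, object], float],  # hypothetical scorer
    predict_point: Callable[[str, object, str], Point],   # hypothetical coordinate head
    summarize: Callable[[object], str],                   # hypothetical interface summarizer
    analyze_region: Callable[[str, object, str], str],    # hypothetical focused analysis
    threshold: float = 0.5,                               # assumed cutoff, not from the paper
) -> GroundingResult:
    """Route a grounding request through a fast or a slow path."""
    # Fast path: predict coordinates directly, with no extra context.
    if estimate_complexity(instruction, screenshot) < threshold:
        return GroundingResult(predict_point(instruction, screenshot, ""), "fast")

    # Slow path: summarization, then focused visual analysis,
    # then coordinate prediction conditioned on both.
    summary = summarize(screenshot)
    analysis = analyze_region(instruction, screenshot, summary)
    context = summary + "\n" + analysis
    point = predict_point(instruction, screenshot, context)
    return GroundingResult(point, "slow", [summary, analysis])
```

Passing the stages in as callables keeps the sketch independent of any particular model or framework; in a single-model system such as the one described, all of them would likely be different prompts to the same network.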
Related papers (24)
- UI-AGILE: Advancing GUI Agents with Effective Reinforcement Learning and Precise Inference-Time Grounding · July 29, 2025 · CVPR 2026 Findings
- Aria-UI: Visual Grounding for GUI Instructions · December 20, 2024 · Findings of ACL 2025
- AFRAgent : An Adaptive Feature Renormalization Based High Resolution Aware GUI agent · November 30, 2025 · WACV 2026
- UI-TARS: Pioneering Automated GUI Interaction with Native Agents · January 21, 2025 · arXiv
- Iris: Breaking GUI Complexity with Adaptive Focus and Self-Refining · December 13, 2024 · arXiv
- Ponder & Press: Advancing Visual GUI Agent towards General Computer Control · December 2, 2024 · Findings of ACL 2025
- OS-ATLAS: A Foundation Action Model for Generalist GUI Agents · October 30, 2024 · ICLR 2025 (Spotlight)
- GUI-C²: Coarse-to-Fine GUI Grounding via Difficulty-Aware Reinforcement Learning · May 29, 2026 · arXiv
- Training Computer Use Agents to Assess the Usability of Graphical User Interfaces · April 28, 2026 · arXiv
- UI-Zoomer: Uncertainty-Driven Adaptive Zoom-In for GUI Grounding · April 15, 2026 · arXiv
- GUI-Perturbed: Domain Randomization Reveals Systematic Brittleness in GUI Grounding Models · April 15, 2026 · arXiv
- See, Point, Refine: Multi-Turn Approach to GUI Grounding with Visual Feedback · April 14, 2026 · arXiv
- What's Missing in Screen-to-Action? Towards a UI-in-the-Loop Paradigm for Multimodal GUI Reasoning · April 8, 2026 · Findings of ACL 2026
- Towards GUI Agents: Vision-Language Diffusion Models for GUI Grounding · March 27, 2026 · CVPR 2026
- AdaZoom-GUI: Adaptive Zoom-based GUI Grounding with Instruction Refinement · March 18, 2026 · arXiv
- Zoom to Essence: Trainless GUI Grounding by Inferring upon Interface Elements · March 15, 2026 · arXiv
- Moving Beyond Sparse Grounding with Complete Screen Parsing Supervision · February 15, 2026 · arXiv
- Trifuse: Enhancing Attention-Based GUI Grounding via Multimodal Fusion · February 6, 2026 · arXiv
- POINTS-GUI-G: GUI-Grounding Journey · February 6, 2026 · arXiv
- SSL: Sweet Spot Learning for Differentiated Guidance in Agentic Optimization · January 30, 2026 · arXiv
- V2P: Visual Attention Calibration for GUI Grounding via Background Suppression and Center Peaking · January 11, 2026 · arXiv
- MVP: Multiple View Prediction Improves GUI Grounding · December 9, 2025 · arXiv
- Zoom in, Click out: Unlocking and Evaluating the Potential of Zooming for GUI Grounding · December 5, 2025 · arXiv
- Beyond Clicking: A Step Towards Generalist GUI Grounding via Text Dragging · November 7, 2025 · arXiv