GUI-Eyes: Tool-Augmented Perception for Visual Grounding in GUI Agents

Chen Chen, Jiawei Shao, Dakuan Lu, Haoyi Hu, Xiangcheng Liu, Hantao Yao, Wu Liu

🏛 Institutions: USTC, Institute of Artificial Intelligence (TeleAI), China Telecom, Shanghai Innovation Institute, SJTU
📅 Date: January 14, 2026
📑 Publisher: arXiv
💻 Env: General GUI
🔑 Keywords: reinforcement learning, active perception, tool-augmented perception, ScreenSpot-Pro, GUI-Eyes

TLDR

GUI-Eyes frames GUI grounding as active perception, letting the agent learn when and how to call tools such as cropping and zooming inside a two-stage reasoning process. It pairs that policy with a spatially continuous reward for tool use and reaches 44.8% grounding accuracy on ScreenSpot-Pro using only 3k labeled samples.
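
Cropping and zooming imply some coordinate bookkeeping: a point predicted on the magnified crop has to be mapped back to the full screenshot before it can be scored or clicked. A minimal Python sketch of that mapping follows. The `CropZoom` class, its fields, and the example numbers are illustrative assumptions, not the paper's actual tool interface or reward.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class CropZoom:
    """Hypothetical crop-and-zoom tool call (not the paper's API).

    The crop box is given in full-screenshot pixels; `zoom` is the
    magnification applied to the crop before the second reasoning stage.
    """

    left: float
    top: float
    width: float
    height: float
    zoom: float

    def to_screen(self, x: float, y: float) -> tuple[float, float]:
        """Map a point predicted on the zoomed crop back to screenshot pixels."""
        return (self.left + x / self.zoom, self.top + y / self.zoom)

    def contains(self, x: float, y: float) -> bool:
        """Return True if a screenshot-space point lies inside the crop box."""
        return (
            self.left <= x <= self.left + self.width
            and self.top <= y <= self.top + self.height
        )


# Example: a 400x200 crop at (800, 400), magnified 4x for the second look.
tool = CropZoom(left=800, top=400, width=400, height=200, zoom=4.0)
screen_point = tool.to_screen(200, 100)  # point predicted on the zoomed view
```

A check such as `tool.contains(*screen_point)` is one simple way to confirm that a crop kept the predicted target in view; how GUI-Eyes actually scores tool use is described in the paper, not here.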
