Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction

Yiheng Xu, Zekun Wang, Junli Wang, Dunjie Lu, Tianbao Xie, Amrita Saha, Doyen Sahoo, Tao Yu, Caiming Xiong

🏛 Institutions: HKU, Salesforce AI Research
📅 Date: December 5, 2024
📑 Publisher: ICML 2025 (Poster)
💻 Env: General GUI
🔑 Keywords: model, dataset, pure vision, inner monologue, two-stage training, Aguvis

TLDR

Aguvis is a pure-vision GUI agent that removes textual interface representations and operates directly on screen images. It combines a large grounding-and-reasoning dataset with a two-stage training pipeline and inner-monologue reasoning, reporting strong offline and online performance without relying on closed-source models.
