Aria-UI: Visual Grounding for GUI Instructions

Yuhao Yang, Yue Wang, Dongxu Li, Ziyang Luo, Bei Chen, Chao Huang, Junnan Li

🏛 Institutions: HKU, Salesforce AI Research, Alibaba Group, Australian National University, Independent Researcher
📅 Date: December 20, 2024
📑 Publisher: Findings of ACL 2025
💻 Env: General GUI
🔑 Keywords: model, GUI grounding, pure vision, instruction synthesis, context-aware grounding, Aria-UI

TLDR

Aria-UI is a GUI-grounding model that deliberately avoids HTML or AXTree inputs and instead works from pure visual observations. It pairs a scalable instruction-synthesis pipeline with action histories, supplied either as text only or as interleaved text and images, for context-aware grounding, and reports state-of-the-art results across offline and online grounding benchmarks.
