ShowUI: One Vision-Language-Action Model for GUI Visual Agent
Kevin Qinghong Lin, Linjie Li, Difei Gao, Zhengyuan Yang, Shiwei Wu, Zechen Bai, Weixian Lei, Lijuan Wang, Mike Zheng Shou
- 🏛 Institutions
- Show Lab, NUS, Microsoft
- 📅 Date
- November 26, 2024
- 📑 Publisher
- CVPR 2025 (Poster)
- 💻 Env
- Mobile, Web
- 🔑 Keywords
TLDR
ShowUI is a lightweight vision-language-action model for GUI visual agents that targets efficient screenshot perception and action-history modeling. It introduces UI-guided visual token selection and interleaved vision-language-action streaming, achieving 75.1% accuracy in zero-shot screenshot grounding while remaining competitive on web and mobile GUI navigation tasks.
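The core idea of UI-guided visual token selection is that screenshots contain large visually redundant regions (blank backgrounds, uniform panels), so redundant patch tokens can be grouped and pruned before they reach the language model. The sketch below is a toy illustration under that assumption, not the paper's actual graph-based algorithm: it greedily merges consecutive patches whose features are nearly identical and keeps one representative token per run (`select_ui_tokens` and its `thresh` parameter are hypothetical names for illustration).

```python
import numpy as np

def select_ui_tokens(patches: np.ndarray, thresh: float = 1e-3) -> list[int]:
    """Toy sketch of UI-guided visual token selection.

    Assumption: redundant GUI regions yield near-identical patch features,
    so a run of similar patches can be represented by a single token.
    patches: (N, D) array of patch features, scanned in raster order.
    Returns indices of the patches kept as visual tokens.
    """
    if len(patches) == 0:
        return []
    keep = [0]
    for i in range(1, len(patches)):
        # Keep a patch only if it visibly differs from the last kept one;
        # otherwise it is treated as part of the same redundant region.
        if np.abs(patches[i] - patches[keep[-1]]).mean() > thresh:
            keep.append(i)
    return keep

# Example: a screenshot row with two uniform regions collapses to 3 tokens.
row = np.array([[0.0, 0.0], [0.0, 0.0], [1.0, 1.0], [1.0, 1.0], [0.0, 0.0]])
kept = select_ui_tokens(row)  # → [0, 2, 4]
```

The real method builds a connected graph over RGB-similar patches across the full 2-D grid, but the same principle applies: token count drops roughly in proportion to how much of the screen is visually uniform.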
Related papers
- SpiritSight Agent: Advanced GUI Agent with One Look · March 5, 2025 · CVPR 2025 (Poster)
- OS-ATLAS: A Foundation Action Model for Generalist GUI Agents · October 30, 2024 · ICLR 2025 (Spotlight)
- CogAgent: A Visual Language Model for GUI Agents · December 14, 2023 · CVPR 2024 (Highlight)
- MolmoWeb: Open Visual Web Agent and Open Data for the Open Web · April 9, 2026 · arXiv
- SecAgent: Efficient Mobile GUI Agent with Semantic Context · March 9, 2026 · arXiv
- OpenCUA: Open Foundations for Computer-Use Agents · August 12, 2025 · NeurIPS 2025 (Spotlight)