ShowUI: One Vision-Language-Action Model for GUI Visual Agent

Kevin Qinghong Lin, Linjie Li, Difei Gao, Zhengyuan Yang, Shiwei Wu, Zechen Bai, Weixian Lei, Lijuan Wang, Mike Zheng Shou

🏛 Institutions: Show Lab, NUS, Microsoft
📅 Date: November 26, 2024
📑 Publisher: CVPR 2025 (Poster)
💻 Env: Mobile, Web
🔑 Keywords: model, dataset, UI-guided visual token selection, interleaved vision-language-action streaming, screenshot grounding, ShowUI

TLDR

ShowUI is a lightweight vision-language-action model for GUI visual agents that targets efficient screenshot perception and action-history modeling. It introduces UI-guided visual token selection and interleaved vision-language-action streaming, reaching 75.1% accuracy on zero-shot screenshot grounding while remaining competitive on web and mobile GUI tasks.
