GUI Agents Papers

MobileVLM: A Vision-Language Model for Better Intra- and Inter-UI Understanding

Qinzhuo Wu, Weikai Xu, Wei Liu, Tao Tan, Jianfeng Liu, Ang Li, Jian Luan, Bin Wang, Shuo Shang

🏛 Institutions
XiaoMi AI Lab, University of Electronic Science and Technology of China, Renmin University of China
📅 Date
September 23, 2024
📑 Publisher
Findings of EMNLP 2024
💻 Env
Mobile
🔑 Keywords
TLDR

MobileVLM is a mobile-focused vision-language model trained with two additional UI-specific pre-training stages that target both element-level understanding within a single page (intra-UI) and understanding of transitions between pages (inter-UI). The paper also introduces Mobile3M, a Chinese mobile corpus of 3 million UI pages with real transition-action graphs, and reports stronger performance than prior VLMs on both in-house and public mobile benchmarks.
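
To make the idea of a transition-action graph concrete, here is a minimal sketch in Python. It assumes pages are nodes and each action is a directed edge to the page it leads to. The class names, fields, and page identifiers (`Action`, `TransitionGraph`, `"home"`, `"search_box"`, and so on) are hypothetical illustrations, not the Mobile3M schema or the authors' code.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """Hypothetical action record; not the Mobile3M format."""
    kind: str    # e.g. "click", "scroll", "input"
    target: str  # hypothetical identifier of the UI element acted on


class TransitionGraph:
    """Directed graph: nodes are UI pages, edges are actions between pages."""

    def __init__(self):
        # Maps source page -> {action: destination page}.
        self._edges = defaultdict(dict)

    def add_transition(self, src, action, dst):
        """Record that performing `action` on page `src` leads to page `dst`."""
        self._edges[src][action] = dst

    def next_page(self, src, action):
        """Return the page reached from `src` via `action`, or None if unknown."""
        return self._edges.get(src, {}).get(action)

    def actions_from(self, src):
        """List the actions recorded for page `src`, in insertion order."""
        return list(self._edges.get(src, {}))


# Toy example with made-up page names.
graph = TransitionGraph()
graph.add_transition("home", Action("click", "search_box"), "search")
graph.add_transition("search", Action("input", "query_field"), "results")
```

In this sketch, listing the actions available on one page loosely corresponds to an intra-UI question, while looking up which page an action leads to corresponds to an inter-UI question.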
