MobileVLM: A Vision-Language Model for Better Intra- and Inter-UI Understanding

Qinzhuo Wu, Weikai Xu, Wei Liu, Tao Tan, Jianfeng Liu, Ang Li, Jian Luan, Bin Wang, Shuo Shang

🏛 Institutions: XiaoMi AI Lab, University of Electronic Science and Technology of China, Renmin University of China
📅 Date: September 23, 2024
📑 Publisher: Findings of EMNLP 2024
💻 Env: Mobile
🔑 Keywords: model, dataset, Mobile3M, intra-UI understanding, inter-UI understanding, MobileVLM

TLDR

MobileVLM is a mobile-focused vision-language model trained with two extra UI-specific pretraining stages designed to improve both intra-UI element understanding and inter-UI transition understanding. The paper also introduces the 3M-page Chinese mobile corpus Mobile3M with real transition-action graphs, and reports stronger performance than prior VLMs on in-house and public mobile benchmarks.
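
As a rough illustration of what a transition-action graph over UI pages might look like, the sketch below models pages as nodes and actions as directed edges. Every name in it (`UIGraph`, `add_transition`, the page and action strings) is hypothetical and is not taken from the paper or from Mobile3M's actual data format.

```python
from collections import defaultdict


class UIGraph:
    """Hypothetical directed graph: nodes are UI pages, edges are actions."""

    def __init__(self):
        # Maps page -> {action -> destination page}.
        self.edges = defaultdict(dict)

    def add_transition(self, src, action, dst):
        """Record that taking `action` on page `src` leads to page `dst`."""
        self.edges[src][action] = dst

    def actions(self, page):
        """Return the actions recorded for a single page, sorted by name."""
        return sorted(self.edges.get(page, {}))

    def next_page(self, page, action):
        """Return the page reached by `action`, or None if it is unknown."""
        return self.edges.get(page, {}).get(action)


# Assumed example data, invented for illustration only.
graph = UIGraph()
graph.add_transition("home", "click(search_box)", "search")
graph.add_transition("search", "input(query)", "results")
graph.add_transition("search", "click(back)", "home")
```

In this sketch, `actions` answers a question about one page, while `next_page` answers a question about how two pages connect, which loosely mirrors the intra-UI versus inter-UI distinction in the summary above.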
