MobileFlow: A Multimodal LLM for Mobile GUI Agent

Songqin Nong, Jiali Zhu, Rui Wu, Jiongchao Jin, Shuo Shan, Xiutian Huang, Wenhao Xu

🏛 Institutions: Ant Group
📅 Date: July 5, 2024
📑 Publisher: arXiv
💻 Env: Mobile
🔑 Keywords: model, hybrid visual encoders, multilingual GUI, Mixture of Experts, GUI alignment, MobileFlow

TLDR

MobileFlow adapts Qwen-VL-Chat into a 21B mobile GUI model with hybrid visual encoders, MoE expansion, and GUI-specific alignment and chain-of-thought training. The model is built to handle variable-resolution screens and multilingual interfaces without depending on system APIs for page layout access.
