GUICourse: From General Vision Language Model to Versatile GUI Agent

Wentong Chen, Junbo Cui, Jinyi Hu, Yujia Qin, Junjie Fang, Yue Zhao, Chongyi Wang, Jun Liu, Guirong Chen, Yupeng Huo, Yuan Yao, Yankai Lin, Zhiyuan Liu, Maosong Sun

🏛 Institutions: Renmin University of China, Tsinghua, Xiamen University, Beijing University of Posts and Telecommunications, ModelBest, CAS, NUS, Shanghai Qi Zhi Institute
📅 Date: June 17, 2024
📑 Publisher: ACL 2025
💻 Env: General GUI
🔑 Keywords: dataset, GUIEnv, GUIAct, GUIChat, OCR and grounding

TLDR

GUICourse introduces a staged dataset suite for turning general vision-language models into GUI agents, with GUIEnv for OCR and grounding, GUIAct for GUI navigation, and GUIChat for GUI-related dialogue. The paper shows that these datasets let even a 3.1B model perform effectively on single-step and multi-step GUI tasks and transfer better to AITW and Mind2Web than the original VLM baselines.
