GUICourse: From General Vision Language Model to Versatile GUI Agent
Wentong Chen , Junbo Cui , Jinyi Hu , Yujia Qin , Junjie Fang , Yue Zhao , Chongyi Wang , Jun Liu , Guirong Chen , Yupeng Huo , Yuan Yao , Yankai Lin , Zhiyuan Liu , Maosong Sun
- 🏛 Institutions
- Renmin University of China , Tsinghua , Xiamen University , Beijing University of Posts and Telecommunications , ModelBest , CAS , NUS , Shanghai Qi Zhi Institute
- 📅 Date
- June 17, 2024
- 📑 Publisher
- ACL 2025
- 💻 Env
- General GUI
- 🔑 Keywords
TLDR
GUICourse introduces a staged dataset suite for turning general vision-language models into GUI agents, with GUIEnv for OCR and grounding, GUIAct for GUI navigation, and GUIChat for GUI-related dialogue. The paper shows that these datasets let even a 3.1B model perform effectively on single-step and multi-step GUI tasks and transfer better to AITW and Mind2Web than the original VLM baselines.
Related papers (24)
- Video2GUI: Synthesizing Large-Scale Interaction Trajectories for Generalized GUI Agent PretrainingMay 14, 2026 · arXiv
- Moving Beyond Sparse Grounding with Complete Screen Parsing SupervisionFebruary 15, 2026 · arXiv
- GUIGuard: Toward a General Framework for Privacy-Preserving GUI AgentsJanuary 26, 2026 · arXiv
- Beyond Clicking: A Step Towards Generalist GUI Grounding via Text DraggingNovember 7, 2025 · arXiv
- VideoAgentTrek: Computer Use Pretraining from Unlabeled VideosOctober 22, 2025 · arXiv
- Scaling Synthetic Task Generation for Agents via ExplorationSeptember 29, 2025 · ICLR 2026 (Poster)
- Scaling Computer‑Use Grounding via User Interface Decomposition and SynthesisMay 19, 2025 · NeurIPS 2025 Datasets and Benchmarks Track (Spotlight)
- TongUI: Internet-Scale Trajectories from Multimodal Web Tutorials for Generalized GUI AgentsApril 17, 2025 · AAAI 2026
- UI-E2I-Synth: Advancing GUI Grounding with Large-Scale Instruction SynthesisApril 15, 2025 · Findings of ACL 2025
- OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task SynthesisDecember 27, 2024 · ACL 2025
- Falcon-UI: Understanding GUI Before Following User InstructionsDecember 12, 2024 · arXiv
- Aguvis: Unified Pure Vision Agents for Autonomous GUI InteractionDecember 5, 2024 · ICML 2025 (Poster)
- EDGE: Enhanced Grounded GUI Understanding with Enriched Multi-Granularity Synthetic DataOctober 25, 2024 · arXiv
- OmniParser for Pure Vision Based GUI AgentAugust 1, 2024 · arXiv
- Read Anywhere Pointed: Layout-aware GUI Screen Reading with Tree-of-Lens GroundingJune 27, 2024 · EMNLP 2024 (Poster)
- VGA: Vision GUI Assistant - Minimizing Hallucinations through Image-Centric Fine-TuningJune 20, 2024 · Findings of EMNLP 2024
- ScreenAI: A Vision-Language Model for UI and Infographics UnderstandingFebruary 7, 2024 · IJCAI 2024
- SheetCopilot: Bringing Software Productivity to the Next Level through Large Language ModelsMay 30, 2023 · NeurIPS 2023
- WebForge: Breaking the Realism-Reproducibility-Scalability Trilemma in Browser Agent BenchmarkApril 13, 2026 · arXiv
- MolmoWeb: Open Visual Web Agent and Open Data for the Open WebApril 9, 2026 · arXiv
- Gym-Anything: Turn any Software into an Agent EnvironmentApril 7, 2026 · arXiv
- WebArena-Infinity: Generating Browser Environments with Verifiable Tasks at ScaleMarch 2026 · Blog Post
- PSPA-Bench: A Personalized Benchmark for Smartphone GUI AgentMarch 31, 2026 · arXiv
- CUA-Suite: Massive Human-annotated Video Demonstrations for Computer-Use AgentsMarch 25, 2026 · arXiv