CogAgent: A Visual Language Model for GUI Agents

Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxiao Dong, Ming Ding, Jie Tang

🏛 Institutions: Tsinghua, Zhipu
📅 Date: December 14, 2023
📑 Publisher: CVPR 2024 (Highlight)
💻 Env: Mobile, Web
🔑 Keywords: model, dataset, high-resolution GUI understanding, dual-resolution encoders, Mind2Web, AITW, CogAgent

TLDR

CogAgent is an 18B visual language model specialized for GUI understanding and navigation. It combines low- and high-resolution image encoders, trains on a large GUI-and-OCR dataset, and outperforms HTML-consuming baselines on Mind2Web and AITW using screenshots alone.

Related papers (24)

SpiritSight Agent: Advanced GUI Agent with One Look
March 5, 2025 · CVPR 2025 (Poster)

ShowUI: One Vision-Language-Action Model for GUI Visual Agent
November 26, 2024 · CVPR 2025 (Poster)

OS-ATLAS: A Foundation Action Model for Generalist GUI Agents
October 30, 2024 · ICLR 2025 (Spotlight)

MolmoWeb: Open Visual Web Agent and Open Data for the Open Web
April 9, 2026 · arXiv

SecAgent: Efficient Mobile GUI Agent with Semantic Context
March 9, 2026 · arXiv

OpenCUA: Open Foundations for Computer-Use Agents
August 12, 2025 · NeurIPS 2025 (Spotlight)

UI-Genie: A Self-Improving Approach for Iteratively Boosting MLLM-based Mobile GUI Agents
May 27, 2025 · NeurIPS 2025 (Poster)

Web-Shepherd: Advancing PRMs for Reinforcing Web Agents
May 21, 2025 · NeurIPS 2025 (Spotlight)

MobileVLM: A Vision-Language Model for Better Intra- and Inter-UI Understanding
September 23, 2024 · Findings of EMNLP 2024

Android in the Wild: A Large-Scale Dataset for Android Device Control
July 19, 2023 · NeurIPS 2023 Datasets and Benchmarks Track

Mind2Web: Towards a Generalist Agent for the Web
June 9, 2023 · NeurIPS 2023 Datasets and Benchmarks Track

Multimodal Web Navigation with Instruction-Finetuned Foundation Models
May 19, 2023 · ICLR 2024

Spotlight: Mobile UI Understanding using Vision-Language Models with a Focus
September 29, 2022 · ICLR 2023 (Poster)

Mobile-Agent-v3.5: Multi-platform Fundamental GUI Agents
February 15, 2026 · arXiv

ShowUI-π: Flow-based Generative Models as GUI Dexterous Hands
December 31, 2025 · arXiv

ScaleCUA: Scaling Open-Source Computer Use Agents with Cross-Platform Data
September 18, 2025 · ICLR 2026 (Oral)

UI-Venus Technical Report: Building High-performance UI Agents with RFT
August 14, 2025 · arXiv

Efficient Agent Training for Computer Use
May 20, 2025 · ICLR 2026 (Poster)

STEVE: A Step Verification Pipeline for Computer-use Agent Training
March 16, 2025 · arXiv

UI-TARS: Pioneering Automated GUI Interaction with Native Agents
January 21, 2025 · arXiv

Falcon-UI: Understanding GUI Before Following User Instructions
December 12, 2024 · arXiv

Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction
December 5, 2024 · ICML 2025 (Poster)

Ponder & Press: Advancing Visual GUI Agent towards General Computer Control
December 2, 2024 · Findings of ACL 2025

AutoGLM: Autonomous Foundation Agents for GUIs
October 28, 2024 · arXiv