GUI Agents with Foundation Models: A Comprehensive Survey

Shuai Wang, Weiwen Liu, Jingxuan Chen, Yuqi Zhou, Weinan Gan, Xingshan Zeng, Yuhan Che, Shuai Yu, Xinlong Hao, Kun Shao, Bin Wang, Chuhan Wu, Yasheng Wang, Ruiming Tang, Jianye Hao

🏛 Institutions: Huawei Noah's Ark Lab
📅 Date: November 7, 2024
📑 Publisher: arXiv
💻 Env: General GUI
🔑 Keywords: survey, foundation models, taxonomy, industrial applications

TLDR

This survey organizes foundation-model GUI agents around data resources, agent construction, taxonomy, and industrial applications. It also summarizes three open challenges: the benchmark-reality gap, agent self-evolution, and inference efficiency.

Related papers (24)

Generalist Virtual Agents: A Survey on Autonomous Agents Across Digital Platforms

November 17, 2024 · arXiv
How Smart Is Your GUI Agent? A Framework for the Future of Software Interaction

February 12, 2026 · arXiv
A Survey on the Safety and Security Threats of Computer-Using Agents: JARVIS or Ultron?

May 16, 2025 · arXiv
A Survey on GUI Agents with Foundation Models Enhanced by Reinforcement Learning

April 29, 2025 · arXiv
Towards Trustworthy GUI Agents: A Survey

March 30, 2025 · arXiv
OS Agents: A Survey on MLLM-based Agents for Computer, Phone and Browser Use

December 20, 2024 · ACL 2025
GUI Agents: A Survey

December 18, 2024 · Findings of ACL 2025
Mapping the Design Space of User Experience for Computer Use Agents

February 7, 2026 · IUI 2026
LLM-Powered GUI Agents in Phone Automation: Surveying Progress and Prospects

April 28, 2025 · TMLR 2025
A Survey of WebAgents: Towards Next-Generation AI Agents for Web Automation with Large Foundation Models

March 30, 2025 · KDD 2025
WebSuite: Systematically Evaluating Why Web Agents Fail

June 1, 2024 · arXiv
Naive Visual Memory is Not Enough: A Failure-Mode Study of GUI Agents

June 12, 2026 · arXiv
Demo2Tutorial: From Human Experience to Multimodal Software Tutorials

June 2, 2026 · arXiv
STaR-KV: Spatio-Temporal Adaptive Re-weighting for KV Cache Compression in GUI Vision-Language Models

June 1, 2026 · arXiv
GUI-C²: Coarse-to-Fine GUI Grounding via Difficulty-Aware Reinforcement Learning

May 29, 2026 · arXiv
MementoGUI: Learning Agentic Multimodal Memory Control for Long-Horizon GUI Agents

May 18, 2026 · arXiv
Video2GUI: Synthesizing Large-Scale Interaction Trajectories for Generalized GUI Agent Pretraining

May 14, 2026 · arXiv
Executable Agentic Memory for GUI Agent

May 12, 2026 · arXiv
LiteGUI: Distilling Compact GUI Agents with Reinforcement Learning

May 8, 2026 · arXiv
Step-level Optimization for Efficient Computer-use Agents

April 29, 2026 · arXiv
Training Computer Use Agents to Assess the Usability of Graphical User Interfaces

April 28, 2026 · arXiv
AutoGUI-v2: A Comprehensive Multi-Modal GUI Functionality Understanding Benchmark

April 27, 2026 · arXiv
Human-Guided Harm Recovery for Computer Use Agents

April 20, 2026 · arXiv
UI-Zoomer: Uncertainty-Driven Adaptive Zoom-In for GUI Grounding

April 15, 2026 · arXiv