SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents

Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Yantao Li, Jianbing Zhang, Zhiyong Wu

🏛 Institutions: National Key Laboratory for Novel Software Technology, NJU, Shanghai AI Laboratory
📅 Date: January 17, 2024
📑 Publisher: ACL 2024
💻 Env: Desktop, Mobile, Web
🔑 Keywords: benchmark, dataset, GUI grounding, grounding pre-training, ScreenSpot, SeeClick

TLDR

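SeeClick is a visual GUI agent that operates on screenshots alone instead of structured representations such as HTML. It targets GUI grounding, the task of locating a screen element from a natural-language instruction, and strengthens it through grounding pre-training. The paper also automates the curation of GUI grounding data and introduces ScreenSpot, a grounding benchmark spanning mobile, desktop, and web environments.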
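
A grounding benchmark of this kind is commonly scored by click accuracy: a prediction counts as correct when the predicted click point falls inside the target element's bounding box. The sketch below illustrates that idea only. The function names, the tuple layout, and the use of coordinates normalized to [0, 1] are assumptions made for illustration, not the paper's evaluation code.

```python
from typing import Sequence, Tuple

# Hypothetical data layout (an assumption for this sketch):
# a point is (x, y); a box is (left, top, right, bottom).
# Coordinates are assumed normalized to the range [0, 1].
Point = Tuple[float, float]
Box = Tuple[float, float, float, float]


def point_in_box(point: Point, box: Box) -> bool:
    """Return True when the point lies inside the box, edges included."""
    x, y = point
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom


def click_accuracy(predictions: Sequence[Point], targets: Sequence[Box]) -> float:
    """Fraction of predicted click points that land inside their target box."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not predictions:
        return 0.0
    hits = sum(point_in_box(p, b) for p, b in zip(predictions, targets))
    return hits / len(predictions)


if __name__ == "__main__":
    # Two toy samples: the first click hits its target, the second misses.
    preds = [(0.5, 0.5), (0.9, 0.9)]
    boxes = [(0.4, 0.4, 0.6, 0.6), (0.0, 0.0, 0.2, 0.2)]
    print(click_accuracy(preds, boxes))
```

Scoring by point containment rather than box overlap matches how a click is executed: any point inside the element triggers it, so the exact shape of a predicted region does not matter.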

Related papers (24)

VenusBench-GD: A Comprehensive Multi-Platform GUI Benchmark for Diverse Grounding Tasks
December 18, 2025 · arXiv

Ponder & Press: Advancing Visual GUI Agent towards General Computer Control
December 2, 2024 · Findings of ACL 2025

OS-ATLAS: A Foundation Action Model for Generalist GUI Agents
October 30, 2024 · ICLR 2025 (Spotlight)

TinyClick: Single-Turn Agent for Empowering GUI Automation
October 9, 2024 · INTERSPEECH 2025

Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents
October 7, 2024 · ICLR 2025 (Oral)

GUI-World: A Video Benchmark and Dataset for Multimodal GUI-oriented Understanding
June 16, 2024 · ICLR 2025 (Poster)

Beyond Clicking: A Step Towards Generalist GUI Grounding via Text Dragging
November 7, 2025 · arXiv

NaturalGAIA: Pushing the Frontiers of GUI Agents with a Challenging Benchmark and High-Quality Trajectory Dataset
August 2, 2025 · arXiv

Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis
May 19, 2025 · NeurIPS 2025 Datasets and Benchmarks Track (Spotlight)

UI-E2I-Synth: Advancing GUI Grounding with Large-Scale Instruction Synthesis
April 15, 2025 · Findings of ACL 2025

GUI Action Narrator: Where and When Did That Action Take Place?
June 19, 2024 · arXiv

OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web
February 29, 2024 · ECCV 2024 (Poster)

WebForge: Breaking the Realism-Reproducibility-Scalability Trilemma in Browser Agent Benchmark
April 13, 2026 · arXiv

Gym-Anything: Turn any Software into an Agent Environment
April 7, 2026 · arXiv

WebArena-Infinity: Generating Browser Environments with Verifiable Tasks at Scale
March 2026 · Blog Post

PSPA-Bench: A Personalized Benchmark for Smartphone GUI Agent
March 31, 2026 · arXiv

SecAgent: Efficient Mobile GUI Agent with Semantic Context
March 9, 2026 · arXiv

WebChain: A Large-Scale Human-Annotated Dataset of Real-World Web Interaction Traces
March 5, 2026 · arXiv

Turing Test on Screen: A Benchmark for Mobile GUI Agent Humanization
February 24, 2026 · arXiv

AmbiBench: Benchmarking Mobile GUI Agents Beyond One-Shot Instructions in the Wild
February 12, 2026 · arXiv

When Actions Go Off-Task: Detecting and Correcting Misaligned Actions in Computer-Use Agents
February 9, 2026 · arXiv

MemGUI-Bench: Benchmarking Memory of Mobile GUI Agents in Dynamic Environments
February 3, 2026 · arXiv

SwipeGen: Bridging the Execution Gap in GUI Agents via Human-like Swipe Synthesis
January 26, 2026 · arXiv

SMAN-Bench: A Cross-System Benchmark for Mobile Agents under Single- and Multi-path, Ambiguous, and Noisy Tasks
January 26, 2026 · ICLR 2026 (Poster)