Multimodal large language models (MLLMs) are transforming the capabilities of graphical user interface (GUI) agents, facilitating their transition from controlled simulations to complex, real-world applications across various platforms. However, the effectiveness of these agents hinges significantly on the robustness of their grounding mechanisms. Prevalent GUI agents predominantly utilize text-based inputs such as HTML or accessibility trees, which, despite their utility, often introduce noise, incompleteness, and increased computational overhead. In this paper, we propose SeeAct-V, a generic vision-only framework for building GUI agents. It pairs an MLLM planner, which determines the next action, with a grounding model that retrieves the coordinates of target elements from screenshots. We introduce UGround, a universal pixel-level visual grounding model developed specifically for GUIs. Trained on 1.3 million diverse samples, this model is designed to ground open-ended element descriptions directly via pixel coordinates and to function across different operating systems. Our comprehensive evaluation across six benchmarks, spanning desktop, mobile, and web platforms, demonstrates that UGround not only outperforms existing visual grounding models, but also matches or exceeds the performance of state-of-the-art methods that rely on HTML or accessibility trees. These results underscore UGround's practicality in significantly advancing the field of vision-based GUI agents, illustrating their ability to navigate digital environments with human-like perception and precision.
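The two-stage decomposition above (plan, then ground) can be sketched as a single agent step. This is a minimal illustrative sketch, not the paper's actual API: the names `run_step`, `plan_next_action`-style callables, and the toy planner/grounder below are all placeholders we introduce for exposition.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

@dataclass
class Action:
    description: str  # natural-language description of the target element
    operation: str    # e.g. "CLICK" or "TYPE"

def run_step(
    screenshot: bytes,
    task: str,
    planner: Callable[[bytes, str], Action],
    grounder: Callable[[bytes, str], Tuple[int, int]],
) -> Tuple[str, Tuple[int, int]]:
    """One SeeAct-V-style step: the planner chooses the next action from the
    screenshot alone, then the grounder maps the planner's element
    description to pixel coordinates on the same screenshot."""
    action = planner(screenshot, task)         # MLLM planner (e.g. GPT-4o)
    xy = grounder(screenshot, action.description)  # grounding model (e.g. UGround)
    return action.operation, xy

# Toy stand-ins for the planner and grounder, purely for illustration:
toy_planner = lambda img, task: Action("the search button", "CLICK")
toy_grounder = lambda img, desc: (512, 384)

op, (x, y) = run_step(b"<png bytes>", "search for cats", toy_planner, toy_grounder)
```

The key property of the design is that both stages consume only the screenshot: no HTML or accessibility tree ever enters the loop, so the same step function applies unchanged to web, desktop, and mobile environments.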
We evaluate UGround on ScreenSpot, a recent visual grounding benchmark for GUIs. Several multimodal large language models, including LLaVA, GPT-4, and GPT-4o, are used as planners.
UGround demonstrates strong performance with or without a separate planning model. Paired with GPT-4o, UGround achieves new SOTA results on ScreenSpot across every dimension.
We also evaluate UGround on agent benchmarks, including Multimodal-Mind2Web, AndroidControl, and OmniACT. These results further show that UGround enables existing multimodal large language models to achieve strong performance by relying solely on visual input, without HTML or accessibility trees.
To further assess the approach in more realistic settings, we evaluate UGround in online web and mobile environments using Mind2Web-Live and AndroidWorld.