UGround

Navigating the Digital World as Humans Do:
Universal Visual Grounding for GUI Agents


The Ohio State University, Orby AI

TLDR: 1) Low-cost, scalable, and effective synthetic data for GUI visual grounding; 2) UGround, a SOTA GUI visual grounding model; 3) SeeAct-V, a human-like, vision-only (modular) GUI agent framework; 4) the first demonstration of practical SOTA performance by vision-only GUI agents.

UGround is a universal visual grounding model for locating the element of an action by pixel coordinates on GUIs. It is trained on 10M elements from 1.3M screenshots, and substantially outperforms previous SOTA GUI visual grounding models on ScreenSpot (web, mobile, desktop).
Unlike prevalent approaches that rely on HTML or accessibility trees for observation or grounding, we propose a generic framework, SeeAct-V, that perceives GUIs entirely visually and performs pixel-level operations on the screen. SeeAct-V agents with UGround achieve SOTA performance on five benchmarks spanning offline agent evaluation (web, mobile, desktop) and online agent evaluation (web, mobile).
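The modular, vision-only loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the released implementation: `PlannedAction`, `PixelAction`, `seeact_v_step`, and the two stub callables are hypothetical names introduced here, and the stubs stand in for a real planner (such as GPT-4o) and a real grounding model (such as UGround).

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class PlannedAction:
    """What the planner decides, expressed in natural language (hypothetical type)."""
    operation: str             # e.g. "click" or "type"
    element_description: str   # referring expression for the target element
    value: str = ""            # text to type, if any


@dataclass
class PixelAction:
    """A pixel-level operation on the screen (hypothetical type)."""
    operation: str
    x: int
    y: int
    value: str = ""


# Hypothetical interfaces: both modules see only the screenshot, never HTML
# or an accessibility tree.
Planner = Callable[[bytes, str], PlannedAction]
Grounder = Callable[[bytes, str], Tuple[int, int]]


def seeact_v_step(screenshot: bytes, task: str,
                  planner: Planner, grounder: Grounder) -> PixelAction:
    """One step: plan from the screenshot, then ground the description to pixels."""
    plan = planner(screenshot, task)
    x, y = grounder(screenshot, plan.element_description)
    return PixelAction(plan.operation, x, y, plan.value)


def stub_planner(screenshot: bytes, task: str) -> PlannedAction:
    # Stand-in for a multimodal LLM planner; returns a fixed plan.
    return PlannedAction("click", "the blue 'Search' button")


def stub_grounder(screenshot: bytes, description: str) -> Tuple[int, int]:
    # Stand-in for a grounding model; returns a fixed pixel coordinate.
    return (640, 360)


action = seeact_v_step(b"", "Search for flights", stub_planner, stub_grounder)
print(action)
```

Keeping the planner and the grounder behind separate callables mirrors the modular design: either module can be swapped without touching the other.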

Updates

  • 2025/01/03: Qwen2-VL-based UGround-v1 has been released (2B & 7B), with even stronger SOTA performance on GUI visual grounding.
  • 2024/10/07: Preprint is available on arXiv.
  • 2024/08/06: Website is live. The initial manuscript and results are available.

SeeAct-V and UGround

Overview


In this work, we advocate a human-like embodiment for GUI agents that perceive the environment entirely visually and directly perform pixel-level operations on the GUI (the SeeAct-V framework). The key is a visual grounding model that can accurately map diverse referring expressions of GUI elements to their coordinates on the GUI across different platforms.
We show that a simple recipe, which includes web-based synthetic data and a slight adaptation of the LLaVA architecture, is surprisingly effective for training such visual grounding models. We collect the largest dataset for GUI visual grounding so far, containing 10M GUI elements (~95% from the web) and their referring expressions over 1.3M screenshots, and use it to train UGround, a strong universal visual grounding model for GUI agents.

Offline Experiments

The high-quality grounding data synthesized from the web (9M elements from Web-Hybrid) helps UGround generalize to desktop and mobile UIs, so that UGround outperforms the previous SOTA, SeeClick, on every platform and element type on ScreenSpot.
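For readers unfamiliar with how grounding accuracy is typically scored, here is a minimal sketch. It assumes the common click-in-box criterion, under which a prediction counts as correct when the predicted point falls inside the target element's bounding box; the exact evaluation script is not shown on this page, and the function names and sample data below are hypothetical.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (left, top, right, bottom) in pixels


def is_hit(point: Point, box: Box) -> bool:
    """True if the predicted point lies inside the ground-truth box (edges included)."""
    x, y = point
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom


def click_accuracy(points: Sequence[Point], boxes: Sequence[Box]) -> float:
    """Percentage of predictions that land inside their target boxes."""
    if len(points) != len(boxes):
        raise ValueError("points and boxes must have the same length")
    if not points:
        return 0.0
    hits = sum(is_hit(p, b) for p, b in zip(points, boxes))
    return 100.0 * hits / len(points)


# Made-up example: two of the three predictions fall inside their boxes.
predicted = [(50, 50), (10, 10), (200, 120)]
targets = [(40, 40, 60, 60), (100, 100, 120, 120), (190, 110, 210, 130)]
accuracy = click_accuracy(predicted, targets)
print(f"{accuracy:.1f}")
```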


ScreenSpot (Standard Setting)

| Grounding Model | Mobile Text | Mobile Icon/Widget | Desktop Text | Desktop Icon/Widget | Web Text | Web Icon/Widget | Average |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-4 | 22.6 | 24.5 | 20.2 | 11.8 | 9.2 | 8.8 | 16.2 |
| GPT-4o | 20.2 | 24.9 | 21.1 | 23.6 | 12.2 | 7.8 | 18.3 |
| MiniGPT-v2 | 8.4 | 6.6 | 6.2 | 2.9 | 6.5 | 3.4 | 5.7 |
| Groma | 10.3 | 2.6 | 4.6 | 4.3 | 5.7 | 3.4 | 5.2 |
| Fuyu | 41.0 | 1.3 | 33.0 | 3.6 | 33.9 | 4.4 | 19.5 |
| Qwen-VL | 9.5 | 4.8 | 5.7 | 5.0 | 3.5 | 2.4 | 5.2 |
| SeeClick | 78.0 | 52.0 | 72.2 | 30.0 | 55.7 | 32.5 | 53.4 |
| Qwen-GUI | 52.4 | 10.9 | 45.9 | 5.7 | 43.0 | 13.6 | 28.6 |
| UGround-V1 | 82.8 | 60.3 | 82.5 | 63.6 | 80.4 | 70.4 | 73.3 |
| Qwen2-VL | 61.3 | 39.3 | 52.0 | 45.0 | 33.0 | 21.8 | 42.1 |
| Aguvis-G-7B | 88.3 | 78.2 | 88.1 | 70.7 | 85.7 | 74.8 | 81.0 |
| Aguvis-7B | 95.6 | 77.7 | 93.8 | 67.1 | 88.3 | 75.2 | 83.0 |
| OS-Atlas-Base-4B | 85.7 | 58.5 | 72.2 | 45.7 | 82.6 | 63.1 | 68.0 |
| OS-Atlas-Base-7B | 93.0 | 72.9 | 91.8 | 62.9 | 90.9 | 74.3 | 81.0 |
| ShowUI-G | 91.6 | 69.0 | 81.8 | 59.0 | 83.0 | 65.5 | 75.0 |
| ShowUI | 92.3 | 75.5 | 76.3 | 61.1 | 81.7 | 63.6 | 75.1 |
| Iris | 85.3 | 64.2 | 86.7 | 57.5 | 82.6 | 71.2 | 74.6 |
| Aria-UI | 92.3 | 73.8 | 93.3 | 64.3 | 86.5 | 76.2 | 81.1 |
| UGround-V1-2B (Qwen2-VL) | 89.4 | 72.0 | 88.7 | 65.7 | 81.3 | 68.9 | 77.7 |
| UGround-V1-7B (Qwen2-VL) | 93.0 | 79.9 | 93.8 | 76.4 | 90.9 | 84.0 | 86.3 |
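The Average column in the ScreenSpot (Standard Setting) results appears to be the unweighted mean of the six platform and element-type scores, rounded to one decimal. That reading is an inference from the reported numbers rather than something stated on this page. The short check below reproduces it for two rows; `macro_average` is a hypothetical helper name.

```python
from typing import Sequence


def macro_average(scores: Sequence[float]) -> float:
    """Unweighted mean of per-category scores, rounded to one decimal place."""
    return round(sum(scores) / len(scores), 1)


# Scores copied from the ScreenSpot (Standard Setting) results, in the order:
# Mobile Text, Mobile Icon/Widget, Desktop Text, Desktop Icon/Widget,
# Web Text, Web Icon/Widget.
uground_v1_7b = [93.0, 79.9, 93.8, 76.4, 90.9, 84.0]
seeclick = [78.0, 52.0, 72.2, 30.0, 55.7, 32.5]

print(macro_average(uground_v1_7b))  # 86.3, matching the reported Average
print(macro_average(seeclick))       # 53.4, matching the reported Average
```

Because every category carries equal weight, a model's Icon/Widget scores influence the average as much as its Text scores, regardless of how many samples each category contains.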
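
ScreenSpot (Agent Setting)

| Planner | Grounding | Mobile Text | Mobile Icon/Widget | Desktop Text | Desktop Icon/Widget | Web Text | Web Icon/Widget | Average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-4 | SeeClick | 76.6 | 55.5 | 68.0 | 28.6 | 40.9 | 23.3 | 48.8 |
| GPT-4 | UGround-V1 | 90.1 | 70.3 | 87.1 | 55.7 | 85.7 | 64.6 | 75.6 |
| GPT-4o | Qwen-VL | 21.3 | 21.4 | 18.6 | 10.7 | 9.1 | 5.8 | 14.5 |
| GPT-4o | SeeClick | 81.0 | 59.8 | 69.6 | 33.6 | 43.9 | 26.2 | 52.4 |
| GPT-4o | Qwen-GUI | 67.8 | 24.5 | 53.1 | 16.4 | 50.4 | 18.5 | 38.5 |
| GPT-4o | UGround-V1 | 93.4 | 76.9 | 92.8 | 67.9 | 88.7 | 68.9 | 81.4 |
| GPT-4o | OS-Atlas-Base-4B | 94.1 | 73.8 | 77.8 | 47.1 | 86.5 | 65.3 | 74.1 |
| GPT-4o | OS-Atlas-Base-7B | 93.8 | 79.9 | 90.2 | 66.4 | 92.6 | 79.1 | 83.7 |
| GPT-4o | UGround-V1-2B (Qwen2-VL) | 94.1 | 77.7 | 92.8 | 63.6 | 90.0 | 70.9 | 81.5 |
| GPT-4o | UGround-V1-7B (Qwen2-VL) | 94.1 | 79.9 | 93.3 | 73.6 | 89.6 | 73.3 | 84.0 |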
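
Multimodal-Mind2Web (Element Accuracy)

| Input | Planner | Grounding | Cross-Task | Cross-Website | Cross-Domain | Average |
| --- | --- | --- | --- | --- | --- | --- |
| Image + Text | GPT-4 | Choice | 46.4 | 38.0 | 42.4 | 42.3 |
| Image + Text | GPT-4 | SoM | 29.6 | 20.1 | 27.0 | 25.6 |
| Image (SeeAct-V) | GPT-4 | SeeClick | 29.7 | 28.5 | 30.7 | 29.6 |
| Image (SeeAct-V) | GPT-4 | UGround-V1 | 45.1 | 44.7 | 44.6 | 44.8 |
| Image (SeeAct-V) | GPT-4o | SeeClick | 32.1 | 33.1 | 33.5 | 32.9 |
| Image (SeeAct-V) | GPT-4o | UGround-V1 | 47.7 | 46.0 | 46.6 | 46.8 |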
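
AndroidControl

| Input | Planner | Grounding | Step Accuracy (High) | Step Accuracy (Low) |
| --- | --- | --- | --- | --- |
| Text | GPT-4 | Choice | 42.1 | 55.0 |
| Image (SeeAct-V) | GPT-4 | SeeClick | 39.4 | 47.2 |
| Image (SeeAct-V) | GPT-4 | UGround-V1 | 46.2 | 58.0 |
| Image (SeeAct-V) | GPT-4o | SeeClick | 41.8 | 52.8 |
| Image (SeeAct-V) | GPT-4o | UGround-V1 | 48.4 | 62.4 |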
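
OmniACT

| Inputs | Planner | Grounding | Action Score |
| --- | --- | --- | --- |
| Text | GPT-4 | DetACT | 11.6 |
| Image + Text | GPT-4 | DetACT | 17.0 |
| Image (SeeAct-V) | GPT-4 | SeeClick | 28.9 |
| Image (SeeAct-V) | GPT-4 | UGround-V1 | 31.1 |
| Image (SeeAct-V) | GPT-4o | SeeClick | 29.6 |
| Image (SeeAct-V) | GPT-4o | UGround-V1 | 32.8 |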

Online Experiments

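Mind2Web-Live

| Inputs | Planner | Grounding | Completion Rate | Task Success Rate |
| --- | --- | --- | --- | --- |
| Text | GPT-4 | Choice | 44.3 | 21.1 |
| Text | GPT-4o | Choice | 47.6 | 22.1 |
| Image (SeeAct-V) | GPT-4 | UGround-V1 | 50.7 | 23.1 |
| Image (SeeAct-V) | GPT-4o | UGround-V1 | 50.8 | 19.2 |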
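
AndroidWorld

| Input | Planner | Grounding | Task Success Rate |
| --- | --- | --- | --- |
| Text | GPT-4 | Choice | 30.6 |
| Image + Text | GPT-4 | SoM | 25.4 |
| Image (SeeAct-V) | GPT-4 | UGround-V1 | 31.0 |
| Image (SeeAct-V) | GPT-4o | UGround-V1 | 32.8 |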

BibTeX


@article{gou2024uground,
title = {Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents},
author = {Boyu Gou and Ruohan Wang and Boyuan Zheng and Yanan Xie and Cheng Chang and Yiheng Shu and Huan Sun and Yu Su},
journal = {arXiv preprint arXiv:2410.05243},
year = {2024},
url = {https://arxiv.org/abs/2410.05243},
}
@InProceedings{zheng2024seeact,
title = {{GPT}-4{V}(ision) is a Generalist Web Agent, if Grounded},
author = {Zheng, Boyuan and Gou, Boyu and Kil, Jihyung and Sun, Huan and Su, Yu},
booktitle = {Proceedings of the 41st International Conference on Machine Learning},
pages = {61349--61385},
year = {2024},
}