UGround

Navigating the Digital World as Humans Do:
Universal Visual Grounding for GUI Agents


The Ohio State University      Orby AI

UGround is a universal visual grounding model that locates the target element of an action by its pixel coordinates on GUIs. It is trained on 10M elements from 1.3M screenshots and substantially outperforms previous SOTA GUI visual grounding models on ScreenSpot (web, mobile, desktop).
Unlike prevalent approaches that rely on HTML or accessibility trees for observation or grounding, we propose a generic framework, SeeAct-V, that perceives GUIs entirely visually and takes pixel-level operations on screens. SeeAct-V agents with UGround achieve SOTA performance on five benchmarks, spanning offline agent evaluation (web, mobile, desktop) and online agent evaluation (web, mobile).
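As a concrete illustration of what one SeeAct-V step looks like, here is a minimal, hypothetical sketch: a planner proposes an action with a natural-language description of the target element, a grounding model such as UGround resolves that description to pixel coordinates, and the action is executed directly on the screen. The helper functions plan_next_action and ground_element are placeholders for illustration, not part of any released API.

    # Hypothetical sketch of one SeeAct-V step: observe pixels, plan, ground, act.
    import pyautogui  # pixel-level mouse/keyboard control


    def plan_next_action(screenshot_path: str, task: str) -> dict:
        """Placeholder planner call (e.g., GPT-4o): returns an action such as
        {"op": "click", "element": "the search button next to the address bar"}."""
        raise NotImplementedError


    def ground_element(screenshot_path: str, description: str) -> tuple[int, int]:
        """Placeholder grounding call (e.g., UGround): maps a referring expression
        to (x, y) pixel coordinates on the screenshot."""
        raise NotImplementedError


    def run_step(task: str) -> None:
        screenshot_path = "screen.png"
        pyautogui.screenshot(screenshot_path)            # observation: pixels only
        action = plan_next_action(screenshot_path, task)
        if action["op"] == "click":
            x, y = ground_element(screenshot_path, action["element"])
            pyautogui.click(x, y)                        # action: pixel-level operation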


Updates

  • 2024/11/04: Initial results of the Qwen2-VL-based UGround on ScreenSpot are released. UGround-v1.1 is coming.
  • 2024/10/07: The preprint is released on arXiv.
  • 2024/08/06: The website is live. The initial manuscript and results are available.

SeeAct-V and UGround

Overview


In this work, we advocate a human-like embodiment for GUI agents that perceive the environment entirely visually and directly take pixel-level operations on the GUI (the SeeAct-V framework). The key is visual grounding models that can accurately map diverse referring expressions of GUI elements to their coordinates on the GUI across different platforms.
We show that a simple recipe, combining web-based synthetic data with a slight adaptation of the LLaVA architecture, is surprisingly effective for training such visual grounding models. We collect the largest GUI visual grounding dataset to date, containing 10M GUI elements (~95% from the web) and their referring expressions over 1.3M screenshots, and use it to train UGround, a strong universal visual grounding model for GUI agents.
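To make the grounding interface concrete, below is a rough sketch of querying a Qwen2-VL-based UGround checkpoint (the variant mentioned in the updates above) with the Hugging Face transformers library. The checkpoint name and the prompt wording are assumptions for illustration, not the authors' exact inference code; consult the released model card for the precise format.

    # Illustrative only: the model ID and prompt format below are assumptions.
    import torch
    from PIL import Image
    from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

    MODEL_ID = "osunlp/UGround-V1-7B"  # assumed checkpoint name; check the official release

    processor = AutoProcessor.from_pretrained(MODEL_ID)
    model = Qwen2VLForConditionalGeneration.from_pretrained(
        MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
    )

    image = Image.open("screenshot.png")
    query = "the settings gear icon in the top-right corner"  # referring expression

    messages = [{
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": f"Locate the element described as: {query}"},
        ],
    }]
    prompt = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

    output_ids = model.generate(**inputs, max_new_tokens=32)
    answer = processor.batch_decode(
        output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )[0]
    print(answer)  # expected to contain the predicted (x, y) coordinates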


Offline Experiments

The high-quality grounding data synthesized from the web (9M elements from Web-Hybrid) effectively helps UGround generalize to desktop and mobile UIs: UGround outperforms the previous SOTA, SeeClick, on every platform and element type on ScreenSpot.
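For reference, ScreenSpot scores grounding as point-in-box accuracy: a prediction counts as correct when the predicted click point falls inside the ground-truth bounding box of the target element. A minimal sketch of that metric (the field layout is illustrative, not ScreenSpot's exact schema):

    def is_hit(pred_xy: tuple[float, float],
               bbox: tuple[float, float, float, float]) -> bool:
        """bbox is (left, top, right, bottom) in the same pixel space as pred_xy."""
        x, y = pred_xy
        left, top, right, bottom = bbox
        return left <= x <= right and top <= y <= bottom


    def accuracy(predictions, ground_truth_boxes) -> float:
        """Fraction of predicted click points that land inside their target boxes."""
        hits = sum(is_hit(p, b) for p, b in zip(predictions, ground_truth_boxes))
        return hits / len(ground_truth_boxes)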

ScreenSpot (Standard Setting)

Grounding Model   | Mobile Text | Mobile Icon/Widget | Desktop Text | Desktop Icon/Widget | Web Text     | Web Icon/Widget | Average
MiniGPT-v2        | 8.4         | 6.6                | 6.2          | 2.9                 | 6.5          | 3.4             | 5.7
Qwen-VL           | 9.5         | 4.8                | 5.7          | 5.0                 | 3.5          | 2.4             | 5.2
Groma             | 10.3        | 2.6                | 4.6          | 4.3                 | 5.7          | 3.4             | 5.1
GPT-4             | 22.6        | 24.5               | 20.2         | 11.8                | 9.2          | 8.8             | 16.2
GPT-4o            | 20.2        | 24.9               | 21.1         | 23.6                | 12.2         | 7.8             | 18.3
Fuyu              | 41.0        | 1.3                | 33.0         | 3.6                 | 33.9         | 4.4             | 19.5
CogAgent          | 67.0        | 24.0               | 74.2         | 20.0                | 70.4         | 28.6            | 47.4
SeeClick          | 78.0        | 52.0               | 72.2         | 30.0                | 55.7         | 32.5            | 53.4
UGround-v1-LLaVA  | 82.8 (+4.8) | 60.3 (+8.3)        | 82.5 (+10.3) | 63.6 (+33.6)        | 80.4 (+24.7) | 70.4 (+37.9)    | 73.3 (+19.9)

ScreenSpot (Agent Setting)

Planner | Grounding        | Mobile Text  | Mobile Icon/Widget | Desktop Text | Desktop Icon/Widget | Web Text     | Web Icon/Widget | Average
GPT-4   | SeeClick         | 76.6         | 55.5               | 68.0         | 28.6                | 40.9         | 23.3            | 48.8
GPT-4   | UGround          | 90.1 (+13.5) | 70.3 (+14.8)       | 87.1 (+19.1) | 55.7 (+27.1)        | 85.7 (+44.8) | 64.6 (+41.3)    | 75.6 (+26.8)
GPT-4o  | SeeClick         | 81.0         | 59.8               | 69.6         | 33.6                | 43.9         | 26.2            | 52.3
GPT-4o  | OS-Atlas-7B      | 93.8         | 79.9               | 90.2         | 66.4                | 92.6         | 79.1            | 85.4
GPT-4o  | UGround-v1-LLaVA | 93.4 (+12.4) | 76.9 (+17.1)       | 92.8 (+23.2) | 67.9 (+34.3)        | 88.7 (+44.8) | 68.9 (+42.7)    | 81.4 (+29.1)
GPT-4o  | UGround-v1-Qwen  | 96.9 (+15.9) | 82.3 (+22.5)       | 93.8 (+24.2) | 75.8 (+42.2)        | 93.8 (+49.9) | 73.4 (+47.2)    | 85.9 (+33.6)

Multimodal-Mind2Web (Element Accuracy)
Input            | Planner | Grounding | Cross-Task | Cross-Website | Cross-Domain | Average
Image + Text     | GPT-4   | Choice    | 46.4       | 38.0          | 42.4         | 42.3
Image + Text     | GPT-4   | SoM       | 29.6       | 20.1          | 27.0         | 25.6
Image (SeeAct-V) | GPT-4   | SeeClick  | 29.7       | 28.5          | 30.7         | 29.6
Image (SeeAct-V) | GPT-4   | UGround   | 45.1       | 44.7          | 44.6         | 44.8
Image (SeeAct-V) | GPT-4o  | SeeClick  | 32.1       | 33.1          | 33.5         | 32.9
Image (SeeAct-V) | GPT-4o  | UGround   | 47.7       | 46.0          | 46.6         | 46.8

AndroidControl

Input            | Planner | Grounding | Step Accuracy (High) | Step Accuracy (Low)
Text             | GPT-4   | Choice    | 42.1                 | 55.0
Image (SeeAct-V) | GPT-4   | SeeClick  | 39.4                 | 47.2
Image (SeeAct-V) | GPT-4   | UGround   | 46.2                 | 58.0
Image (SeeAct-V) | GPT-4o  | SeeClick  | 41.8                 | 52.8
Image (SeeAct-V) | GPT-4o  | UGround   | 48.4                 | 62.4

OmniACT
Inputs           | Planner | Grounding | Action Score
Text             | GPT-4   | DetACT    | 11.6
Image + Text     | GPT-4   | DetACT    | 17.0
Image (SeeAct-V) | GPT-4   | SeeClick  | 28.9
Image (SeeAct-V) | GPT-4   | UGround   | 31.1
Image (SeeAct-V) | GPT-4o  | SeeClick  | 29.6
Image (SeeAct-V) | GPT-4o  | UGround   | 32.8

Online Experiments

Mind2Web-Live
Inputs           | Planner | Grounding | Completion Rate | Task Success Rate
Text             | GPT-4   | Choice    | 44.3            | 21.1
Text             | GPT-4o  | Choice    | 47.6            | 22.1
Image (SeeAct-V) | GPT-4   | UGround   | 50.7            | 23.1
Image (SeeAct-V) | GPT-4o  | UGround   | 50.8            | 19.2

AndroidWorld
Input            | Planner | Grounding | Task Success Rate
Text             | GPT-4   | Choice    | 30.6
Image + Text     | GPT-4   | SoM       | 25.4
Image (SeeAct-V) | GPT-4   | UGround   | 31.0
Image (SeeAct-V) | GPT-4o  | UGround   | 32.8

BibTeX


      @article{gou2024uground,
        title={Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents},
        author={Boyu Gou and Ruohan Wang and Boyuan Zheng and Yanan Xie and Cheng Chang and Yiheng Shu and Huan Sun and Yu Su},
        journal={arXiv preprint arXiv:2410.05243},
        year={2024},
        url={https://arxiv.org/abs/2410.05243},
      }