UGround

Navigating the Digital World as Humans Do:
Universal Visual Grounding for GUI Agents


The Ohio State University      Orby AI


UGround is a universal visual grounding model that locates the target element of an action via pixel coordinates on the screen. We propose SeeAct-V, a generic framework for building GUI agents with UGround, and achieve SOTA prompt-only performance with GPT-4/4o on the following benchmarks (a minimal execution sketch follows the list):

  • ScreenSpot (GUI Grounding)
  • Multimodal-Mind2Web
  • AndroidControl
  • OmniACT
  • Mind2Web-Live (Online)
  • AndroidWorld (Online)
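
For concreteness, the following minimal sketch (referenced above) illustrates the execution side of such an agent: once UGround predicts where an element is on a screenshot, the action is carried out purely from that (x, y) point. The rescaling helper, the function names, and the use of pyautogui as the OS-level executor are illustrative assumptions, not part of the released UGround code.

"""Illustrative sketch: acting on a UGround-style pixel prediction.

Assumptions (not the released code): the model saw a resized screenshot of
size `model_size`, and pyautogui stands in for the OS-level action executor.
"""
import pyautogui  # OS-level mouse control


def to_screen_coords(x: int, y: int,
                     model_size: tuple[int, int],
                     screen_size: tuple[int, int]) -> tuple[int, int]:
    """Rescale a predicted point from the model's input resolution to the screen."""
    model_w, model_h = model_size
    screen_w, screen_h = screen_size
    return round(x * screen_w / model_w), round(y * screen_h / model_h)


def click_predicted_element(pred_xy: tuple[int, int],
                            model_size: tuple[int, int]) -> None:
    """Click the on-screen location of the predicted element.

    No HTML or accessibility tree is involved: the only inputs are the
    screenshot the model saw and the pixel coordinates it returned.
    """
    screen_w, screen_h = pyautogui.size()
    x, y = to_screen_coords(*pred_xy, model_size, (screen_w, screen_h))
    pyautogui.click(x, y)

If the grounding model already returns coordinates in the native screen resolution, the rescaling step reduces to an identity.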

Updates

  • 2024/8/6: Website is live. The initial paper and evaluation results are available. This is a work in progress; the evaluation results and model checkpoints will be updated in later versions.

Method

Overview


Multimodal large language models (MLLMs) are transforming the capabilities of graphical user interface (GUI) agents, facilitating their transition from controlled simulations to complex, real-world applications across various platforms. However, the effectiveness of these agents hinges significantly on the robustness of their grounding mechanisms. Prevalent GUI agents predominantly utilize text-based inputs such as HTML or accessibility trees, which, despite their utility, often introduce noise, incompleteness, and increased computational overhead. In this paper, we propose SeeAct-V, a generic vision-only framework for building GUI agents. It involves an MLLM as a planner to determine the next action, as well as a grounding model to retrieve the coordinates of target elements from screenshots. We introduce UGround, a universal pixel-level visual grounding model developed specifically for GUIs. This model, trained on 1.3 million diverse samples, is designed to ground open-ended element descriptions directly via pixel coordinates, and to function across different operating systems. Our comprehensive evaluation across six benchmarks, including desktop, mobile, and web platforms, demonstrates that UGround not only outperforms existing visual grounding models, but also matches or exceeds the performance of state-of-the-art methods that rely on HTML or accessibility trees. These results underscore UGround's practical value in significantly advancing the field of vision-based GUI agents, illustrating its ability to navigate digital environments with human-like perception and precision.
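
The following sketch spells out this two-stage design as a simple control loop: a planner MLLM proposes the next action as text, UGround resolves the referenced element to pixel coordinates, and the action is executed in the environment. The function names, the step dictionary format, and the env interface are illustrative assumptions rather than the released implementation.

"""SeeAct-V-style control loop (illustrative sketch, not the released code).

Assumptions: plan_next_action wraps an MLLM planner (e.g., GPT-4o) and ground
wraps the UGround checkpoint; their signatures, the step format, and the env
interface are hypothetical.
"""
from PIL import Image


def plan_next_action(task: str, screenshot: Image.Image, history: list[str]) -> dict:
    """Ask the planner for the next step, e.g.
    {"action": "CLICK", "element": "the blue 'Sign in' button", "value": None}."""
    raise NotImplementedError("call the planner MLLM here")


def ground(screenshot: Image.Image, element_description: str) -> tuple[int, int]:
    """Ask UGround for the pixel coordinates of the described element."""
    raise NotImplementedError("run UGround inference here")


def run_episode(task: str, env, max_steps: int = 15) -> None:
    """Vision-only loop: screenshot -> plan -> ground -> act; no HTML or a11y tree."""
    history: list[str] = []
    for _ in range(max_steps):
        screenshot = env.screenshot()                       # raw pixels only
        step = plan_next_action(task, screenshot, history)
        if step["action"] == "STOP":
            break
        x, y = ground(screenshot, step["element"])          # element -> (x, y)
        env.execute(step["action"], x, y, step.get("value"))
        history.append(f'{step["action"]}: {step["element"]}')

The point the sketch captures is that the planner only ever refers to elements in natural language; all localization is delegated to the grounding model, so the agent needs nothing beyond screenshots.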

Offline Experiments

We evaluate UGround on ScreenSpot, a recent visual grounding benchmark for GUIs. Several multimodal large language models, including LLaVA, GPT-4, and GPT-4o, are used as planners.


UGround demonstrates strong performance with or without a separate planning model. Combined with GPT-4o as the planner, UGround achieves new SOTA results on ScreenSpot in every dimension.

We also evaluate UGround on agent benchmarks, including Multimodal-Mind2Web, AndroidControl, and OmniACT. The results further show that UGround enables existing multimodal large language models to achieve strong performance by relying solely on visual inputs, without HTML or accessibility trees.

AndroidControl

Multimodal-Mind2Web

Online Experiments

To further evaluate the approach in more realistic settings, we test UGround in online web and mobile environments using Mind2Web-Live and AndroidWorld.

AndroidWorld

Mind2Web-Live