Multimodal large language models (MLLMs) are transforming the capabilities of graphical user interface (GUI) agents, facilitating their transition from controlled simulations to complex, real-world applications across various platforms. However, the effectiveness of these agents hinges significantly on the robustness of their grounding mechanisms. Prevalent GUI agents predominantly utilize text-based inputs such as HTML or accessibility trees, which, despite their utility, often introduce noise, incompleteness, and increased computational overhead. In this paper, we propose SeeAct-V, a generic vision-only framework for building GUI agents. It pairs an MLLM planner, which determines the next action, with a grounding model that retrieves the coordinates of target elements from screenshots. We introduce UGround, a universal pixel-level visual grounding model developed specifically for GUIs. Trained on 1.3 million diverse samples, this model is designed to ground open-ended element descriptions directly via pixel coordinates and to function across different operating systems. Our comprehensive evaluation across six benchmarks, spanning desktop, mobile, and web platforms, demonstrates that UGround not only outperforms existing visual grounding models, but also matches or exceeds the performance of state-of-the-art methods that rely on HTML or accessibility trees. These results underscore UGround's practicality in significantly advancing the field of vision-based GUI agents, illustrating their ability to navigate digital environments with human-like perception and precision.
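The two-stage decomposition above (plan, then ground) can be sketched as a single agent step. This is a minimal illustrative sketch, not the paper's actual API: the names `run_step`, `plan_next_action`-style callables, and the toy planner/grounder below are all placeholders we introduce for exposition.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

@dataclass
class Action:
    description: str  # natural-language description of the target element
    operation: str    # e.g. "CLICK" or "TYPE"

def run_step(
    screenshot: bytes,
    task: str,
    planner: Callable[[bytes, str], Action],
    grounder: Callable[[bytes, str], Tuple[int, int]],
) -> Tuple[str, Tuple[int, int]]:
    """One SeeAct-V-style step: the planner chooses the next action from the
    screenshot alone, then the grounder maps the planner's element
    description to pixel coordinates on the same screenshot."""
    action = planner(screenshot, task)         # MLLM planner (e.g. GPT-4o)
    xy = grounder(screenshot, action.description)  # grounding model (e.g. UGround)
    return action.operation, xy

# Toy stand-ins for the planner and grounder, purely for illustration:
toy_planner = lambda img, task: Action("the search button", "CLICK")
toy_grounder = lambda img, desc: (512, 384)

op, (x, y) = run_step(b"<png bytes>", "search for cats", toy_planner, toy_grounder)
```

The key property of the design is that both stages consume only the screenshot: no HTML or accessibility tree ever enters the loop, so the same step function applies unchanged to web, desktop, and mobile environments.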
We evaluate UGround on ScreenSpot, a recent visual grounding benchmark for GUIs. Several multimodal large language models, including LLaVA, GPT-4, and GPT-4o, are used as planners.
UGround demonstrates strong performance with or without a separate planning model. Paired with GPT-4o, UGround achieves new SOTA results on ScreenSpot across every dimension.
We also evaluate UGround on agent benchmarks, including Multimodal-Mind2Web, AndroidControl, and OmniACT. These results further show that UGround enables existing multimodal large language models to achieve strong performance by relying solely on visual input, without HTML or accessibility trees.
To further assess the approach in more realistic settings, we evaluate UGround in online web and mobile environments using Mind2Web-Live and AndroidWorld.