OmniParser for Pure Vision Based GUI Agent

Yadong Lu, Jianwei Yang, Yelong Shen, Ahmed Awadallah

🏛 Institutions: MSR, Microsoft GenAI
📅 Date: August 1, 2024
📑 Publisher: arXiv
💻 Env: General GUI
🔑 Keywords: dataset, screen parsing, GUI grounding, icon detection, icon captioning, OmniParser

TLDR

OmniParser parses UI screenshots into structured screen elements by combining interactable icon detection with element captioning. The paper also curates icon-related datasets and shows that this screen parsing layer improves GPT-4V grounding on ScreenSpot, Mind2Web, and AITW.
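As a rough illustration of what "structured screen elements" could look like, the sketch below pairs detected bounding boxes with captions and renders them as a numbered text list that a vision-language model might be prompted with. It is a minimal sketch under stated assumptions, not OmniParser's actual code or API: the names `ScreenElement`, `parse_screen`, `to_prompt`, and `click_point` are hypothetical, and the boxes and captions are assumed to come from a separate detector and captioner.

```python
from dataclasses import dataclass


@dataclass
class ScreenElement:
    """Hypothetical container for one parsed UI element (not OmniParser's schema)."""
    element_id: int
    bbox: tuple  # (x1, y1, x2, y2) in pixels, assumed to come from an icon detector
    caption: str  # short description, assumed to come from a captioning model


def parse_screen(boxes, captions):
    """Pair each detected box with its caption and assign a numeric ID."""
    if len(boxes) != len(captions):
        raise ValueError("each detected box needs exactly one caption")
    return [
        ScreenElement(i, tuple(box), caption)
        for i, (box, caption) in enumerate(zip(boxes, captions))
    ]


def to_prompt(elements):
    """Render the elements as a numbered list suitable for a text prompt."""
    return "\n".join(f"[{e.element_id}] {e.caption}" for e in elements)


def click_point(element):
    """Return the center of the element's box as a candidate click location."""
    x1, y1, x2, y2 = element.bbox
    return ((x1 + x2) / 2, (y1 + y2) / 2)


if __name__ == "__main__":
    elements = parse_screen(
        [(10, 20, 50, 60), (100, 100, 140, 120)],
        ["settings gear icon", "search button"],
    )
    print(to_prompt(elements))
    print(click_point(elements[1]))
```

With a representation like this, a downstream model can be asked to choose an element ID instead of predicting raw pixel coordinates, and the chosen ID maps back to a box center for the click.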

Related papers (24)

Moving Beyond Sparse Grounding with Complete Screen Parsing Supervision
February 15, 2026 · arXiv

Beyond Clicking: A Step Towards Generalist GUI Grounding via Text Dragging
November 7, 2025 · arXiv

Scaling Computer‑Use Grounding via User Interface Decomposition and Synthesis
May 19, 2025 · NeurIPS 2025 Datasets and Benchmarks Track (Spotlight)

UI-E2I-Synth: Advancing GUI Grounding with Large-Scale Instruction Synthesis
April 15, 2025 · Findings of ACL 2025

EDGE: Enhanced Grounded GUI Understanding with Enriched Multi-Granularity Synthetic Data
October 25, 2024 · arXiv

OS-ATLAS: A Foundation Action Model for Generalist GUI Agents
October 30, 2024 · ICLR 2025 (Spotlight)

Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents
October 7, 2024 · ICLR 2025 (Oral)

MobileViews: A Million-scale and Diverse Mobile GUI Dataset
September 22, 2024 · arXiv

SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents
January 17, 2024 · ACL 2024

GUI-C²: Coarse-to-Fine GUI Grounding via Difficulty-Aware Reinforcement Learning
May 29, 2026 · arXiv

Video2GUI: Synthesizing Large-Scale Interaction Trajectories for Generalized GUI Agent Pretraining
May 14, 2026 · arXiv

UI-Zoomer: Uncertainty-Driven Adaptive Zoom-In for GUI Grounding
April 15, 2026 · arXiv

GUI-Perturbed: Domain Randomization Reveals Systematic Brittleness in GUI Grounding Models
April 15, 2026 · arXiv

See, Point, Refine: Multi-Turn Approach to GUI Grounding with Visual Feedback
April 14, 2026 · arXiv

What's Missing in Screen-to-Action? Towards a UI-in-the-Loop Paradigm for Multimodal GUI Reasoning
April 8, 2026 · Findings of ACL 2026

Towards GUI Agents: Vision-Language Diffusion Models for GUI Grounding
March 27, 2026 · CVPR 2026

AdaZoom-GUI: Adaptive Zoom-based GUI Grounding with Instruction Refinement
March 18, 2026 · arXiv

Zoom to Essence: Trainless GUI Grounding by Inferring upon Interface Elements
March 15, 2026 · arXiv

Trifuse: Enhancing Attention-Based GUI Grounding via Multimodal Fusion
February 6, 2026 · arXiv

POINTS-GUI-G: GUI-Grounding Journey
February 6, 2026 · arXiv

SSL: Sweet Spot Learning for Differentiated Guidance in Agentic Optimization
January 30, 2026 · arXiv

GUIGuard: Toward a General Framework for Privacy-Preserving GUI Agents
January 26, 2026 · arXiv

V2P: Visual Attention Calibration for GUI Grounding via Background Suppression and Center Peaking
January 11, 2026 · arXiv

MVP: Multiple View Prediction Improves GUI Grounding
December 9, 2025 · arXiv