GUI Agents Papers

OmniParser for Pure Vision Based GUI Agent

Yadong Lu, Jianwei Yang, Yelong Shen, Ahmed Awadallah

🏛 Institutions
MSR, Microsoft GenAI
📅 Date
August 1, 2024
📑 Publisher
arXiv
💻 Env
General GUI
🔑 Keywords
TLDR

OmniParser parses UI screenshots into structured screen elements by combining interactable icon detection with element captioning. The paper also curates an interactable icon detection dataset and an icon description dataset, and shows that adding this screen parsing layer improves GPT-4V grounding on ScreenSpot, Mind2Web, and AITW.
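
To make "structured screen elements" concrete, here is a minimal sketch of what such a parsing layer could hand to a vision-language model. It is not OmniParser's code or API: `ScreenElement`, `parse_screen`, `to_prompt`, `center`, and the caption callback are all hypothetical names, and the detector and captioner are replaced by hard-coded stand-ins.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Pixel bounding box as (x1, y1, x2, y2). Assumed convention, not from the paper.
Box = Tuple[int, int, int, int]


@dataclass
class ScreenElement:
    element_id: int  # numeric label the model can refer to
    box: Box         # where the element sits on the screenshot
    caption: str     # short description of what the element does


def parse_screen(boxes: List[Box], caption_fn: Callable[[Box], str]) -> List[ScreenElement]:
    """Pair each detected box with a caption and assign IDs in reading order."""
    # Sort top-to-bottom, then left-to-right, so IDs are stable across runs.
    ordered = sorted(boxes, key=lambda b: (b[1], b[0]))
    return [ScreenElement(i, box, caption_fn(box)) for i, box in enumerate(ordered)]


def to_prompt(elements: List[ScreenElement]) -> str:
    """Serialize the elements as text that can accompany the screenshot."""
    return "\n".join(f"[{e.element_id}] {e.caption} at {e.box}" for e in elements)


def center(box: Box) -> Tuple[int, int]:
    """Click point for an element: the integer center of its box."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) // 2, (y1 + y2) // 2)


if __name__ == "__main__":
    # Stand-ins for a real detector and captioner (hypothetical data).
    detected = [(200, 10, 240, 50), (10, 10, 50, 50)]
    captions = {(200, 10, 240, 50): "settings gear", (10, 10, 50, 50): "back arrow"}

    elements = parse_screen(detected, lambda b: captions[b])
    print(to_prompt(elements))
    print("click", center(elements[0].box))
```

With a representation like this, the model only has to pick an element ID rather than predict raw pixel coordinates. The ID is then mapped back to a click point through the stored box.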

Related papers