Visual Grounding for User Interfaces
Yijun Qian, Yujie Lu, Alexander Hauptmann, Oriana Riva
- 🏛 Institutions
- CMU, UC Santa Barbara, Google Research
- 📅 Date
- June 16, 2024
- 📑 Publisher
- NAACL 2024 Industry Track
- 💻 Env
- General GUI
TLDR
This paper defines visual UI grounding, where a model must localize the UI element referenced by a natural-language command directly from a screenshot without relying on UI metadata. It proposes LVG, which combines layout-guided contrastive learning with synthetic-to-real multi-context learning and improves top-1 accuracy by more than 4.9 points over strong baselines.
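The paper's loss formulation is not reproduced here, but the core idea behind layout-guided contrastive learning can be sketched as an InfoNCE-style objective: the command embedding is pulled toward the embedding of the referenced UI element and pushed away from the other elements detected on the same screenshot. The PyTorch sketch below is a minimal illustration under those assumptions; the function name, tensor shapes, and temperature are hypothetical, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def ui_grounding_contrastive_loss(command_emb: torch.Tensor,
                                  element_embs: torch.Tensor,
                                  target_idx: int,
                                  temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style loss for grounding one command to one UI element.

    command_emb:  (d,)   embedding of the natural-language command.
    element_embs: (n, d) embeddings of the n candidate elements on the
                  screenshot; in a layout-guided setup these would encode
                  each element's visual crop together with its on-screen
                  position.
    target_idx:   index of the element the command actually refers to.
    """
    command_emb = F.normalize(command_emb, dim=-1)
    element_embs = F.normalize(element_embs, dim=-1)
    # Cosine similarity between the command and every candidate element.
    logits = element_embs @ command_emb / temperature  # shape: (n,)
    # The referenced element is the positive; every other element on the
    # same screenshot serves as a hard in-screen negative.
    target = torch.tensor([target_idx])
    return F.cross_entropy(logits.unsqueeze(0), target)
```

A training step would average this loss over a batch of (screenshot, command) pairs; mixing synthetic and real screens in that batch would be one plausible reading of the paper's synthetic-to-real multi-context learning.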
Related papers
- Iris: Breaking GUI Complexity with Adaptive Focus and Self-Refining · December 13, 2024 · arXiv
- Dual-View Visual Contextualization for Web Navigation · February 6, 2024 · CVPR 2024 (Poster)
- UI-Zoomer: Uncertainty-Driven Adaptive Zoom-In for GUI Grounding · April 15, 2026 · arXiv
- GUI-Perturbed: Domain Randomization Reveals Systematic Brittleness in GUI Grounding Models · April 15, 2026 · arXiv
- See, Point, Refine: Multi-Turn Approach to GUI Grounding with Visual Feedback · April 14, 2026 · arXiv
- CocoaBench: Evaluating Unified Digital Agents in the Wild · April 13, 2026 · arXiv