SpiritSight Agent: Advanced GUI Agent with One Look

Zhiyuan Huang, Ziming Cheng, Junting Pan, Zhaohui Hou, Mingjie Zhan

🏛 Institutions: SenseTime Research, Beijing University of Posts and Telecommunications, MMLab, CUHK
📅 Date: March 5, 2025
📑 Publisher: CVPR 2025 (Poster)
💻 Env: Desktop, Mobile, Web
🔑 Keywords: model, dataset, GUI-Lasagne, Universal Block Parsing, single-screenshot inference, SpiritSight

TLDR

SpiritSight is an end-to-end GUI agent designed to act from a single screenshot while retaining strong cross-platform grounding accuracy. The paper pairs the GUI-Lasagne dataset with Universal Block Parsing to reduce dynamic-resolution ambiguity and reports gains across web, mobile, and desktop benchmarks.
