SpiritSight Agent: Advanced GUI Agent with One Look
Zhiyuan Huang, Ziming Cheng, Junting Pan, Zhaohui Hou, Mingjie Zhan
- 🏛 Institutions
- SenseTime Research, Beijing University of Posts and Telecommunications, MMLab, CUHK
- 📅 Date
- March 5, 2025
- 📑 Publisher
- CVPR 2025 (Poster)
- 💻 Env
- Desktop, Mobile, Web
TLDR
SpiritSight is an end-to-end GUI agent designed to act from a single screenshot while retaining strong cross-platform grounding accuracy. The paper pairs the GUI-Lasagne dataset with Universal Block Parsing to reduce dynamic-resolution ambiguity and reports gains across web, mobile, and desktop benchmarks.
Related papers
- OS-ATLAS: A Foundation Action Model for Generalist GUI Agents · October 30, 2024 · ICLR 2025 (Spotlight)
- OpenCUA: Open Foundations for Computer-Use Agents · August 12, 2025 · NeurIPS 2025 (Spotlight)
- ShowUI: One Vision-Language-Action Model for GUI Visual Agent · November 26, 2024 · CVPR 2025 (Poster)
- CogAgent: A Visual Language Model for GUI Agents · December 14, 2023 · CVPR 2024 (Highlight)
- MolmoWeb: Open Visual Web Agent and Open Data for the Open Web · April 9, 2026 · arXiv
- SecAgent: Efficient Mobile GUI Agent with Semantic Context · March 9, 2026 · arXiv