See, Plan, Snap: Evaluating Multimodal GUI Agents in Scratch
Xingyi Zhang, Yulei Ye, Kaifeng Huang, Wenhao Li, Xiangfeng Wang
- 🏛 Institutions
- East China Normal University, Tongji University
- 📅 Date
- February 11, 2026
- 📑 Publisher
- arXiv
- 💻 Env
- General GUI
- 🔑 Keywords
TLDR
ScratchWorld evaluates multimodal GUI agents on Scratch program construction tasks that require fine-grained drag-and-drop manipulation. The benchmark exposes a large gap between high-level planning success and low-level GUI execution.
Related papers
- UI-Vision: A Desktop-centric GUI Benchmark for Visual Perception and InteractionMarch 19, 2025 · ICML 2025 (Poster)
- AutoGUI-v2: A Comprehensive Multi-Modal GUI Functionality Understanding BenchmarkApril 27, 2026 · arXiv
- GUI-Perturbed: Domain Randomization Reveals Systematic Brittleness in GUI Grounding ModelsApril 15, 2026 · arXiv
- CocoaBench: Evaluating Unified Digital Agents in the WildApril 13, 2026 · arXiv
- What's Missing in Screen-to-Action? Towards a UI-in-the-Loop Paradigm for Multimodal GUI ReasoningApril 8, 2026 · Findings of ACL 2026
- GUIDE: Interpretable GUI Agent Evaluation via Hierarchical DiagnosisApril 6, 2026 · arXiv