See, Plan, Snap: Evaluating Multimodal GUI Agents in Scratch

Xingyi Zhang, Yulei Ye, Kaifeng Huang, Wenhao Li, Xiangfeng Wang

🏛 Institutions: East China Normal University, Tongji University
📅 Date: February 11, 2026
📑 Publisher: arXiv
💻 Env: General GUI
🔑 Keywords: benchmark, ScratchWorld, drag-and-drop, block-based programming, reasoning-acting gap, Scratch

TLDR

ScratchWorld evaluates multimodal GUI agents on Scratch program construction tasks that require fine-grained drag-and-drop manipulation. The benchmark exposes a large gap between high-level planning success and low-level GUI execution.


Related papers (24)

UI-Vision: A Desktop-centric GUI Benchmark for Visual Perception and Interaction

March 19, 2025 · ICML 2025 (Poster)
AutoGUI-v2: A Comprehensive Multi-Modal GUI Functionality Understanding Benchmark

April 27, 2026 · arXiv
GUI-Perturbed: Domain Randomization Reveals Systematic Brittleness in GUI Grounding Models

April 15, 2026 · arXiv
CocoaBench: Evaluating Unified Digital Agents in the Wild

April 13, 2026 · arXiv
What's Missing in Screen-to-Action? Towards a UI-in-the-Loop Paradigm for Multimodal GUI Reasoning

April 8, 2026 · Findings of ACL 2026
GUIDE: Interpretable GUI Agent Evaluation via Hierarchical Diagnosis

April 6, 2026 · arXiv
GUIDE: A Benchmark for Understanding and Assisting Users in Open-Ended GUI Tasks

March 26, 2026 · CVPR 2026
LPS-Bench: Benchmarking Safety Awareness of Computer-Use Agents in Long-Horizon Planning under Benign and Adversarial Scenarios

February 3, 2026 · arXiv
LongHorizonUI: A Unified Framework for Robust long-horizon Task Automation of GUI Agent

January 26, 2026 · ICLR 2026 (Poster)
GUIGuard: Toward a General Framework for Privacy-Preserving GUI Agents

January 26, 2026 · arXiv
DrawingBench: Evaluating Spatial Reasoning and UI Interaction Capabilities of Large Language Models through Mouse-Based Drawing Tasks

December 1, 2025 · AAAI 2026 TrustAgent Workshop
MPR-GUI: Benchmarking and Enhancing Multilingual Perception and Reasoning in GUI Agents

November 30, 2025 · arXiv
Beyond Clicking: A Step Towards Generalist GUI Grounding via Text Dragging

November 7, 2025 · arXiv
SCUBA: Salesforce Computer Use Benchmark

September 30, 2025 · ICLR 2026 (Poster)
Scaling Computer‑Use Grounding via User Interface Decomposition and Synthesis

May 19, 2025 · NeurIPS 2025 Datasets and Benchmarks Track (Spotlight)
UI-E2I-Synth: Advancing GUI Grounding with Large-Scale Instruction Synthesis

April 15, 2025 · Findings of ACL 2025
Read Anywhere Pointed: Layout-aware GUI Screen Reading with Tree-of-Lens Grounding

June 27, 2024 · EMNLP 2024 (Poster)
SheetCopilot: Bringing Software Productivity to the Next Level through Large Language Models

May 30, 2023 · NeurIPS 2023
Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional Fields

June 9, 2026 · arXiv
Benchmarking Living-Screen-Native GUI Agents on Short-Video Platforms

June 3, 2026 · arXiv
AndroidDaily: A Verifiable Benchmark for Mobile GUI Agents on Real-World Closed-Source Applications

May 26, 2026 · arXiv
MobileGym: A Verifiable and Highly Parallel Simulation Platform for Mobile GUI Agent Research

May 25, 2026 · arXiv
SimuWoB: Simulating Real-World Mobile Apps for Fast and Faithful GUI Agent Benchmarking

May 24, 2026 · arXiv
WindowsWorld: A Process-Centric Benchmark of Autonomous GUI Agents in Professional Cross-Application Environments

April 30, 2026 · arXiv