GUI-CEval: A Hierarchical and Comprehensive Chinese Benchmark for Mobile GUI Agents
Yang Li, Yuchen Liu, Haoyu Lu, Zhiqiang Xia, Hongzhen Wang, Kaiyang Han, Changpeng Yang, Jinyang Wu, Jiaming Xu, Runyu Shi, Ying Huang
- 🏛 Institutions
- HyperAI Team, Xiaomi
- 📅 Date
- March 16, 2026
- 📑 Publisher
- CVPR 2026
- 💻 Env
- Mobile
TLDR
GUI-CEval is presented as the first comprehensive Chinese benchmark for mobile GUI agents. It spans 201 apps across four device types and uses a hierarchical two-level evaluation structure (atomic abilities and application-level tasks) along five dimensions: perception, planning, reflection, execution, and evaluation. The results show that most multimodal large language models (MLLMs) still struggle with reflective decision-making and post-action self-evaluation.
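To make the two-level, five-dimension structure concrete, here is a minimal sketch of how per-cell accuracy could be tallied. It assumes each benchmark item is tagged with exactly one level and one dimension. The level and dimension names come from the summary above. The `Result` record, its field names, and `accuracy_by_cell` are hypothetical and are not taken from the GUI-CEval release.

```python
from collections import defaultdict
from dataclasses import dataclass

# Names below follow the TLDR; the record layout itself is an assumption.
LEVELS = ("atomic", "application")
DIMENSIONS = ("perception", "planning", "reflection", "execution", "evaluation")


@dataclass(frozen=True)
class Result:
    """One graded benchmark item (hypothetical schema)."""
    level: str
    dimension: str
    correct: bool


def accuracy_by_cell(results):
    """Return {(level, dimension): accuracy} for every cell that has results."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for r in results:
        # Reject tags outside the two levels and five dimensions.
        if r.level not in LEVELS or r.dimension not in DIMENSIONS:
            raise ValueError(f"unknown cell: {r.level}/{r.dimension}")
        key = (r.level, r.dimension)
        totals[key] += 1
        hits[key] += int(r.correct)
    return {key: hits[key] / totals[key] for key in totals}
```

Reporting one score per (level, dimension) cell, rather than a single average, is what would let a weakness in reflection or evaluation stand out from stronger perception or execution scores.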
Related papers
- VenusBench-GD: A Comprehensive Multi-Platform GUI Benchmark for Diverse Grounding Tasks (December 18, 2025 · arXiv)
- VideoGUI: A Benchmark for GUI Automation from Instructional Videos (June 14, 2024 · NeurIPS 2024 Datasets and Benchmarks Track)
- CORA: Conformal Risk-Controlled Agents for Safeguarded Mobile GUI Automation (April 10, 2026 · arXiv)
- KnowU-Bench: Towards Interactive, Proactive, and Personalized Mobile Agent Evaluation (April 9, 2026 · arXiv)
- Don't Act Blindly: Robust GUI Automation via Action-Effect Verification and Self-Correction (April 7, 2026 · ACL 2026)
- Proactive Agent Research Environment: Simulating Active Users to Evaluate Proactive Assistants (April 1, 2026 · arXiv)