GUI-CEval: A Hierarchical and Comprehensive Chinese Benchmark for Mobile GUI Agents

Yang Li, Yuchen Liu, Haoyu Lu, Zhiqiang Xia, Hongzhen Wang, Kaiyang Han, Changpeng Yang, Jinyang Wu, Jiaming Xu, Runyu Shi, Ying Huang

🏛 Institutions: HyperAI Team, Xiaomi
📅 Date: March 16, 2026
📑 Publisher: CVPR 2026
💻 Env: Mobile
🔑 Keywords: benchmark, Chinese, hierarchical evaluation, physical-device evaluation, GUI-CEval

TLDR

GUI-CEval is the first comprehensive Chinese benchmark for mobile GUI agents. It spans 201 apps across four device types and uses a hierarchical two-level evaluation structure (atomic abilities and application-level tasks) along five dimensions: perception, planning, reflection, execution, and evaluation. The results reveal that most MLLMs still struggle with reflective decision-making and post-action self-evaluation.
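
To make the two-level, five-dimension structure concrete, the following is a minimal Python sketch of how per-dimension scores could be tallied. It is not the benchmark's actual code or data format: the names `Level`, `Dimension`, `Result`, and `accuracy_by_dimension` are hypothetical, the assumption that any dimension can appear at either level is ours, and the demo results are made-up placeholders rather than figures from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class Level(Enum):
    # The two evaluation levels named in the summary.
    ATOMIC = "atomic ability"
    APPLICATION = "application-level task"


class Dimension(Enum):
    # The five evaluation dimensions named in the summary.
    PERCEPTION = "perception"
    PLANNING = "planning"
    REFLECTION = "reflection"
    EXECUTION = "execution"
    EVALUATION = "evaluation"


@dataclass(frozen=True)
class Result:
    # Hypothetical record for one scored benchmark item.
    level: Level
    dimension: Dimension
    correct: bool


def accuracy_by_dimension(results):
    """Return {Dimension: fraction correct} for dimensions that have results."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for r in results:
        totals[r.dimension] += 1
        hits[r.dimension] += int(r.correct)
    return {d: hits[d] / totals[d] for d in totals}


# Made-up placeholder results, not figures from the paper.
demo = [
    Result(Level.ATOMIC, Dimension.PERCEPTION, True),
    Result(Level.ATOMIC, Dimension.PERCEPTION, True),
    Result(Level.ATOMIC, Dimension.REFLECTION, False),
    Result(Level.APPLICATION, Dimension.REFLECTION, True),
]
scores = accuracy_by_dimension(demo)
```

Keeping the level and the dimension as separate fields lets the same records be regrouped either way, for example to compare atomic-ability scores against application-level scores within a single dimension.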
