MobileBench-OL: A Comprehensive Chinese Benchmark for Evaluating Mobile GUI Agents in Real-World Environment

Qinzhuo Wu, Zhizhuo Yang, Hanhao Li, Pengzhi Gao, Wei Liu, Jian Luan

🏛 Institutions: MiLM Plus, Xiaomi, PKU, CUHK
📅 Date: January 28, 2026
📑 Publisher: arXiv
💻 Env: Mobile
🔑 Keywords: benchmark, chinese benchmark, long-horizon tasks, noise robustness, auto-evaluation, MobileBench-OL

TLDR

MobileBench-OL benchmarks mobile GUI agents on 1,080 online tasks from 80 Chinese apps. It extends evaluation beyond instruction following to long-horizon execution, reasoning and exploration, and robustness to real-world noise, and pairs the benchmark with an automatic evaluation pipeline that supports environment reset.
