AndroidDaily: A Verifiable Benchmark for Mobile GUI Agents on Real-World Closed-Source Applications

Yifan Sui, Xin Huang, Hongbing Li, Fang Xu, Jiahe Lv, Haolong Yan, Yeqing Shen, Litao Liu, Zhimin Fan, Ziyang Meng, Jia Wang, Junbo Qi, Kaijun Tan, Zheng Ge, Xiangyu Zhang, Daxin Jiang, Osamu Yoshie

🏛 Institutions: BUPT, StepFun, Waseda
📅 Date: May 26, 2026
📑 Publisher: arXiv
💻 Env: Mobile
🔑 Keywords: benchmark, mobile GUI agent, closed-source apps, trajectory evaluation, AndroidDaily, GRADE

TLDR

AndroidDaily is a verifiable benchmark of 350 daily-use tasks across 94 commercial, closed-source Android apps for evaluating mobile GUI agents. It introduces GRADE, an evaluator that judges agents by tracking the visual trajectory against observable external guidelines rather than internal app state, reaching 87.37% agreement with human judgment.
