Mobile-Bench-v2: A More Realistic and Comprehensive Benchmark for VLM-based Mobile Agents

Weikai Xu, Zhizheng Jiang, Yuxuan Liu, Pengzhi Gao, Wei Liu, Jian Luan, Yuanchun Li, Yunxin Liu, Bin Wang, Bo An

🏛 Institutions: NTU, University of Electronic Science and Technology of China, Renmin University of China, XiaoMi AI Lab, Institute for AI Industry Research (AIR), Tsinghua
📅 Date: May 17, 2025
📑 Publisher: arXiv
💻 Env: Mobile
🔑 Keywords: benchmark, multi-path evaluation, noisy environments, ambiguous instructions, Mobile-Bench-v2

TLDR

Mobile-Bench-v2 is a more realistic mobile-agent benchmark that fixes three weaknesses of earlier evaluation: single-path scoring, unrealistically clean environments, and over-specified instructions. It adds multi-path offline evaluation, noisy app settings with pop-ups and ads, and ambiguous-instruction splits for testing proactive interaction.
