Modular and Multi-Path-Aware Offline Benchmarking for Mobile GUI Agents
Youngmin Im, Byeongung Jo, Jaeyoung Wi, Seungwoo Baek, Tae Hoon Min, Joo Hyung Lee, Sangeun Oh, Insik Shin, Sunjae Lee
- 🏛 Institutions: KAIST, Sungkyunkwan University, Korea University, Fluiz
- 📅 Date: December 14, 2025
- 📑 Publisher: arXiv
- 💻 Env: Mobile
- 🔑 Keywords
TLDR
MobiBench is an offline mobile-agent benchmark that explicitly supports multiple valid action paths and evaluates agent modules separately rather than treating the system as a black box. The paper reports 94.72% agreement with human evaluators while preserving the scalability and reproducibility advantages of offline evaluation.
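To make the multi-path idea concrete, here is a minimal sketch of how an offline evaluator might accept any one of several annotated action paths, and how agreement with human verdicts could be computed. This is not MobiBench's actual implementation: the `Action` representation, the function names, and the toy task data are all assumptions made for illustration, and exact-match comparison stands in for whatever matching rule the paper uses.

```python
from typing import Sequence, Set, Tuple

# Hypothetical action representation: (action_type, target_element).
Action = Tuple[str, str]


def allowed_next(history: Sequence[Action],
                 valid_paths: Sequence[Sequence[Action]]) -> Set[Action]:
    """Actions that continue at least one valid path consistent with the history."""
    n = len(history)
    return {
        path[n]
        for path in valid_paths
        if len(path) > n and tuple(path[:n]) == tuple(history)
    }


def follows_valid_path(predicted: Sequence[Action],
                       valid_paths: Sequence[Sequence[Action]]) -> bool:
    """True if the predicted sequence exactly matches any annotated valid path."""
    return any(tuple(predicted) == tuple(path) for path in valid_paths)


def agreement_rate(offline: Sequence[bool], human: Sequence[bool]) -> float:
    """Fraction of episodes where the offline verdict equals the human verdict."""
    if len(offline) != len(human) or not offline:
        raise ValueError("verdict lists must be non-empty and of equal length")
    matches = sum(o == h for o, h in zip(offline, human))
    return matches / len(offline)


# Toy task (illustrative only): two different ways to turn on Wi-Fi.
paths = [
    [("tap", "Settings"), ("tap", "Wi-Fi"), ("toggle", "wifi_switch")],
    [("swipe", "quick_panel"), ("toggle", "wifi_tile")],
]

# An agent taking the shorter route is still judged correct.
shortcut_ok = follows_valid_path(paths[1], paths)

# At the first step, either path's opening action is acceptable.
first_steps = allowed_next([], paths)
```

A single-path offline benchmark would mark the quick-panel route as a failure because it deviates from the one recorded trajectory; checking against a set of paths avoids that false negative while keeping evaluation fully offline.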
Related papers
- SMAN-Bench: A Cross-System Benchmark for Mobile Agents under Single- and Multi-path, Ambiguous, and Noisy Tasks · January 26, 2026 · ICLR 2026 (Poster)
- Mobile-Bench-v2: A More Realistic and Comprehensive Benchmark for VLM-based Mobile Agents · May 17, 2025 · arXiv
- CORA: Conformal Risk-Controlled Agents for Safeguarded Mobile GUI Automation · April 10, 2026 · arXiv
- KnowU-Bench: Towards Interactive, Proactive, and Personalized Mobile Agent Evaluation · April 9, 2026 · arXiv
- Don't Act Blindly: Robust GUI Automation via Action-Effect Verification and Self-Correction · April 7, 2026 · ACL 2026
- Proactive Agent Research Environment: Simulating Active Users to Evaluate Proactive Assistants · April 1, 2026 · arXiv