SPA-Bench: A Comprehensive Benchmark for SmartPhone Agent Evaluation
Jingxuan Chen, Derek Yuen, Bin Xie, Yuhao Yang, Gongwei Chen, Zhihao Wu, Yixing Li, Xurui Zhou, Weiwen Liu, Shuai Wang, Kaiwen Zhou, Rui Shao, Liqiang Nie, Yasheng Wang, Jianye Hao, Jun Wang, Kun Shao
- 🏛 Institutions: Huawei Noah's Ark Lab, HIT-Shenzhen, Tianjin University, UCL
- 📅 Date: October 19, 2024
- 📑 Publisher: ICLR 2025 (Spotlight)
- 💻 Env: Mobile
- 🔑 Keywords
TLDR
SPA-Bench is a smartphone-agent benchmark built around 340 Android tasks spanning single-app and cross-app settings in both English and Chinese, with system and third-party apps. It also provides a plug-and-play execution framework and an automatic evaluation pipeline with seven task-completion and resource-usage metrics, exposing persistent difficulties in mobile UI interpretation, grounding, and long-horizon execution.
Related papers
- REAL: Benchmarking Autonomous Agents on Deterministic Simulations of Real Websites · April 15, 2025 · arXiv
- WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models · January 25, 2024 · ACL 2024
- CORA: Conformal Risk-Controlled Agents for Safeguarded Mobile GUI Automation · April 10, 2026 · arXiv
- KnowU-Bench: Towards Interactive, Proactive, and Personalized Mobile Agent Evaluation · April 9, 2026 · arXiv
- Don't Act Blindly: Robust GUI Automation via Action-Effect Verification and Self-Correction · April 7, 2026 · ACL 2026
- Proactive Agent Research Environment: Simulating Active Users to Evaluate Proactive Assistants · April 1, 2026 · arXiv