SPA-Bench: A Comprehensive Benchmark for SmartPhone Agent Evaluation

Jingxuan Chen, Derek Yuen, Bin Xie, Yuhao Yang, Gongwei Chen, Zhihao Wu, Yixing Li, Xurui Zhou, Weiwen Liu, Shuai Wang, Kaiwen Zhou, Rui Shao, Liqiang Nie, Yasheng Wang, Jianye Hao, Jun Wang, Kun Shao

🏛 Institutions: Huawei Noah's Ark Lab, HIT-Shenzhen, Tianjin University, UCL
📅 Date: October 19, 2024
📑 Publisher: ICLR 2025 (Spotlight)
💻 Env: Mobile
🔑 Keywords: benchmark, automatic evaluation, cross-app tasks, smartphone agent evaluation, SPA-Bench

TLDR

SPA-Bench is a smartphone-agent benchmark built around 340 Android tasks that span single-app and cross-app settings in both English and Chinese and cover system and third-party apps. It also provides a plug-and-play execution framework and an automatic evaluation pipeline with seven metrics covering task completion and resource usage. Its evaluations expose persistent difficulties in mobile UI interpretation, grounding, and long-horizon execution.
