A3: Android Agent Arena for Mobile GUI Agents with Essential-State Procedural Evaluation
Yuxiang Chai, Shunye Tang, Han Xiao, Weifeng Lin, Hanhao Li, Jiayu Zhang, Liang Liu, Pengxiang Zhao, Guangyi Liu, Guozhi Wang, Shuai Ren, Rongduo Han, Haining Zhang, Siyuan Huang, Hongsheng Li
- 🏛 Institutions
- CUHK, vivo AI Lab, SJTU
- 📅 Date
- January 2, 2025
- 📑 Publisher
- arXiv
- 💻 Env
- Mobile
- 🔑 Keywords
TLDR
A3 is a mobile GUI benchmark built from 100 tasks over 20 dynamic online Android apps to evaluate agents beyond static or offline settings. Its essential-state procedural evaluation uses MLLMs as reward models to verify both intermediate progress and final completion on real online apps.
Related papers
- The Art of Building Verifiers for Computer Use AgentsApril 5, 2026 · arXiv
- CUARewardBench: A Benchmark for Evaluating Reward Models on Computer-using AgentOctober 21, 2025 · arXiv
- Web-Shepherd: Advancing PRMs for Reinforcing Web AgentsMay 21, 2025 · NeurIPS 2025 (Spotlight)
- ClawGUI: A Unified Framework for Training, Evaluating, and Deploying GUI AgentsApril 13, 2026 · arXiv
- CORA: Conformal Risk-Controlled Agents for Safeguarded Mobile GUI AutomationApril 10, 2026 · arXiv
- KnowU-Bench: Towards Interactive, Proactive, and Personalized Mobile Agent EvaluationApril 9, 2026 · arXiv