A3: Android Agent Arena for Mobile GUI Agents with Essential-State Procedural Evaluation

Yuxiang Chai, Shunye Tang, Han Xiao, Weifeng Lin, Hanhao Li, Jiayu Zhang, Liang Liu, Pengxiang Zhao, Guangyi Liu, Guozhi Wang, Shuai Ren, Rongduo Han, Haining Zhang, Siyuan Huang, Hongsheng Li

🏛 Institutions: CUHK, vivo AI Lab, SJTU
📅 Date: January 2, 2025
📑 Publisher: arXiv
💻 Env: Mobile
🔑 Keywords: benchmark, essential-state evaluation, procedural evaluation, reward model, A3

TLDR

A3 is a mobile GUI benchmark of 100 tasks across 20 dynamic, online Android apps, designed to evaluate agents beyond static or offline settings. Its essential-state procedural evaluation uses MLLMs as reward models to verify both intermediate progress and final task completion in these live apps.
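
The idea of checking intermediate progress as well as final completion can be illustrated with a minimal sketch. This is not the paper's implementation: the names `evaluate_trajectory`, `EvalResult`, and `keyword_judge` are hypothetical, the scoring rule (fraction of essential states reached) is an assumption, and the judge here is a simple keyword stub standing in for an MLLM reward model.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Hypothetical judge signature: given one essential-state description and the
# observations collected along a trajectory, decide whether that state was reached.
# In a real setup this callable would wrap an MLLM; that part is assumed, not shown.
Judge = Callable[[str, Sequence[str]], bool]


@dataclass
class EvalResult:
    reached: List[bool]  # one verdict per essential state, in order
    progress: float      # fraction of essential states reached (assumed metric)
    success: bool        # True only if every essential state was reached


def evaluate_trajectory(
    essential_states: Sequence[str],
    observations: Sequence[str],
    judge: Judge,
) -> EvalResult:
    """Score a trajectory against a task's essential states."""
    reached = [judge(state, observations) for state in essential_states]
    progress = sum(reached) / len(reached) if reached else 0.0
    success = bool(reached) and all(reached)
    return EvalResult(reached=reached, progress=progress, success=success)


def keyword_judge(state: str, observations: Sequence[str]) -> bool:
    """Stub judge for illustration: a state counts as reached if its text
    appears in any observation. An MLLM would replace this check."""
    return any(state in obs for obs in observations)


if __name__ == "__main__":
    states = ["search page opened", "item added to cart"]
    obs = ["home screen", "search page opened"]
    print(evaluate_trajectory(states, obs, keyword_judge))
```

With the stub judge, a trajectory that reaches only the first of two essential states receives partial progress but is not counted as a success, which is the distinction a final-state-only check would miss.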


Related papers (24)

The Art of Building Verifiers for Computer Use Agents
April 5, 2026 · arXiv

CUARewardBench: A Benchmark for Evaluating Reward Models on Computer-using Agent
October 21, 2025 · arXiv

Web-Shepherd: Advancing PRMs for Reinforcing Web Agents
May 21, 2025 · NeurIPS 2025 (Spotlight)

Benchmarking Living-Screen-Native GUI Agents on Short-Video Platforms
June 3, 2026 · arXiv

AndroidDaily: A Verifiable Benchmark for Mobile GUI Agents on Real-World Closed-Source Applications
May 26, 2026 · arXiv

MobileGym: A Verifiable and Highly Parallel Simulation Platform for Mobile GUI Agent Research
May 25, 2026 · arXiv

SimuWoB: Simulating Real-World Mobile Apps for Fast and Faithful GUI Agent Benchmarking
May 24, 2026 · arXiv

ClawGUI: A Unified Framework for Training, Evaluating, and Deploying GUI Agents
April 13, 2026 · arXiv

CORA: Conformal Risk-Controlled Agents for Safeguarded Mobile GUI Automation
April 10, 2026 · arXiv

KnowU-Bench: Towards Interactive, Proactive, and Personalized Mobile Agent Evaluation
April 9, 2026 · arXiv

Android Coach: Improve Online Agentic Training Efficiency with Single State Multiple Actions
April 8, 2026 · arXiv

Don't Act Blindly: Robust GUI Automation via Action-Effect Verification and Self-Correction
April 7, 2026 · ACL 2026

Proactive Agent Research Environment: Simulating Active Users to Evaluate Proactive Assistants
April 1, 2026 · arXiv

PSPA-Bench: A Personalized Benchmark for Smartphone GUI Agent
March 31, 2026 · arXiv

AndroTMem: From Interaction Trajectories to Anchored Memory in Long-Horizon GUI Agents
March 19, 2026 · arXiv

GUI-CEval: A Hierarchical and Comprehensive Chinese Benchmark for Mobile GUI Agents
March 16, 2026 · CVPR 2026

Video-Based Reward Modeling for Computer-Use Agents
March 10, 2026 · arXiv

SecAgent: Efficient Mobile GUI Agent with Semantic Context
March 9, 2026 · arXiv

PIRA-Bench: A Transition from Reactive GUI Agents to GUI-based Proactive Intent Recommendation Agents
March 9, 2026 · arXiv

Generalization in Online Reinforcement Learning for Mobile Agents
March 8, 2026 · arXiv

MobiFlow: Real-World Mobile Agent Benchmarking through Trajectory Fusion
February 28, 2026 · arXiv

Turing Test on Screen: A Benchmark for Mobile GUI Agent Humanization
February 24, 2026 · arXiv

AmbiBench: Benchmarking Mobile GUI Agents Beyond One-Shot Instructions in the Wild
February 12, 2026 · arXiv

VenusBench-Mobile: A Challenging and User-Centric Benchmark for Mobile GUI Agents with Capability Diagnostics
February 6, 2026 · arXiv