GUI Agents Papers

AmbiBench: Benchmarking Mobile GUI Agents Beyond One-Shot Instructions in the Wild

Jiazheng Sun, Mingxuan Li, Yingying Zhang, Jiayang Niu, Yachen Wu, Ruihan Jin, Shuyu Lei, Pengrongrui Tan, Zongyu Zhang, Ruoyi Wang, Jiachen Yang, Boyu Yang, Jiacheng Liu, Xin Peng

🏛 Institutions
Fudan, Jilin University
📅 Date
February 12, 2026
📑 Publisher
arXiv
💻 Env
Mobile
🔑 Keywords
TLDR

Reframes mobile-agent evaluation around intent alignment rather than perfect one-shot instructions by organizing 240 real-world tasks across 25 apps into four clarity levels, from detailed to ambiguous. It also introduces MUSE, an automated judge that scores not just task completion but also interaction quality, showing that current agents still struggle badly when user instructions are incomplete or ambiguous.
