GUI Agents Papers

CUARewardBench: A Benchmark for Evaluating Reward Models on Computer-using Agent

Haojia Lin, Xiaoyu Tan, Yulei Qin, Zihan Xu, Yuchen Shi, Zongyi Li, Gang Li, Shaofei Cai, Siqi Cai, Chaoyou Fu, Ke Li, Xing Sun

🏛 Institutions
Tencent Youtu Lab, PKU, NJU
📅 Date
October 21, 2025
📑 Publisher
arXiv
💻 Env
Desktop
TLDR

CUARewardBench benchmarks both outcome and process reward models for desktop computer-use evaluation using expert-annotated trajectories from 10 software categories and 7 agent architectures. It shows that current reward models are still unreliable and introduces Unanimous Prompt Ensemble (UPE) to improve reward-model precision.
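The summary above does not spell out how UPE combines prompts. As a minimal sketch, assuming (from the name) that several prompt variants of a reward model each judge the same trajectory and a verdict is kept only when all of them agree, the voting step might look like the following. The `Judge` type and `unanimous_verdict` function are hypothetical names for illustration, not the paper's API.

```python
from typing import Callable, Optional, Sequence

# Hypothetical judge signature: takes a trajectory description and
# returns True (task succeeded) or False (task failed).
Judge = Callable[[str], bool]


def unanimous_verdict(trajectory: str, judges: Sequence[Judge]) -> Optional[bool]:
    """Return the shared verdict if every judge agrees, otherwise None (abstain)."""
    if not judges:
        raise ValueError("at least one judge is required")
    verdicts = [judge(trajectory) for judge in judges]
    first = verdicts[0]
    # Accept the verdict only under strict agreement; any dissent means abstain.
    return first if all(v == first for v in verdicts) else None
```

Under this assumption, abstaining on disagreements trades coverage for precision: fewer trajectories receive a label, but the labels that remain are backed by every prompt variant.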
