
CUARewardBench: A Benchmark for Evaluating Reward Models on Computer-using Agent

Haojia Lin, Xiaoyu Tan, Yulei Qin, Zihan Xu, Yuchen Shi, Zongyi Li, Gang Li, Shaofei Cai, Siqi Cai, Chaoyou Fu, Ke Li, Xing Sun

🏛 Institutions
Tencent Youtu Lab, PKU, NJU
📅 Date
October 21, 2025
📑 Publisher
arXiv
💻 Env
Desktop
TLDR

CUARewardBench evaluates both outcome and process reward models on desktop computer-use tasks, using expert-annotated trajectories that span 10 software categories and 7 agent architectures. It finds that current reward models remain unreliable and introduces Unanimous Prompt Ensemble (UPE) to improve their precision.
