Video-Based Reward Modeling for Computer-Use Agents

Linxin Song, Jieyu Zhang, Huanxin Sheng, Taiwei Shi, Gupta Rahul, Yang Liu, Ranjay Krishna, Jian Kang, Jieyu Zhao

🏛 Institutions: USC, University of Washington, MBZUAI, Amazon AGI
📅 Date: March 10, 2026
📑 Publisher: arXiv
💻 Env: Desktop, Mobile
🔑 Keywords: reward model, dataset, execution video, trajectory evaluation, spatiotemporal token pruning, ExeVR-53k, ExeVRM

TLDR

This paper studies reward modeling from execution video rather than agent internals. It introduces the ExeVR-53k dataset and ExeVRM, an execution-video reward model that predicts task success from keyframes plus the user instruction. The model scales evaluation across Ubuntu, macOS, Windows, and Android, outperforming strong proprietary models while providing finer temporal attribution.
