Video-Based Reward Modeling for Computer-Use Agents
Linxin Song , Jieyu Zhang , Huanxin Sheng , Taiwei Shi , Gupta Rahul , Yang Liu , Ranjay Krishna , Jian Kang , Jieyu Zhao
- 🏛 Institutions
- USC , University of Washington , MBZUAI , Amazon AGI
- 📅 Date
- March 10, 2026
- 📑 Publisher
- arXiv
- 💻 Env
- Desktop Mobile
- 🔑 Keywords
TLDR
This paper studies reward modeling from execution video rather than agent internals, introducing the ExeVR-53k dataset and an execution-video reward model that predicts success from keyframes plus the user instruction. The model scales evaluation across Ubuntu, macOS, Windows, and Android, outperforming strong proprietary models while providing finer temporal attribution.
Related papers (24)
- UI-Genie: A Self-Improving Approach for Iteratively Boosting MLLM-based Mobile GUI AgentsMay 27, 2025 · NeurIPS 2025 (Poster)
- NaturalGAIA: Pushing the Frontiers of GUI Agents with a Challenging Benchmark and High-Quality Trajectory DatasetAugust 2, 2025 · arXiv
- Web-Shepherd: Advancing PRMs for Reinforcing Web AgentsMay 21, 2025 · NeurIPS 2025 (Spotlight)
- AgentRewardBench: Evaluating Automatic Evaluations of Web Agent TrajectoriesApril 11, 2025 · COLM 2025
- SpiritSight Agent: Advanced GUI Agent with One LookMarch 5, 2025 · CVPR 2025 (Poster)
- OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task SynthesisDecember 27, 2024 · ACL 2025
- OS-ATLAS: A Foundation Action Model for Generalist GUI AgentsOctober 30, 2024 · ICLR 2025 (Spotlight)
- Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI AgentsOctober 7, 2024 · ICLR 2025 (Oral)
- GUI-World: A Video Benchmark and Dataset for Multimodal GUI-oriented UnderstandingJune 16, 2024 · ICLR 2025 (Poster)
- SeeClick: Harnessing GUI Grounding for Advanced Visual GUI AgentsJanuary 17, 2024 · ACL 2024
- AndroidDaily: A Verifiable Benchmark for Mobile GUI Agents on Real-World Closed-Source ApplicationsMay 26, 2026 · arXiv
- ClawGUI: A Unified Framework for Training, Evaluating, and Deploying GUI AgentsApril 13, 2026 · arXiv
- Android Coach: Improve Online Agentic Training Efficiency with Single State Multiple ActionsApril 8, 2026 · arXiv
- Gym-Anything: Turn any Software into an Agent EnvironmentApril 7, 2026 · arXiv
- IntentScore: Intent-Conditioned Action Evaluation for Computer-Use AgentsApril 6, 2026 · arXiv
- PSPA-Bench: A Personalized Benchmark for Smartphone GUI AgentMarch 31, 2026 · arXiv
- CUA-Suite: Massive Human-annotated Video Demonstrations for Computer-Use AgentsMarch 25, 2026 · arXiv
- SecAgent: Efficient Mobile GUI Agent with Semantic ContextMarch 9, 2026 · arXiv
- Turing Test on Screen: A Benchmark for Mobile GUI Agent HumanizationFebruary 24, 2026 · arXiv
- AmbiBench: Benchmarking Mobile GUI Agents Beyond One-Shot Instructions in the WildFebruary 12, 2026 · arXiv
- When Actions Go Off-Task: Detecting and Correcting Misaligned Actions in Computer-Use AgentsFebruary 9, 2026 · arXiv
- MemGUI-Bench: Benchmarking Memory of Mobile GUI Agents in Dynamic EnvironmentsFebruary 3, 2026 · arXiv
- MAGNET: Towards Adaptive GUI Agents with Memory-Driven Knowledge EvolutionJanuary 27, 2026 · arXiv
- SwipeGen: Bridging the Execution Gap in GUI Agents via Human-like Swipe SynthesisJanuary 26, 2026 · arXiv