GUI Agents Papers

Watch and Learn: Learning to Use Computers from Online Videos

Chan Hee Song, Yiwen Song, Palash Goyal, Yu Su, Oriana Riva, Hamid Palangi, Tomas Pfister

🏛 Institutions
Google Cloud AI Research, OSU
📅 Date
October 6, 2025
📑 Publisher
CVPR 2026
💻 Env
Desktop
🔑 Keywords
TLDR

Watch & Learn converts Internet videos of human computer use into more than 53K executable UI trajectories by framing annotation as an inverse dynamics problem: predicting the user action that links two consecutive screen states. The resulting data improves both general-purpose and specialized computer-use agents (CUAs) on OSWorld and yields state-of-the-art 7B-scale performance on WindowsAgentArena under the 15-step limit.
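To make the inverse dynamics framing concrete, here is a minimal sketch of how consecutive screen states could be turned into a labeled trajectory. Everything in it is an illustrative assumption rather than the paper's implementation: `Screen`, `Step`, `label_trajectory`, and `toy_idm` are hypothetical names, and the rule-based `toy_idm` only stands in for a learned model that would operate on real screenshots.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# Toy stand-in for a screenshot: the tuple of text lines visible on screen.
Screen = Tuple[str, ...]


@dataclass(frozen=True)
class Step:
    observation: Screen  # screen state before the action
    action: str          # action inferred to have produced the next state


def label_trajectory(
    frames: Sequence[Screen],
    idm: Callable[[Screen, Screen], str],
) -> List[Step]:
    """Label each consecutive frame pair with the action the IDM infers."""
    return [Step(prev, idm(prev, nxt)) for prev, nxt in zip(frames, frames[1:])]


def toy_idm(before: Screen, after: Screen) -> str:
    """Hypothetical rule-based placeholder for a learned inverse dynamics model."""
    if before == after:
        return "wait"
    if len(after) > len(before):
        # Assume newly appeared text was typed by the user.
        return f"type({after[-1]!r})"
    return "click"


frames = [("Search",), ("Search", "weather"), ("Results",), ("Results",)]
trajectory = label_trajectory(frames, toy_idm)
for step in trajectory:
    print(step.observation, "->", step.action)
```

A video with N frames yields N - 1 labeled steps, since each action is inferred from a pair of adjacent states; no manual annotation of the video is needed once such a model exists.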
