Watch and Learn: Learning to Use Computers from Online Videos

Chan Hee Song, Yiwen Song, Palash Goyal, Yu Su, Oriana Riva, Hamid Palangi, Tomas Pfister

🏛 Institutions: Google Cloud AI Research, OSU
📅 Date: October 6, 2025
📑 Publisher: CVPR 2026
💻 Env: Desktop
🔑 Keywords: dataset, video demonstrations, inverse dynamics, trajectory annotation, OSWorld, WindowsAgentArena, Watch & Learn

TLDR

Watch & Learn converts Internet videos of human computer use into more than 53K executable UI trajectories by framing annotation as an inverse dynamics problem, in which the user's action is inferred from consecutive screen states. The resulting data improves both general-purpose and specialized computer-use agents (CUAs) on OSWorld and yields state-of-the-art 7B-scale performance on WindowsAgentArena under the 15-step limit.
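
To make the inverse dynamics framing concrete, here is a minimal sketch, not the paper's implementation. It assumes a video has already been reduced to an ordered list of screen states and labels each consecutive pair with the action that plausibly produced the change. The names `Action`, `label_trajectory`, and `toy_idm` are hypothetical, screens are stand-in strings rather than screenshots, and the rule-based `toy_idm` stands in for a learned inverse dynamics model.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# Assumption: a screen state is represented here by a plain string
# (for example, the text in a field); a real pipeline would use screenshots.
Screen = str


@dataclass(frozen=True)
class Action:
    kind: str            # illustrative action types such as "click" or "type"
    argument: str = ""   # e.g. the typed text; empty when not applicable


# An inverse dynamics model maps (state_t, state_t+1) to the action between them.
InverseDynamicsModel = Callable[[Screen, Screen], Action]


def label_trajectory(
    frames: Sequence[Screen], idm: InverseDynamicsModel
) -> List[Tuple[Screen, Action]]:
    """Pair each screen state with the action inferred from it and its successor."""
    return [
        (frames[t], idm(frames[t], frames[t + 1]))
        for t in range(len(frames) - 1)
    ]


def toy_idm(before: Screen, after: Screen) -> Action:
    """Hypothetical rule-based stand-in for a learned model."""
    # If the new screen extends the old one, assume the user typed the suffix.
    if after.startswith(before) and len(after) > len(before):
        return Action("type", after[len(before):])
    # Otherwise assume some click changed the screen.
    return Action("click")


frames = ["", "hi", "hi!", "menu"]
trajectory = label_trajectory(frames, toy_idm)
for state, action in trajectory:
    print(repr(state), "->", action)
```

A sequence of N screen states yields N - 1 labeled steps, so a video with fewer than two usable frames produces an empty trajectory.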
