Watch and Learn: Learning to Use Computers from Online Videos
Chan Hee Song, Yiwen Song, Palash Goyal, Yu Su, Oriana Riva, Hamid Palangi, Tomas Pfister
- 🏛 Institutions: Google Cloud AI Research, OSU
- 📅 Date: October 6, 2025
- 📑 Publisher: CVPR 2026
- 💻 Env: Desktop
- 🔑 Keywords
TLDR
Watch & Learn converts Internet videos of human computer use into more than 53K executable UI trajectories by framing annotation as an inverse dynamics problem over consecutive screen states. The resulting data improves both general-purpose and specialized CUAs on OSWorld and yields state-of-the-art 7B-scale performance on WindowsAgentArena under the 15-step limit.
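To make the inverse dynamics framing concrete, here is a minimal sketch, not the authors' pipeline: given an ordered list of screen states, an inverse dynamics model (IDM) is asked which action explains each transition between consecutive states. The names `Step`, `label_trajectory`, and `toy_idm` are hypothetical, and screen states are stood in for by plain strings rather than real screenshots.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Step:
    before: str  # screen state at time t (placeholder for a screenshot)
    action: str  # action inferred to explain the transition
    after: str   # screen state at time t+1


def label_trajectory(
    frames: Sequence[str],
    idm: Callable[[str, str], str],
) -> List[Step]:
    """Pair each frame with its successor and ask the IDM for the action."""
    return [Step(a, idm(a, b), b) for a, b in zip(frames, frames[1:])]


def toy_idm(before: str, after: str) -> str:
    # Hypothetical stand-in: a real IDM would be a learned model that
    # predicts a UI action (click, type, scroll, ...) from two screenshots.
    return f"{before}->{after}"


frames = ["s0", "s1", "s2"]
steps = label_trajectory(frames, toy_idm)
```

A video with N sampled frames yields N-1 labeled steps under this pairing, which is how unlabeled recordings can become action-annotated trajectories.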
Related papers
- CUA-Suite: Massive Human-annotated Video Demonstrations for Computer-Use Agents (March 25, 2026 · arXiv)
- CoAct-1: Computer-using Multi-Agent System with Coding Actions (August 5, 2025 · ICLR 2026 (Poster))
- VideoAgentTrek: Computer Use Pretraining from Unlabeled Videos (October 22, 2025 · arXiv)
- Agent S2: A Compositional Generalist-Specialist Framework for Computer Use Agents (April 1, 2025 · COLM 2025)
- Gym-Anything: Turn any Software into an Agent Environment (April 7, 2026 · arXiv)
- IntentScore: Intent-Conditioned Action Evaluation for Computer-Use Agents (April 6, 2026 · arXiv)