Watch and Learn: Learning to Use Computers from Online Videos
Chan Hee Song, Yiwen Song, Palash Goyal, Yu Su, Oriana Riva, Hamid Palangi, Tomas Pfister
- 🏛 Institutions: Google Cloud AI Research, OSU
- 📅 Date: October 6, 2025
- 📑 Publisher: CVPR 2026
- 💻 Env: Desktop
TLDR
Watch & Learn converts Internet videos of humans using computers into more than 53K executable UI trajectories by framing annotation as an inverse dynamics problem: given two consecutive screen states, predict the action that produced the transition. The resulting data improves both general-purpose and specialized computer-use agents (CUAs) on OSWorld and yields state-of-the-art 7B-scale performance on WindowsAgentArena under the 15-step limit.
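As a rough illustration of the inverse dynamics framing, the toy sketch below labels a sequence of screen states by inferring the action between each consecutive pair. It is not the paper's model or pipeline: `label_trajectory`, `toy_idm`, and the dict-based screen states are hypothetical names and simplifications introduced here, and a real system would presumably use a learned predictor over screenshots rather than a hand-written diff.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Assumption: a screen state is reduced to a tiny dict for illustration only.
State = Dict[str, Optional[str]]


def label_trajectory(
    frames: List[State],
    idm: Callable[[State, State], str],
) -> List[Tuple[State, str]]:
    """Pair each frame with the action inferred from it and its successor."""
    return [
        (frames[i], idm(frames[i], frames[i + 1]))
        for i in range(len(frames) - 1)
    ]


def toy_idm(before: State, after: State) -> str:
    """Hypothetical stand-in for a learned inverse dynamics model."""
    if before["text"] != after["text"]:
        # Treat any newly appended characters as typed text.
        typed = (after["text"] or "")[len(before["text"] or ""):]
        return f"type({typed!r})"
    if before["focus"] != after["focus"]:
        return f"click({after['focus']!r})"
    return "noop"


frames: List[State] = [
    {"focus": None, "text": ""},
    {"focus": "search_box", "text": ""},
    {"focus": "search_box", "text": "cats"},
]

trajectory = label_trajectory(frames, toy_idm)
actions = [action for _, action in trajectory]
print(actions)
```

Under these assumptions, N frames yield N-1 labeled steps, and each step only needs the two adjacent states, which is what makes unlabeled video usable as a source of trajectories.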
Related papers (24)
- CUA-Suite: Massive Human-annotated Video Demonstrations for Computer-Use Agents · March 25, 2026 · arXiv
- CoAct-1: Computer-using Multi-Agent System with Coding Actions · August 5, 2025 · ICLR 2026 (Poster)
- VideoAgentTrek: Computer Use Pretraining from Unlabeled Videos · October 22, 2025 · arXiv
- Agent S2: A Compositional Generalist-Specialist Framework for Computer Use Agents · April 1, 2025 · COLM 2025
- Gym-Anything: Turn any Software into an Agent Environment · April 7, 2026 · arXiv
- IntentScore: Intent-Conditioned Action Evaluation for Computer-Use Agents · April 6, 2026 · arXiv
- GUIDE: Resolving Domain Bias in GUI Agents through Real-Time Web Video Retrieval and Plug-and-Play Annotation · March 27, 2026 · arXiv
- Video-Based Reward Modeling for Computer-Use Agents · March 10, 2026 · arXiv
- When Actions Go Off-Task: Detecting and Correcting Misaligned Actions in Computer-Use Agents · February 9, 2026 · arXiv
- EvoCUA: Evolving Computer Use Agents via Learning from Scalable Synthetic Experience · January 22, 2026 · arXiv
- CaMeLs Can Use Computers Too: System-level Security for Computer Use Agents · January 14, 2026 · arXiv
- ShowUI-π: Flow-based Generative Models as GUI Dexterous Hands · December 31, 2025 · arXiv
- GUI-360: A Comprehensive Dataset and Benchmark for Computer-Using Agents · November 6, 2025 · arXiv
- Scaling Agents for Computer Use · October 2, 2025 · arXiv
- Efficient Multi-turn RL for GUI Agents via Decoupled Training and Adaptive Data Curation · September 28, 2025 · arXiv
- ComputerRL: Scaling End-to-End Online Reinforcement Learning for Computer Use Agents · August 19, 2025 · ICLR 2026 (Poster)
- OpenCUA: Open Foundations for Computer-Use Agents · August 12, 2025 · NeurIPS 2025 (Spotlight)
- Evolving in Tasks: Empowering the Multi-modality Large Language Model as the Computer Use Agent · August 6, 2025 · arXiv
- NaturalGAIA: Pushing the Frontiers of GUI Agents with a Challenging Benchmark and High-Quality Trajectory Dataset · August 2, 2025 · arXiv
- LiteCUA: Computer as MCP Server for Computer-Use Agent on AIOS · May 24, 2025 · arXiv
- Efficient Agent Training for Computer Use · May 20, 2025 · ICLR 2026 (Poster)
- UI-Vision: A Desktop-centric GUI Benchmark for Visual Perception and Interaction · March 19, 2025 · ICML 2025 (Poster)
- STEVE: A Step Verification Pipeline for Computer-use Agent Training · March 16, 2025 · arXiv
- DeskVision: Large Scale Desktop Region Captioning for Advanced GUI Agents · March 14, 2025 · arXiv