VideoAgentTrek: Computer Use Pretraining from Unlabeled Videos
Dunjie Lu, Yiheng Xu, Junli Wang, Haoyuan Wu, Xinyuan Wang, Zekun Wang, Junlin Yang, Hongjin Su, Jixuan Chen, Junda Chen, Yuchen Mao, Jingren Zhou, Junyang Lin, Binyuan Hui, Tao Yu
- 🏛 Institutions
- Google Cloud AI Research, OSU
- 📅 Date
- October 22, 2025
- 📑 Publisher
- arXiv
- 💻 Env
- General GUI
- 🔑 Keywords
TLDR
VideoAgentTrek studies how to pretrain computer-use agents from passive screen recordings instead of manually labeled trajectories. Its Video2Action pipeline recovers action boundaries and structured parameters from 39,000 tutorial videos, yielding 1.52 million steps that improve both OSWorld-Verified and AgentNetBench after continued pretraining.
Related papers
- Watch and Learn: Learning to Use Computers from Online VideosOctober 6, 2025 · CVPR 2026
- Moving Beyond Sparse Grounding with Complete Screen Parsing SupervisionFebruary 15, 2026 · arXiv
- GUIGuard: Toward a General Framework for Privacy-Preserving GUI AgentsJanuary 26, 2026 · arXiv
- Beyond Clicking: A Step Towards Generalist GUI Grounding via Text DraggingNovember 7, 2025 · arXiv
- Scaling Synthetic Task Generation for Agents via ExplorationSeptember 29, 2025 · ICLR 2026 (Poster)
- Scaling Computer‑Use Grounding via User Interface Decomposition and SynthesisMay 19, 2025 · NeurIPS 2025 Datasets and Benchmarks Track (Spotlight)