VideoAgentTrek: Computer Use Pretraining from Unlabeled Videos
Dunjie Lu, Yiheng Xu, Junli Wang, Haoyuan Wu, Xinyuan Wang, Zekun Wang, Junlin Yang, Hongjin Su, Jixuan Chen, Junda Chen, Yuchen Mao, Jingren Zhou, Junyang Lin, Binyuan Hui, Tao Yu
- 🏛 Institutions
- University of Hong Kong, Qwen Team (Alibaba Group)
- 📅 Date
- October 22, 2025
- 📑 Publisher
- arXiv
- 💻 Env
- General GUI
- 🔑 Keywords
TLDR
VideoAgentTrek studies how to pretrain computer-use agents from passive screen recordings instead of manually labeled trajectories. Its Video2Action pipeline recovers action boundaries and structured action parameters from 39,000 tutorial videos, yielding 1.52 million interaction steps. Continued pretraining on this data improves performance on both OSWorld-Verified and AgentNetBench.
Related papers (24)
- Video2GUI: Synthesizing Large-Scale Interaction Trajectories for Generalized GUI Agent Pretraining · May 14, 2026 · arXiv
- Watch and Learn: Learning to Use Computers from Online Videos · October 6, 2025 · CVPR 2026
- Moving Beyond Sparse Grounding with Complete Screen Parsing Supervision · February 15, 2026 · arXiv
- GUIGuard: Toward a General Framework for Privacy-Preserving GUI Agents · January 26, 2026 · arXiv
- Beyond Clicking: A Step Towards Generalist GUI Grounding via Text Dragging · November 7, 2025 · arXiv
- Scaling Synthetic Task Generation for Agents via Exploration · September 29, 2025 · ICLR 2026 (Poster)
- Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis · May 19, 2025 · NeurIPS 2025 Datasets and Benchmarks Track (Spotlight)
- TongUI: Internet-Scale Trajectories from Multimodal Web Tutorials for Generalized GUI Agents · April 17, 2025 · AAAI 2026
- UI-E2I-Synth: Advancing GUI Grounding with Large-Scale Instruction Synthesis · April 15, 2025 · Findings of ACL 2025
- OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis · December 27, 2024 · ACL 2025
- Falcon-UI: Understanding GUI Before Following User Instructions · December 12, 2024 · arXiv
- Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction · December 5, 2024 · ICML 2025 (Poster)
- EDGE: Enhanced Grounded GUI Understanding with Enriched Multi-Granularity Synthetic Data · October 25, 2024 · arXiv
- OmniParser for Pure Vision Based GUI Agent · August 1, 2024 · arXiv
- Read Anywhere Pointed: Layout-aware GUI Screen Reading with Tree-of-Lens Grounding · June 27, 2024 · EMNLP 2024 (Poster)
- VGA: Vision GUI Assistant - Minimizing Hallucinations through Image-Centric Fine-Tuning · June 20, 2024 · Findings of EMNLP 2024
- GUICourse: From General Vision Language Model to Versatile GUI Agent · June 17, 2024 · ACL 2025
- ScreenAI: A Vision-Language Model for UI and Infographics Understanding · February 7, 2024 · IJCAI 2024
- SheetCopilot: Bringing Software Productivity to the Next Level through Large Language Models · May 30, 2023 · NeurIPS 2023
- WebForge: Breaking the Realism-Reproducibility-Scalability Trilemma in Browser Agent Benchmark · April 13, 2026 · arXiv
- MolmoWeb: Open Visual Web Agent and Open Data for the Open Web · April 9, 2026 · arXiv
- Gym-Anything: Turn any Software into an Agent Environment · April 7, 2026 · arXiv
- WebArena-Infinity: Generating Browser Environments with Verifiable Tasks at Scale · March 2026 · Blog Post
- PSPA-Bench: A Personalized Benchmark for Smartphone GUI Agent · March 31, 2026 · arXiv