VideoAgentTrek: Computer Use Pretraining from Unlabeled Videos

Dunjie Lu, Yiheng Xu, Junli Wang, Haoyuan Wu, Xinyuan Wang, Zekun Wang, Junlin Yang, Hongjin Su, Jixuan Chen, Junda Chen, Yuchen Mao, Jingren Zhou, Junyang Lin, Binyuan Hui, Tao Yu

🏛 Institutions: Google Cloud AI Research, OSU
📅 Date: October 22, 2025
📑 Publisher: arXiv
💻 Env: General GUI
🔑 Keywords: dataset, pretraining, video mining, inverse dynamics, Video2Action, VideoAgentTrek

TLDR

VideoAgentTrek studies how to pretrain computer-use agents from passive screen recordings instead of manually labeled trajectories. Its Video2Action pipeline recovers action boundaries and structured parameters from 39,000 tutorial videos, yielding 1.52 million steps that improve both OSWorld-Verified and AgentNetBench after continued pretraining.
