TongUI: Internet-Scale Trajectories from Multimodal Web Tutorials for Generalized GUI Agents
Bofei Zhang, Zirui Shang, Zhi Gao, Wang Zhang, Rui Xie, Xiaojian Ma, Tao Yuan, Xinxiao Wu, Song-Chun Zhu, Qing Li
- 🏛 Institutions
- State Key Laboratory of General Artificial Intelligence, BIGAI, Beijing Institute of Technology, PKU, SJTU, Tsinghua
- 📅 Date
- April 17, 2025
- 📑 Publisher
- AAAI 2026
- 💻 Env
- General GUI
- 🔑 Keywords
TLDR
TongUI turns multimodal web tutorials into large-scale GUI-agent training trajectories by crawling and processing tutorial videos and articles. The resulting GUI-Net dataset spans 143K trajectories across five operating systems and more than 200 applications, and fine-tuning on it improves generalized GUI-agent performance.
Related papers
- Moving Beyond Sparse Grounding with Complete Screen Parsing SupervisionFebruary 15, 2026 · arXiv
- TreeCUA: Efficiently Scaling GUI Automation with Tree-Structured Verifiable EvolutionFebruary 10, 2026 · arXiv
- GUIGuard: Toward a General Framework for Privacy-Preserving GUI AgentsJanuary 26, 2026 · arXiv
- Beyond Clicking: A Step Towards Generalist GUI Grounding via Text DraggingNovember 7, 2025 · arXiv
- VideoAgentTrek: Computer Use Pretraining from Unlabeled VideosOctober 22, 2025 · arXiv
- Scaling Synthetic Task Generation for Agents via ExplorationSeptember 29, 2025 · ICLR 2026 (Poster)