TongUI: Internet-Scale Trajectories from Multimodal Web Tutorials for Generalized GUI Agents
Bofei Zhang , Zirui Shang , Zhi Gao , Wang Zhang , Rui Xie , Xiaojian Ma , Tao Yuan , Xinxiao Wu , Song-Chun Zhu , Qing Li
- 🏛 Institutions
- State Key Laboratory of General Artificial Intelligence , BIGAI , Beijing Institute of Technology , PKU , SJTU , Tsinghua
- 📅 Date
- April 17, 2025
- 📑 Publisher
- AAAI 2026
- 💻 Env
- General GUI
- 🔑 Keywords
TLDR
TongUI turns multimodal web tutorials into large-scale GUI-agent training trajectories by crawling and processing tutorial videos and articles. The resulting GUI-Net dataset spans 143K trajectories across five operating systems and more than 200 applications, and fine-tuning on it improves generalized GUI-agent performance.
Related papers (24)
- Video2GUI: Synthesizing Large-Scale Interaction Trajectories for Generalized GUI Agent PretrainingMay 14, 2026 · arXiv
- Moving Beyond Sparse Grounding with Complete Screen Parsing SupervisionFebruary 15, 2026 · arXiv
- TreeCUA: Efficiently Scaling GUI Automation with Tree-Structured Verifiable EvolutionFebruary 10, 2026 · arXiv
- GUIGuard: Toward a General Framework for Privacy-Preserving GUI AgentsJanuary 26, 2026 · arXiv
- Beyond Clicking: A Step Towards Generalist GUI Grounding via Text DraggingNovember 7, 2025 · arXiv
- VideoAgentTrek: Computer Use Pretraining from Unlabeled VideosOctober 22, 2025 · arXiv
- Scaling Synthetic Task Generation for Agents via ExplorationSeptember 29, 2025 · ICLR 2026 (Poster)
- Scaling Computer‑Use Grounding via User Interface Decomposition and SynthesisMay 19, 2025 · NeurIPS 2025 Datasets and Benchmarks Track (Spotlight)
- UI-E2I-Synth: Advancing GUI Grounding with Large-Scale Instruction SynthesisApril 15, 2025 · Findings of ACL 2025
- OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task SynthesisDecember 27, 2024 · ACL 2025
- Falcon-UI: Understanding GUI Before Following User InstructionsDecember 12, 2024 · arXiv
- Aguvis: Unified Pure Vision Agents for Autonomous GUI InteractionDecember 5, 2024 · ICML 2025 (Poster)
- EDGE: Enhanced Grounded GUI Understanding with Enriched Multi-Granularity Synthetic DataOctober 25, 2024 · arXiv
- OmniParser for Pure Vision Based GUI AgentAugust 1, 2024 · arXiv
- Read Anywhere Pointed: Layout-aware GUI Screen Reading with Tree-of-Lens GroundingJune 27, 2024 · EMNLP 2024 (Poster)
- VGA: Vision GUI Assistant - Minimizing Hallucinations through Image-Centric Fine-TuningJune 20, 2024 · Findings of EMNLP 2024
- GUICourse: From General Vision Language Model to Versatile GUI AgentJune 17, 2024 · ACL 2025
- ScreenAI: A Vision-Language Model for UI and Infographics UnderstandingFebruary 7, 2024 · IJCAI 2024
- SheetCopilot: Bringing Software Productivity to the Next Level through Large Language ModelsMay 30, 2023 · NeurIPS 2023
- WebForge: Breaking the Realism-Reproducibility-Scalability Trilemma in Browser Agent BenchmarkApril 13, 2026 · arXiv
- MolmoWeb: Open Visual Web Agent and Open Data for the Open WebApril 9, 2026 · arXiv
- Gym-Anything: Turn any Software into an Agent EnvironmentApril 7, 2026 · arXiv
- WebArena-Infinity: Generating Browser Environments with Verifiable Tasks at ScaleMarch 2026 · Blog Post
- PSPA-Bench: A Personalized Benchmark for Smartphone GUI AgentMarch 31, 2026 · arXiv