Demo2Tutorial: From Human Experience to Multimodal Software Tutorials
Zechen Bai, Zhiheng Chen, Yiqi Lin, Kevin Qinghong Lin, Difei Gao, Xiangwu Guo, Xin Wang, Mike Zheng Shou
- 🏛 Institutions
  - Unknown
- 📅 Date
  - June 2, 2026
- 📑 Publisher
  - arXiv
- 💻 Env
  - General GUI
- 🔑 Keywords
TLDR
Demo2Tutorial converts screen recordings and interaction logs into structured multimodal software tutorials with parsed actions, intents, and hierarchical task graphs. The paper evaluates tutorial generation quality and shows that the resulting representations improve downstream GUI-agent planning and generalization.
Related papers (24)
- Executable Agentic Memory for GUI Agent · May 12, 2026 · arXiv
- TreeCUA: Efficiently Scaling GUI Automation with Tree-Structured Verifiable Evolution · February 10, 2026 · arXiv
- SSL: Sweet Spot Learning for Differentiated Guidance in Agentic Optimization · January 30, 2026 · arXiv
- MobileWorldBench: Towards Semantic World Modeling For Mobile Agents · December 16, 2025 · arXiv
- WebATLAS: An LLM Agent with Experience-Driven Memory and Action Simulation · October 26, 2025 · NeurIPS 2025 Workshop on Language Agents and World Models
- Building a Stable Planner: An Extended Finite State Machine Based Planning Module for Mobile GUI Agent · May 20, 2025 · arXiv
- LLM-Powered GUI Agents in Phone Automation: Surveying Progress and Prospects · April 28, 2025 · TMLR 2025
- WebRollback: Enhancing Web Agents with Explicit Rollback Mechanisms · April 16, 2025 · EACL 2026 (Oral)
- LiteWebAgent: The Open-Source Suite for VLM-Based Web-Agent Applications · March 4, 2025 · NAACL 2025 System Demonstrations
- WorkArena++: Towards Compositional Planning and Reasoning-based Common Knowledge Work Tasks · July 7, 2024 · NeurIPS 2024 Datasets and Benchmarks Track (Poster)
- A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis · July 24, 2023 · ICLR 2024 (Oral)
- Naive Visual Memory is Not Enough: A Failure-Mode Study of GUI Agents · June 12, 2026 · arXiv
- STaR-KV: Spatio-Temporal Adaptive Re-weighting for KV Cache Compression in GUI Vision-Language Models · June 1, 2026 · arXiv
- GUI-C²: Coarse-to-Fine GUI Grounding via Difficulty-Aware Reinforcement Learning · May 29, 2026 · arXiv
- MementoGUI: Learning Agentic Multimodal Memory Control for Long-Horizon GUI Agents · May 18, 2026 · arXiv
- Video2GUI: Synthesizing Large-Scale Interaction Trajectories for Generalized GUI Agent Pretraining · May 14, 2026 · arXiv
- LiteGUI: Distilling Compact GUI Agents with Reinforcement Learning · May 8, 2026 · arXiv
- Step-level Optimization for Efficient Computer-use Agents · April 29, 2026 · arXiv
- Training Computer Use Agents to Assess the Usability of Graphical User Interfaces · April 28, 2026 · arXiv
- AutoGUI-v2: A Comprehensive Multi-Modal GUI Functionality Understanding Benchmark · April 27, 2026 · arXiv
- Human-Guided Harm Recovery for Computer Use Agents · April 20, 2026 · arXiv
- UI-Zoomer: Uncertainty-Driven Adaptive Zoom-In for GUI Grounding · April 15, 2026 · arXiv
- GUI-Perturbed: Domain Randomization Reveals Systematic Brittleness in GUI Grounding Models · April 15, 2026 · arXiv
- See, Point, Refine: Multi-Turn Approach to GUI Grounding with Visual Feedback · April 14, 2026 · arXiv