ToolTok: Tool Tokenization for Efficient and Generalizable GUI Agents
Xiaoce Wang, Guibin Zhang, Junzhe Li, Jinzhe Tu, Chun Li, Ming Li
- 🏛 Institutions
- Tsinghua, NUS, PKU, Shenzhen MSU-BIT University, Guangming Laboratory
- 📅 Date
- January 30, 2026
- 📑 Publisher
- arXiv
- 💻 Env
- General GUI
- 🔑 Keywords
TLDR
ToolTok treats GUI operations as paths over learnable tool tokens with semantic anchoring and curriculum learning. This makes GUI agents more efficient and generalizable, reaching competitive performance with far less training data than other post-training methods.
Related papers (24)
- CocoaBench: Evaluating Unified Digital Agents in the Wild · April 13, 2026 · arXiv
- Are GUI Agents Focused Enough? Automated Distraction via Semantic-level UI Element Injection · April 9, 2026 · arXiv
- Autonomous Continual Learning of Computer-Use Agents for Environment Adaptation · February 10, 2026 · arXiv
- iSHIFT: Lightweight Slow-Fast GUI Agent with Adaptive Perception · December 26, 2025 · arXiv
- Visual Grounding for User Interfaces · June 16, 2024 · NAACL 2024 Industry Track
- The Amazing Agent Race: Strong Tool Users, Weak Navigators · April 11, 2026 · arXiv
- The Tool Illusion: Rethinking Tool Use in Web Agents · April 3, 2026 · arXiv
- Mobile-Agent-v3.5: Multi-platform Fundamental GUI Agents · February 15, 2026 · arXiv
- Learning with Challenges: Adaptive Difficulty-Aware Data Generation for Mobile GUI Agent Training · January 30, 2026 · arXiv
- How do Visual Attributes Influence Web Agents? A Comprehensive Evaluation of User Interface Design Factors · January 29, 2026 · arXiv
- ReGUIDE: Data Efficient GUI Grounding via Spatial Reasoning and Search · May 21, 2025 · arXiv
- GUI-R1: A Generalist R1-Style Vision-Language Action Model for GUI Agents · April 14, 2025 · arXiv
- Iris: Breaking GUI Complexity with Adaptive Focus and Self-Refining · December 13, 2024 · arXiv
- Dual-View Visual Contextualization for Web Navigation · February 6, 2024 · CVPR 2024 (Poster)
- Naive Visual Memory is Not Enough: A Failure-Mode Study of GUI Agents · June 12, 2026 · arXiv
- Demo2Tutorial: From Human Experience to Multimodal Software Tutorials · June 2, 2026 · arXiv
- STaR-KV: Spatio-Temporal Adaptive Re-weighting for KV Cache Compression in GUI Vision-Language Models · June 1, 2026 · arXiv
- GUI-C²: Coarse-to-Fine GUI Grounding via Difficulty-Aware Reinforcement Learning · May 29, 2026 · arXiv
- MementoGUI: Learning Agentic Multimodal Memory Control for Long-Horizon GUI Agents · May 18, 2026 · arXiv
- Video2GUI: Synthesizing Large-Scale Interaction Trajectories for Generalized GUI Agent Pretraining · May 14, 2026 · arXiv
- Executable Agentic Memory for GUI Agent · May 12, 2026 · arXiv
- LiteGUI: Distilling Compact GUI Agents with Reinforcement Learning · May 8, 2026 · arXiv
- Step-level Optimization for Efficient Computer-use Agents · April 29, 2026 · arXiv
- Training Computer Use Agents to Assess the Usability of Graphical User Interfaces · April 28, 2026 · arXiv