ToolTok: Tool Tokenization for Efficient and Generalizable GUI Agents

Xiaoce Wang, Guibin Zhang, Junzhe Li, Jinzhe Tu, Chun Li, Ming Li

🏛 Institutions: Tsinghua, NUS, PKU, Shenzhen MSU-BIT University, Guangming Laboratory
📅 Date: January 30, 2026
📑 Publisher: arXiv
💻 Env: General GUI
🔑 Keywords: tool use, tokenization, curriculum learning, visual grounding, data efficiency, ToolTok

TLDR

ToolTok treats GUI operations as paths over learnable tool tokens with semantic anchoring and curriculum learning. This makes GUI agents more efficient and generalizable, reaching competitive performance with far less training data than other post-training methods.

Related papers (24)

CocoaBench: Evaluating Unified Digital Agents in the Wild
April 13, 2026 · arXiv

Are GUI Agents Focused Enough? Automated Distraction via Semantic-level UI Element Injection
April 9, 2026 · arXiv

Autonomous Continual Learning of Computer-Use Agents for Environment Adaptation
February 10, 2026 · arXiv

iSHIFT: Lightweight Slow-Fast GUI Agent with Adaptive Perception
December 26, 2025 · arXiv

Visual Grounding for User Interfaces
June 16, 2024 · NAACL 2024 Industry Track

The Amazing Agent Race: Strong Tool Users, Weak Navigators
April 11, 2026 · arXiv

The Tool Illusion: Rethinking Tool Use in Web Agents
April 3, 2026 · arXiv

Mobile-Agent-v3.5: Multi-platform Fundamental GUI Agents
February 15, 2026 · arXiv

Learning with Challenges: Adaptive Difficulty-Aware Data Generation for Mobile GUI Agent Training
January 30, 2026 · arXiv

How do Visual Attributes Influence Web Agents? A Comprehensive Evaluation of User Interface Design Factors
January 29, 2026 · arXiv

ReGUIDE: Data Efficient GUI Grounding via Spatial Reasoning and Search
May 21, 2025 · arXiv

GUI-R1: A Generalist R1-Style Vision-Language Action Model for GUI Agents
April 14, 2025 · arXiv

Iris: Breaking GUI Complexity with Adaptive Focus and Self-Refining
December 13, 2024 · arXiv

Dual-View Visual Contextualization for Web Navigation
February 6, 2024 · CVPR 2024 (Poster)

Naive Visual Memory is Not Enough: A Failure-Mode Study of GUI Agents
June 12, 2026 · arXiv

Demo2Tutorial: From Human Experience to Multimodal Software Tutorials
June 2, 2026 · arXiv

STaR-KV: Spatio-Temporal Adaptive Re-weighting for KV Cache Compression in GUI Vision-Language Models
June 1, 2026 · arXiv

GUI-C²: Coarse-to-Fine GUI Grounding via Difficulty-Aware Reinforcement Learning
May 29, 2026 · arXiv

MementoGUI: Learning Agentic Multimodal Memory Control for Long-Horizon GUI Agents
May 18, 2026 · arXiv

Video2GUI: Synthesizing Large-Scale Interaction Trajectories for Generalized GUI Agent Pretraining
May 14, 2026 · arXiv

Executable Agentic Memory for GUI Agent
May 12, 2026 · arXiv

LiteGUI: Distilling Compact GUI Agents with Reinforcement Learning
May 8, 2026 · arXiv

Step-level Optimization for Efficient Computer-use Agents
April 29, 2026 · arXiv

Training Computer Use Agents to Assess the Usability of Graphical User Interfaces
April 28, 2026 · arXiv