UI-TARS: Pioneering Automated GUI Interaction with Native Agents

Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, Wanjun Zhong, Kuanye Li, Jiale Yang, Yu Miao, Woyu Lin, Longxiang Liu, Xu Jiang, Qianli Ma, Jingyu Li, Xiaojun Xiao, Kai Cai, Chuang Li, Yaowei Zheng, Chaolin Jin, Chen Li, Xiao Zhou, Minchao Wang, Haoli Chen, Zhaojian Li, Haihua Yang, Haifeng Liu, Feng Lin, Tao Peng, Xin Liu, Guang Shi

🏛 Institutions: ByteDance Seed, Tsinghua
📅 Date: January 21, 2025
📑 Publisher: arXiv
💻 Env: Desktop, Mobile, Web
🔑 Keywords: model, GUI grounding, unified action modeling, system-2 reasoning, reflective online traces, UI-TARS

TLDR

UI-TARS is an end-to-end GUI agent model that acts directly from screenshots instead of relying on wrapper-style prompting workflows built around proprietary models. It combines enhanced perception, unified cross-platform action modeling, deliberate multi-step reasoning, and iterative training on reflective online traces, and reports strong performance across more than ten GUI benchmarks.
