GUI Agents Papers

GUI Action Narrator: Where and When Did That Action Take Place?

Qinchen Wu, Difei Gao, Kevin Qinghong Lin, Zhuoyu Wu, Xiangwu Guo, Peiran Li, Weichen Zhang, Hengxu Wang, Mike Zheng Shou

🏛 Institutions
Show Lab, NUS, CAS, Shenzhen
📅 Date
June 19, 2024
📑 Publisher
arXiv
💻 Env
Desktop, Web
🔑 Keywords
TLDR

GUI Action Narrator introduces Act2Cap, a benchmark and dataset of 4,189 GUI action video-captioning samples covering actions such as clicks, drags, and typing across desktop software and web tools. It also proposes GUI Narrator, which uses the cursor as a visual prompt plus temporal and spatial sampling to caption those actions more accurately than off-the-shelf multimodal models.
