GUI Agents Papers

ScreenAI: A Vision-Language Model for UI and Infographics Understanding

Gilles Baechler, Srinivas Sunkara, Maria Wang, Fedir Zubach, Hassan Mansoor, Vincent Etter, Victor Cărbune, Jason Lin, Jindong Chen, Abhanshu Sharma

🏛 Institutions
Google Research
📅 Date
February 7, 2024
📑 Publisher
IJCAI 2024
💻 Env
General GUI
🔑 Keywords
TLDR

ScreenAI is a vision-language model for UI and infographics understanding that combines the PaLI architecture with the flexible patching strategy of Pix2Struct. It introduces a screen-annotation task, uses it to generate large-scale UI training data at scale, and releases three datasets: one for screen annotation and two for screen question answering.
