ScreenAI: A Vision-Language Model for UI and Infographics Understanding

Gilles Baechler, Srinivas Sunkara, Maria Wang, Fedir Zubach, Hassan Mansoor, Vincent Etter, Victor Cărbune, Jason Lin, Jindong Chen, Abhanshu Sharma

🏛 Institutions: Google Research
📅 Date: February 7, 2024
📑 Publisher: IJCAI 2024
💻 Env: General GUI
🔑 Keywords: model, dataset, screen annotation, UI understanding, ScreenAI

TLDR

ScreenAI is a vision-language model for UI and infographics understanding that combines a PaLI-style architecture with Pix2Struct-style flexible patching. It introduces a screen-annotation task, uses it to generate large-scale UI training data, and releases three datasets for screen annotation and screen question answering.
