Harnessing Webpage UIs for Text-Rich Visual Understanding

Junpeng Liu, Tianyue Ou, Yifan Song, Yuxiao Qu, Wai Lam, Chenyan Xiong, Wenhu Chen, Graham Neubig, Xiang Yue

🏛 Institutions: CMU, CUHK, PKU, University of Waterloo
📅 Date: October 17, 2024
📑 Publisher: ICLR 2025 (Poster)
💻 Env: Web
🔑 Keywords: dataset, instruction synthesis, text-rich visual understanding, web, accessibility tree, MultiUI

TLDR

This paper builds MultiUI, a 7.3M-sample dataset synthesized from 1M websites by pairing webpage screenshots with instructions generated from cleaned accessibility trees. Training on MultiUI improves web UI understanding and also transfers to broader text-rich visual tasks such as OCR, document understanding, and chart interpretation.
