GUI Agents Papers

Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding

Kenton Lee, Mandar Joshi, Iulia Turc, Hexiang Hu, Fangyu Liu, Julian Eisenschlos, Urvashi Khandelwal, Peter Shaw, Ming-Wei Chang, Kristina Toutanova

🏛 Institutions
Google
📅 Date
February 1, 2023
📑 Publisher
ICML 2023
TLDR

Pix2Struct pretrains an image-to-text model by reconstructing simplified HTML from masked webpage screenshots, then transfers it across documents, illustrations, user interfaces, and natural images. It is relevant to GUI research because this pretraining improves UI understanding, although the paper addresses visually-situated language understanding broadly rather than GUI agents specifically.
