Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding
Kenton Lee, Mandar Joshi, Iulia Turc, Hexiang Hu, Fangyu Liu, Julian Eisenschlos, Urvashi Khandelwal, Peter Shaw, Ming-Wei Chang, Kristina Toutanova
- 🏛 Institutions
- Google Research
- 📅 Date
- February 1, 2023
- 📑 Publisher
- ICML 2023
TLDR
Pix2Struct pretrains an image-to-text model to reconstruct simplified HTML from masked webpage screenshots, then fine-tunes it across four domains: documents, illustrations, user interfaces, and natural images. It is relevant to GUI research because the pretraining transfers well to UI understanding tasks, though the paper's scope is broader than GUI-agent work alone.
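The pretraining objective pairs a (partially masked) webpage screenshot with a simplified serialization of the page's HTML that the model must emit as text. The paper defines its own simplification (keeping visible text, alt text, and structure); the sketch below is only a rough illustration of the idea, using Python's stdlib parser to reduce HTML to a tag-and-text skeleton. The function name and output format are ours, not the paper's.

```python
from html.parser import HTMLParser


class SimplifyHTML(HTMLParser):
    """Reduce HTML to a skeleton of opening tags and visible text.

    Illustrative only: Pix2Struct's actual simplified-HTML target format
    differs in its details (e.g. handling of alt text and masking).
    """

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Keep the tag name, drop all attributes (class, style, ids, ...).
        self.parts.append(f"<{tag}")

    def handle_endtag(self, tag):
        # Close the bracket opened by the matching start tag.
        self.parts.append(">")

    def handle_data(self, data):
        # Keep only non-whitespace visible text.
        text = data.strip()
        if text:
            self.parts.append(text)


def simplify(html: str) -> str:
    """Serialize an HTML string into a flat tag-and-text skeleton."""
    parser = SimplifyHTML()
    parser.feed(html)
    return " ".join(parser.parts)


print(simplify("<div class='x'><p>Hello <b>world</b></p></div>"))
# → <div <p Hello <b world > > >
```

During pretraining, the model sees only the screenshot pixels and must generate a string like this output, so attributes that are invisible in the rendering are stripped from the target.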
Related papers
- UIBert: Learning Generic Multimodal Representations for UI Understanding · July 29, 2021 · IJCAI 2021
- Training Computer Use Agents to Assess the Usability of Graphical User Interfaces · April 28, 2026 · arXiv
- ClawGUI: A Unified Framework for Training, Evaluating, and Deploying GUI Agents · April 13, 2026 · arXiv
- MolmoWeb: Open Visual Web Agent and Open Data for the Open Web · April 9, 2026 · arXiv
- IntentScore: Intent-Conditioned Action Evaluation for Computer-Use Agents · April 6, 2026 · arXiv
- SecAgent: Efficient Mobile GUI Agent with Semantic Context · March 9, 2026 · arXiv