Read Anywhere Pointed: Layout-aware GUI Screen Reading with Tree-of-Lens Grounding
Yue Fan, Lei Ding, Ching-Chen Kuo, Shan Jiang, Yang Zhao, Xinze Guan, Jie Yang, Yi Zhang, Xin Eric Wang
- 🏛 Institutions
- UC Santa Cruz, eBay Inc., Cybever
- 📅 Date
- June 27, 2024
- 📑 Publisher
- EMNLP 2024 (Poster)
- 💻 Env
- General GUI
- 🔑 Keywords
TLDR
This paper introduces the Screen Point-and-Read task, in which a model must describe the screen region indicated by a user-specified point on a GUI screenshot, and proposes the Tree-of-Lens agent to solve it. It also releases the ScreenPR benchmark, spanning mobile, web, and operating-system GUIs, along with the ASHL dataset for hierarchical screen-region detection.
Related papers
- GUIGuard: Toward a General Framework for Privacy-Preserving GUI Agents · January 26, 2026 · arXiv
- Beyond Clicking: A Step Towards Generalist GUI Grounding via Text Dragging · November 7, 2025 · arXiv
- Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis · May 19, 2025 · NeurIPS 2025 Datasets and Benchmarks Track (Spotlight)
- UI-E2I-Synth: Advancing GUI Grounding with Large-Scale Instruction Synthesis · April 15, 2025 · Findings of ACL 2025
- SheetCopilot: Bringing Software Productivity to the Next Level through Large Language Models · May 30, 2023 · NeurIPS 2023
- WebForge: Breaking the Realism-Reproducibility-Scalability Trilemma in Browser Agent Benchmark · April 13, 2026 · arXiv