GUI Agents Papers

VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding?

Junpeng Liu, Yifan Song, Bill Yuchen Lin, Wai Lam, Graham Neubig, Yuanzhi Li, Xiang Yue

🏛 Institutions
CMU, CUHK, PKU, MBZUAI, Allen Institute for AI
📅 Date
April 9, 2024
📑 Publisher
COLM 2024
💻 Env
Web
🔑 Keywords
TLDR

VisualWebBench is a web-page understanding benchmark with 1.5K human-curated instances from 139 real websites, covering seven fine-grained tasks that span OCR, understanding, and grounding. The paper uses it to show that current multimodal models still struggle on text-rich pages, especially on grounding tasks and with low-resolution inputs.
