VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding?
Junpeng Liu, Yifan Song, Bill Yuchen Lin, Wai Lam, Graham Neubig, Yuanzhi Li, Xiang Yue
- 🏛 Institutions
- CMU, CUHK, PKU, MBZUAI, Allen Institute for AI
- 📅 Date
- April 9, 2024
- 📑 Publisher
- COLM 2024
- 💻 Env
- Web
- 🔑 Keywords
- Multimodal LLMs, Web Page Understanding, Grounding, Benchmark
TLDR
VisualWebBench is a web-page understanding benchmark of 1.5K human-curated instances drawn from 139 real websites, covering seven fine-grained tasks spanning OCR, understanding/QA, and grounding. Evaluating current multimodal LLMs on it, the paper shows that they still struggle with text-rich pages, performing especially poorly on grounding tasks and when given low-resolution screenshots.
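To make the task format concrete, here is a minimal sketch of how a multiple-choice grounding instance might be scored. The instance fields (`screenshot_path`, `candidates`, `answer`) and the evaluation loop are illustrative assumptions, not the released VisualWebBench schema or official evaluation code.

```python
# Minimal sketch of scoring a multiple-choice grounding task.
# Field names and the scoring convention here are hypothetical,
# used only to illustrate the evaluation idea.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class GroundingInstance:
    screenshot_path: str    # rendered web-page screenshot
    instruction: str        # e.g. "Click the search button"
    candidates: List[str]   # choice labels, e.g. ["A", ..., "H"]
    answer: str             # label of the correct bounding box


def evaluate(instances: List[GroundingInstance],
             model: Callable[[str, str, List[str]], str]) -> float:
    """Accuracy of a model mapping (screenshot, instruction, choices) -> label."""
    if not instances:
        return 0.0
    correct = 0
    for ex in instances:
        pred = model(ex.screenshot_path, ex.instruction, ex.candidates)
        correct += int(pred.strip().upper() == ex.answer.upper())
    return correct / len(instances)


if __name__ == "__main__":
    # Toy example with a trivially correct "model".
    data = [GroundingInstance("page.png", "Click the login button",
                              ["A", "B", "C", "D"], "C")]
    print(evaluate(data, lambda img, instr, cands: "C"))  # -> 1.0
```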
Related papers
- Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs · April 8, 2024 · ECCV 2024 (Poster)
- Odysseys: Benchmarking Web Agents on Realistic Long Horizon Tasks · April 27, 2026 · arXiv
- WebForge: Breaking the Realism-Reproducibility-Scalability Trilemma in Browser Agent Benchmark · April 13, 2026 · arXiv
- The Amazing Agent Race: Strong Tool Users, Weak Navigators · April 11, 2026 · arXiv
- ClawBench: Can AI Agents Complete Everyday Online Tasks? · April 9, 2026 · arXiv
- GameWorld: Towards Standardized and Verifiable Evaluation of Multimodal Game Agents · April 8, 2026 · arXiv