VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks

Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Chong Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Ruslan Salakhutdinov, Daniel Fried

🏛 Institutions: CMU
📅 Date: January 24, 2024
📑 Publisher: ACL 2024
💻 Env: Web
🔑 Keywords: benchmark, VisualWebArena, visually grounded web tasks, self-hosted environments, multimodal agent evaluation

TLDR

VisualWebArena is a benchmark of 910 visually grounded web tasks across Classifieds, Shopping, and Reddit environments. Built on WebArena's self-hosted setup, it targets multimodal web agents that must use image-text inputs rather than text alone.
