GPT-4V(ision) is a Generalist Web Agent, if Grounded

Boyuan Zheng, Boyu Gou, Jihyung Kil, Huan Sun, Yu Su

🏛 Institutions: OSU
📅 Date: January 3, 2024
📑 Publisher: ICML 2024
💻 Env: Web
🔑 Keywords: framework, grounding, SeeAct, live website evaluation, Mind2Web

TLDR

This paper introduces SeeAct, a framework for using GPT-4V as a generalist web agent, and adds an online evaluation setup for running agents on live websites. It shows that GPT-4V performs strongly when its textual plans are grounded into website actions manually, and identifies automatic grounding as the main remaining bottleneck.


Related papers

Why Do LLM-based Web Agents Fail? A Hierarchical Planning Perspective
March 15, 2026 · arXiv

Enhancing Web Agents with a Hierarchical Memory Tree
March 7, 2026 · arXiv

OpeFlo: Automated UX Evaluation via Simulated Human Web Interaction with GUI Grounding
February 25, 2026 · arXiv

ColorBrowserAgent: Complex Long-Horizon Browser Agent with Adaptive Knowledge Evolution
January 12, 2026 · arXiv

WebATLAS: An LLM Agent with Experience-Driven Memory and Action Simulation
October 26, 2025 · NeurIPS 2025 Workshop on Language Agents and World Models

Surfer 2: The Next Generation of Cross-Platform Computer Use Agents
October 22, 2025 · arXiv