SeeAct

GPT-4V(ision) is a Generalist Web Agent, if Grounded

The Ohio State University

SeeAct is a generalist web agent based on large multimodal models (LMMs) like GPT-4V. Specifically, given a task on any website (e.g., “Compare iPhone 15 Pro Max with iPhone 13 Pro Max” on the Apple homepage), the agent first performs Action Generation to produce a textual description of the action at each step towards completing the task (e.g., “Navigate to the iPhone category”), and then performs Action Grounding to identify the corresponding HTML element (e.g., “[button] iPhone”) and operation (e.g., CLICK, TYPE, or SELECT) on the webpage.

With online evaluation on the Mind2Web dataset, SeeAct can successfully complete up to 50% of tasks on live websites given an oracle action grounding method. It also exhibits remarkable capabilities, including speculative planning, webpage content reasoning, and self-correction of mistakes.

Authentic video of the SeeAct agent working on live websites (sped-up version).

SeeAct Method


SeeAct first performs Action Generation by leveraging an LMM, like GPT-4V, to visually perceive websites and generate plans in textual form. We explicitly instruct GPT-4V to imitate a human browsing a webpage and to analyze the task, the webpage, and the previous actions. It is then asked to generate an action description based on its analysis and reasoning. Action Grounding is the next step: it maps the textual plan to the HTML elements and operations needed to act on the website.
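The two-stage flow described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `GroundedAction`, `seeact_step`, and the `generate` and `ground` callables are hypothetical names, and the calls to the LMM are left to the caller.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class GroundedAction:
    """Hypothetical container for one executable action."""
    element: str                 # e.g. "[button] iPhone"
    operation: str               # "CLICK", "TYPE", or "SELECT"
    value: Optional[str] = None  # text to type or option to select, if any


def seeact_step(
    task: str,
    screenshot: bytes,
    previous_actions: List[str],
    generate: Callable[[str, bytes, List[str]], str],
    ground: Callable[[str, bytes], GroundedAction],
) -> Tuple[str, GroundedAction]:
    """Run one step: Action Generation, then Action Grounding."""
    # Stage 1: the LMM describes the next action in natural language.
    description = generate(task, screenshot, previous_actions)
    # Stage 2: the description is mapped to an element and an operation.
    action = ground(description, screenshot)
    return description, action
```

Keeping the two stages behind separate callables makes it easy to swap grounding strategies while holding action generation fixed.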


Action Grounding

Despite the capability of LMMs in identifying and describing the next action to complete the given task in natural language, it is still challenging to convert the action description into an executable action within the environment. To address the challenge of action grounding, we explore three approaches using different types of information: Grounding via Element Attributes, Grounding via Textual Choices, and Grounding via Image Annotation.


    This example demonstrates the three element grounding approaches for a single action taken while completing the given task. In this action step, the model needs to convert the textual description into the action of clicking the "Find Your Truck" button to perform a search.
  • Textual Choices: Several element candidates, represented as HTML text, are given; the model is required to generate the choice index of the target element.
  • Image Annotation: Bounding boxes and index labels are added to the image; the model is required to generate the label at the bottom-left of the target element.
  • Element Attributes: The model needs to predict the text and type of the target element.
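As a rough illustration of grounding via textual choices, the sketch below formats candidate elements as lettered options and parses the model's reply back into a candidate index. The letter labels and the `ELEMENT: <letter>` answer format are assumptions made for this example, and `format_choices` and `parse_choice` are hypothetical helper names.

```python
import re
import string
from typing import List, Optional


def format_choices(candidates: List[str]) -> str:
    """Render HTML candidates as lettered options (A, B, C, ...)."""
    if len(candidates) > len(string.ascii_uppercase):
        raise ValueError("this sketch supports at most 26 candidates")
    return "\n".join(
        f"{string.ascii_uppercase[i]}. {html}" for i, html in enumerate(candidates)
    )


def parse_choice(model_output: str, num_candidates: int) -> Optional[int]:
    """Return the chosen candidate index, or None if no valid choice is found."""
    # Assumed answer format: a line such as "ELEMENT: B".
    match = re.search(r"ELEMENT:\s*([A-Z])\b", model_output)
    if match is None:
        return None
    index = string.ascii_uppercase.index(match.group(1))
    return index if index < num_candidates else None
```

Returning `None` for a missing or out-of-range letter lets the caller treat an unparseable reply as a grounding failure instead of acting on a wrong element.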

Experiments and Results

We compare SeeAct with other models following MindAct's two-stage strategy. We evaluate supervised fine-tuning (SFT) methods using FLAN-T5 and BLIP-2-T5, and in-context learning (ICL) methods using GPT-3.5 and GPT-4.

    We observe the following results in the experiments:
  • (1) SeeAct with GPT-4V is a strong generalist web agent if oracle grounding is provided, substantially outperforming existing methods such as GPT-4 or FLAN-T5.
  • (2) Grounding is still a major challenge. The best grounding strategy still has a 20-25% gap with oracle grounding.
  • (3) In-context learning with large models (both LMMs and LLMs) shows better generalization to unseen websites, while supervised fine-tuning still has an edge on websites seen during training.

Online Evaluation on Live Websites

We develop a new online evaluation tool using Playwright to evaluate web agents on live websites. Our tool can convert the predicted action into a browser event and execute it on the website. To adhere to ethical standards, our experiments are restricted to non-login tasks in compliance with user agreements, and we closely monitor agent activities during online evaluation to prevent any actions that have potentially harmful impacts.
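Converting a predicted action into a browser event could look roughly like the sketch below. It is not the authors' tool: `execute_action` is a hypothetical helper, and `page` is assumed to be a Playwright `Page` from the synchronous Python API (or any object exposing a compatible `locator` method). The selector is assumed to have been produced by the grounding step.

```python
from typing import Any, Optional


def execute_action(
    page: Any, selector: str, operation: str, value: Optional[str] = None
) -> None:
    """Dispatch a predicted (element, operation, value) triple to the browser."""
    target = page.locator(selector)  # Playwright Locator for the grounded element
    op = operation.upper()
    if op == "CLICK":
        target.click()
    elif op == "TYPE":
        if value is None:
            raise ValueError("TYPE requires a value")
        target.fill(value)           # replaces the field's content with `value`
    elif op == "SELECT":
        if value is None:
            raise ValueError("SELECT requires a value")
        target.select_option(value)  # selects the matching <option>
    else:
        raise ValueError(f"unsupported operation: {operation}")
```

Rejecting unknown operations up front keeps the agent from silently doing nothing when the model emits an operation outside CLICK, TYPE, and SELECT.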


SeeAct can successfully complete 50% of tasks on different websites if provided an oracle grounding method. We further investigate the performance of web agents on tasks across different difficulty levels. We estimate the task difficulty based on the number of actions taken by annotators during action trace annotation.

Whole task success rate across task difficulty levels. Easy: 2-4, Medium: 5-7, and Hard: 8-12.
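The difficulty buckets in the caption above can be expressed as a small helper. This is only a sketch: the function names and the `(num_actions, succeeded)` input format are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def difficulty_level(num_actions: int) -> str:
    """Map the number of annotated actions to a difficulty bucket."""
    if 2 <= num_actions <= 4:
        return "Easy"
    if 5 <= num_actions <= 7:
        return "Medium"
    if 8 <= num_actions <= 12:
        return "Hard"
    raise ValueError(f"no bucket defined for {num_actions} actions")


def success_rate_by_difficulty(tasks: Iterable[Tuple[int, bool]]) -> Dict[str, float]:
    """Whole task success rate (%) per bucket from (num_actions, succeeded) pairs."""
    totals: Dict[str, int] = defaultdict(int)
    successes: Dict[str, int] = defaultdict(int)
    for num_actions, succeeded in tasks:
        level = difficulty_level(num_actions)
        totals[level] += 1
        successes[level] += int(succeeded)
    return {level: 100.0 * successes[level] / totals[level] for level in totals}
```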
Whole task success rate (%) under both offline and online evaluation. Offline0 and Offline1 refer to no tolerance for error at any step and allowing for error at one step, respectively.
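The Offline0 and Offline1 settings differ only in how many incorrect steps a task may contain and still count as a success. A minimal sketch, assuming each task is represented as a list of per-step correctness flags (the function names are hypothetical):

```python
from typing import List, Sequence


def whole_task_success(step_correct: Sequence[bool], tolerance: int = 0) -> bool:
    """A task succeeds if it has at most `tolerance` incorrect steps."""
    errors = sum(1 for ok in step_correct if not ok)
    return errors <= tolerance


def success_rate(tasks: List[Sequence[bool]], tolerance: int = 0) -> float:
    """Whole task success rate (%) over a list of tasks."""
    if not tasks:
        return 0.0
    passed = sum(whole_task_success(steps, tolerance) for steps in tasks)
    return 100.0 * passed / len(tasks)
```

With `tolerance=0` this corresponds to Offline0, and with `tolerance=1` to Offline1.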

Case Study

Promising Capabilities

GPT-4V exhibits promising capabilities, including speculative planning, webpage content reasoning, and error correction. It also surpasses the limitations of superficial textual similarity matching inherent in fine-tuned, text-only models.

Error Cases in Grounding via Image Annotation

    To analyze the reasons behind the failures, we randomly sample 100 predicted actions and observe the following major types of errors:
  • (1) Wrong action generation
  • (2) Making up bounding box & label
  • (3) Failure to link bounding boxes with the correct labels


        title={GPT-4V(ision) is a Generalist Web Agent, if Grounded},
        author={Boyuan Zheng and Boyu Gou and Jihyung Kil and Huan Sun and Yu Su},
        journal={arXiv preprint arXiv:2401.01614},

        title={Mind2Web: Towards a Generalist Agent for the Web},
        author={Xiang Deng and Yu Gu and Boyuan Zheng and Shijie Chen and Samuel Stevens and Boshi Wang and Huan Sun and Yu Su},
        booktitle={Thirty-seventh Conference on Neural Information Processing Systems},