SeeAct

SeeAct is a generalist web agent based on large multimodal models (LMMs) like GPT-4V. Specifically, given a task on any website (e.g., “Compare iPhone 15 Pro Max with iPhone 13 Pro Max” on the Apple homepage), the agent first performs Action Generation to produce a textual description of the action at each step towards completing the task (e.g., “Navigate to the iPhone category”), and then performs Action Grounding to identify the corresponding HTML element (e.g., “[button] iPhone”) and operation (e.g., CLICK, TYPE, or SELECT) on the webpage.

With online evaluation on the Mind2Web dataset, SeeAct can successfully complete up to 50% of tasks on live websites given an oracle action grounding method. It also exhibits remarkable capabilities, ranging from speculative planning, webpage content reasoning, to self-correction of mistakes.

Authentic video with SeeAct agent working on live websites (sped-up version).

Updates

2024/5/18: Support for Gemini and LLaVA!
2024/5/1: SeeAct has been accepted to ICML'24!
2024/4/28: Released SeeAct Python Package, with many updates and many features on the way. Have a try with `pip install seeact`
2024/3/18: Multimodal-Mind2Web dataset released. We have paired each HTML document with the corresponding webpage screenshot image and saved the trouble of downloading Mind2Web Raw Dump.

Overview

SeeAct first performs Action Generation by leveraging an LMM, like GPT-4V, to visually perceive websites and generate plans in textual forms. We explicitly instruct GPT-4V to imitate humans browsing a webpage and analyze the task, webpage, and previous actions. It is asked to generate an action description based on its analysis and reasoning. Action Grounding is the next step to ground textual plans to the HTML elements and operations to act on the website.

Action Grounding

Despite the capability of LMMs in identifying and describing the next action to complete the given task in natural language, it is still challenging to convert the action description into an executable action within the environment. To address the challenge of action grounding, we explore three approaches using different types of information: Grounding via Element Attributes, Grounding via Textual Choices, and Grounding via Image Annotation.

Textual Choices: Some element candidates represented with HTML text are given, the model is required to generate the choice index of the target element.
Image Annotation: Bounding boxes and index labels are added to the image, the model is required to generate the label on the bottom-left of the target element.
Element Attributes: The model needs to predict the text and type of the target element.

An example of action generation in grounding via textual choices.

An example of grounding via textual choices after action generation.

An example of action generation in grounding via image annotation.

An example of grounding via image annotation after action generation.

An example of action generation in grounding via element attributes.

An example of grounding via element attributes after action generation.

Experiments and Results

We compare SeeAct with other models following MindAct's two-stage strategy. We evaluate supervised fine-tuning (SFT) methods using FLAN-T5 and BLIP-2-T5 and in-context learning (ICL) methods using GPT-3.5, GPT-4.

(1) SeeAct with GPT-4V is a strong generalist web agent if oracle grounding is provided, which substantially outperforms existing methods like GPT-4 or FLAN-T5
(2) Grounding is still a major challenge. The best grounding strategy still has a 20-25% gap with oracle grounding
(3) In-context learning with large models (both LMMs and LLMs) shows better generalization to unseen websites, while supervised fine-tuning still has an edge on websites seen during training

Results

SeeAct can successfully complete 50% of tasks on different websites if provided an oracle grounding method. We further investigate the performance of web agents on tasks across different difficulty levels. We estimate the task difficulty based on the number of actions taken by annotators during action trace annotation.

your second image description — Whole task success rate (%) under both offline and online evaluation. Offline0 and Offline1 refer to no tolerance for error at any step and allowing for error at one step, respectively.

Promising Capabilities

GPT-4V exhibits promising capabilities, ranging from speculative planning, webpage content reasoning, and error correction to surpassing the limitations of superficial textual similarity matching inherent in fine-tuned, text-only models.

In this example, the model generates a comprehensive plan for the task, including subsequent actions on the following pages that are not currently visible.

In this case, the textual history does not adequately capture two critical pieces of information. Firstly, the website automatically sets the drop-off date to the same day. Secondly, the ’No’ button was selected (However, it was missed in previous actions due to the button’s lack of text). Nevertheless, the model discerns these details through a meticulous screenshot analysis, enabling it to make the correct decision for the next step.

In this example, the task requires knowledge of the IATA code for Los Cabos International Airport. GPT-4V accurately provides the correct code.

In this example, the webpage displays an error message indicating an invalid phone number, a consequence of prior actions. The model identifies this error and prioritizes its immediate rectification, foregoing the subsequent planned steps.

Error Cases in Grounding via Image Annotation

(1) Wrong action generation
(2) Making up bounding box & label
(3) Failure to link bounding boxes with the correct labels

In this example, the model describes a correct element in action generation. However, the identified element is absent from the set of candidate elements. Despite this, the model erroneously assigns it the index number "12".

In this example, the model predicts the appropriate element. Nevertheless, the identified element is not present in the provided image options. Despite this, the model erroneously assigns it with the label ’5’ nearby.

In this case, while the model predicts the appropriate element, it incorrectly associates the element with the nearby label '10' instead of the correct label '11'.

In this case, while the model predicts the appropriate element, it incorrectly associates the element with the nearby label '7' instead of the correct label '6'.

BibTeX

@article{zheng2023seeact, title={GPT-4V(ision) is a Generalist Web Agent, if Grounded}, author={Boyuan Zheng and Boyu Gou and Jihyung Kil and Huan Sun and Yu Su}, journal={arXiv preprint arXiv:2401.01614}, year={2024}, }

@inproceedings{deng2023mindweb, title={Mind2Web: Towards a Generalist Agent for the Web}, author={Xiang Deng and Yu Gu and Boyuan Zheng and Shijie Chen and Samuel Stevens and Boshi Wang and Huan Sun and Yu Su}, booktitle={Thirty-seventh Conference on Neural Information Processing Systems}, year={2023}, url={https://openreview.net/forum?id=kiYqbO3wqw} }

SeeAct

GPT-4V(ision) is a Generalist Web Agent, if Grounded

SeeAct Method