Multimodal Web Navigation with Instruction-Finetuned Foundation Models

Hiroki Furuta, Kuang-Huei Lee, Ofir Nachum, Yutaka Matsuo, Aleksandra Faust, Shixiang Shane Gu, Izzeddin Gur

🏛 Institutions: University of Tokyo, Google DeepMind
📅 Date: May 19, 2023
📑 Publisher: ICLR 2024
💻 Env: Web
🔑 Keywords: model, dataset, WebGUM, offline training, demonstration learning, cross-benchmark transfer

TLDR

This paper studies offline multimodal web-agent training with WebGUM, which takes both webpage screenshots and HTML as input. It also releases 347K demonstrations and shows strong gains on MiniWoB and WebShop, with positive transfer to Mind2Web.


Related papers (24)

MolmoWeb: Open Visual Web Agent and Open Data for the Open Web
April 9, 2026 · arXiv

OpenCUA: Open Foundations for Computer-Use Agents
August 12, 2025 · NeurIPS 2025 (Spotlight)

Web-Shepherd: Advancing PRMs for Reinforcing Web Agents
May 21, 2025 · NeurIPS 2025 (Spotlight)

SpiritSight Agent: Advanced GUI Agent with One Look
March 5, 2025 · CVPR 2025 (Poster)

ShowUI: One Vision-Language-Action Model for GUI Visual Agent
November 26, 2024 · CVPR 2025 (Poster)

OS-ATLAS: A Foundation Action Model for Generalist GUI Agents
October 30, 2024 · ICLR 2025 (Spotlight)

CogAgent: A Visual Language Model for GUI Agents
December 14, 2023 · CVPR 2024 (Highlight)

SecAgent: Efficient Mobile GUI Agent with Semantic Context
March 9, 2026 · arXiv

ShowUI-π: Flow-based Generative Models as GUI Dexterous Hands
December 31, 2025 · arXiv

UI-Genie: A Self-Improving Approach for Iteratively Boosting MLLM-based Mobile GUI Agents
May 27, 2025 · NeurIPS 2025 (Poster)

Efficient Agent Training for Computer Use
May 20, 2025 · ICLR 2026 (Poster)

STEVE: A Step Verification Pipeline for Computer-use Agent Training
March 16, 2025 · arXiv

Falcon-UI: Understanding GUI Before Following User Instructions
December 12, 2024 · arXiv

Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction
December 5, 2024 · ICML 2025 (Poster)

MobileVLM: A Vision-Language Model for Better Intra- and Inter-UI Understanding
September 23, 2024 · Findings of EMNLP 2024

ScreenAI: A Vision-Language Model for UI and Infographics Understanding
February 7, 2024 · IJCAI 2024

Spotlight: Mobile UI Understanding using Vision-Language Models with a Focus
September 29, 2022 · ICLR 2023 (Poster)

WebForge: Breaking the Realism-Reproducibility-Scalability Trilemma in Browser Agent Benchmark
April 13, 2026 · arXiv

WebArena-Infinity: Generating Browser Environments with Verifiable Tasks at Scale
March 2026 · Blog Post

WebChain: A Large-Scale Human-Annotated Dataset of Real-World Web Interaction Traces
March 5, 2026 · arXiv

Modeling Distinct Human Interaction in Web Agents
February 19, 2026 · arXiv

Mobile-Agent-v3.5: Multi-platform Fundamental GUI Agents
February 15, 2026 · arXiv

UltraCUA: A Foundation Model for Computer Use Agents with Hybrid Action
October 20, 2025 · arXiv

ScaleCUA: Scaling Open-Source Computer Use Agents with Cross-Platform Data
September 18, 2025 · ICLR 2026 (Oral)