Multimodal Web Navigation with Instruction-Finetuned Foundation Models
Hiroki Furuta, Kuang-Huei Lee, Ofir Nachum, Yutaka Matsuo, Aleksandra Faust, Shixiang Shane Gu, Izzeddin Gur
- 🏛 Institutions
- University of Tokyo, Google DeepMind
- 📅 Date
- May 19, 2023
- 📑 Publisher
- ICLR 2024
- 💻 Env
- Web
- 🔑 Keywords
TLDR
This paper studies offline multimodal web-agent training with WebGUM, which takes both webpage screenshots and HTML as input. It also releases 347K demonstrations and shows strong gains on MiniWoB and WebShop, with positive transfer to Mind2Web.
Related papers (24)
- MolmoWeb: Open Visual Web Agent and Open Data for the Open Web · April 9, 2026 · arXiv
- OpenCUA: Open Foundations for Computer-Use Agents · August 12, 2025 · NeurIPS 2025 (Spotlight)
- Web-Shepherd: Advancing PRMs for Reinforcing Web Agents · May 21, 2025 · NeurIPS 2025 (Spotlight)
- SpiritSight Agent: Advanced GUI Agent with One Look · March 5, 2025 · CVPR 2025 (Poster)
- ShowUI: One Vision-Language-Action Model for GUI Visual Agent · November 26, 2024 · CVPR 2025 (Poster)
- OS-ATLAS: A Foundation Action Model for Generalist GUI Agents · October 30, 2024 · ICLR 2025 (Spotlight)
- CogAgent: A Visual Language Model for GUI Agents · December 14, 2023 · CVPR 2024 (Highlight)
- SecAgent: Efficient Mobile GUI Agent with Semantic Context · March 9, 2026 · arXiv
- ShowUI-π: Flow-based Generative Models as GUI Dexterous Hands · December 31, 2025 · arXiv
- UI-Genie: A Self-Improving Approach for Iteratively Boosting MLLM-based Mobile GUI Agents · May 27, 2025 · NeurIPS 2025 (Poster)
- Efficient Agent Training for Computer Use · May 20, 2025 · ICLR 2026 (Poster)
- STEVE: A Step Verification Pipeline for Computer-use Agent Training · March 16, 2025 · arXiv
- Falcon-UI: Understanding GUI Before Following User Instructions · December 12, 2024 · arXiv
- Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction · December 5, 2024 · ICML 2025 (Poster)
- MobileVLM: A Vision-Language Model for Better Intra- and Inter-UI Understanding · September 23, 2024 · Findings of EMNLP 2024
- ScreenAI: A Vision-Language Model for UI and Infographics Understanding · February 7, 2024 · IJCAI 2024
- Spotlight: Mobile UI Understanding using Vision-Language Models with a Focus · September 29, 2022 · ICLR 2023 (Poster)
- WebForge: Breaking the Realism-Reproducibility-Scalability Trilemma in Browser Agent Benchmark · April 13, 2026 · arXiv
- WebArena-Infinity: Generating Browser Environments with Verifiable Tasks at Scale · March 2026 · Blog Post
- WebChain: A Large-Scale Human-Annotated Dataset of Real-World Web Interaction Traces · March 5, 2026 · arXiv
- Modeling Distinct Human Interaction in Web Agents · February 19, 2026 · arXiv
- Mobile-Agent-v3.5: Multi-platform Fundamental GUI Agents · February 15, 2026 · arXiv
- UltraCUA: A Foundation Model for Computer Use Agents with Hybrid Action · October 20, 2025 · arXiv
- ScaleCUA: Scaling Open-Source Computer Use Agents with Cross-Platform Data · September 18, 2025 · ICLR 2026 (Oral)