Dual-View Visual Contextualization for Web Navigation
Jihyung Kil, Chan Hee Song, Boyuan Zheng, Xiang Deng, Yu Su, Wei-Lun Chao
- 🏛 Institutions
  - OSU
- 📅 Date
  - February 6, 2024
- 📑 Publisher
  - CVPR 2024 (Poster)
- 💻 Env
  - Web
- 🔑 Keywords
TLDR
This paper contextualizes each HTML element with its corresponding screenshot region and nearby elements, combining textual and visual features to represent webpage elements more informatively. It evaluates the approach on Mind2Web and reports consistent gains in cross-task, cross-website, and cross-domain settings.
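As a rough illustration of what "contextualizing" an element with its screenshot region and nearby elements could look like, here is a minimal sketch. It is not the paper's implementation: the `Box` type, the `context_region` helper, the fixed pixel `margin`, and the overlap-based definition of "nearby" are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in screenshot pixel coordinates (hypothetical type)."""
    left: float
    top: float
    right: float
    bottom: float


def expand(box: Box, margin: float) -> Box:
    """Grow a box by `margin` pixels on every side."""
    return Box(box.left - margin, box.top - margin,
               box.right + margin, box.bottom + margin)


def overlaps(a: Box, b: Box) -> bool:
    """True if the two boxes share any interior area."""
    return (a.left < b.right and b.left < a.right
            and a.top < b.bottom and b.top < a.bottom)


def union(a: Box, b: Box) -> Box:
    """Smallest box covering both inputs."""
    return Box(min(a.left, b.left), min(a.top, b.top),
               max(a.right, b.right), max(a.bottom, b.bottom))


def clip(box: Box, width: float, height: float) -> Box:
    """Restrict a box to the screenshot bounds."""
    return Box(max(0.0, box.left), max(0.0, box.top),
               min(width, box.right), min(height, box.bottom))


def context_region(target: Box, others: list[Box], margin: float,
                   width: float, height: float) -> tuple[Box, list[Box]]:
    """Return a crop region covering the target and its nearby elements.

    Assumption for this sketch: an element counts as "nearby" if its box
    overlaps the target box expanded by `margin` pixels.
    """
    window = expand(target, margin)
    neighbors = [b for b in others if overlaps(window, b)]
    region = target
    for b in neighbors:
        region = union(region, b)
    return clip(region, width, height), neighbors


# Example: a button with one adjacent element and one far-away element.
target = Box(100, 100, 200, 140)
others = [Box(210, 100, 300, 140), Box(600, 600, 700, 640)]
region, neighbors = context_region(target, others, margin=20,
                                   width=1280, height=720)
```

The resulting `region` could then be used to crop the screenshot (for instance with Pillow's `Image.crop`, which takes a `(left, upper, right, lower)` tuple), and the crop paired with the text of the target and its neighbors as the element's combined representation.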
Related papers
- Iris: Breaking GUI Complexity with Adaptive Focus and Self-Refining · December 13, 2024 · arXiv
- Visual Grounding for User Interfaces · June 16, 2024 · NAACL 2024 Industry Track
- Enhancing Web Agents with a Hierarchical Memory Tree · March 7, 2026 · arXiv
- OpeFlo: Automated UX Evaluation via Simulated Human Web Interaction with GUI Grounding · February 25, 2026 · arXiv
- WebTestPilot: Agentic End-to-End Web Testing against Natural Language Specification by Inferring Oracles with Symbolized GUI Elements · February 12, 2026 · arXiv
- How do Visual Attributes Influence Web Agents? A Comprehensive Evaluation of User Interface Design Factors · January 29, 2026 · arXiv