Dual-View Visual Contextualization for Web Navigation
Jihyung Kil, Chan Hee Song, Boyuan Zheng, Xiang Deng, Yu Su, Wei-Lun Chao
- 🏛 Institutions: OSU
- 📅 Date: February 6, 2024
- 📑 Publisher: CVPR 2024 (Poster)
- 💻 Env: Web
TLDR
This paper contextualizes each HTML element with its corresponding screenshot region and nearby elements, combining textual and visual features to represent webpage elements more informatively. It evaluates the approach on Mind2Web and reports consistent gains in cross-task, cross-website, and cross-domain settings.
Related papers (24)
- Iris: Breaking GUI Complexity with Adaptive Focus and Self-Refining · December 13, 2024 · arXiv
- Visual Grounding for User Interfaces · June 16, 2024 · NAACL 2024 Industry Track
- Enhancing Web Agents with a Hierarchical Memory Tree · March 7, 2026 · arXiv
- OpeFlo: Automated UX Evaluation via Simulated Human Web Interaction with GUI Grounding · February 25, 2026 · arXiv
- WebTestPilot: Agentic End-to-End Web Testing against Natural Language Specification by Inferring Oracles with Symbolized GUI Elements · February 12, 2026 · arXiv
- How do Visual Attributes Influence Web Agents? A Comprehensive Evaluation of User Interface Design Factors · January 29, 2026 · arXiv
- VenusBench-GD: A Comprehensive Multi-Platform GUI Benchmark for Diverse Grounding Tasks · December 18, 2025 · arXiv
- ReGUIDE: Data Efficient GUI Grounding via Spatial Reasoning and Search · May 21, 2025 · arXiv
- ScaleTrack: Scaling and back-tracking Automated GUI Agents · May 1, 2025 · arXiv
- UI-TARS: Pioneering Automated GUI Interaction with Native Agents · January 21, 2025 · arXiv
- Ponder & Press: Advancing Visual GUI Agent towards General Computer Control · December 2, 2024 · Findings of ACL 2025
- Improved GUI Grounding via Iterative Narrowing · November 18, 2024 · arXiv
- OS-ATLAS: A Foundation Action Model for Generalist GUI Agents · October 30, 2024 · ICLR 2025 (Spotlight)
- TinyClick: Single-Turn Agent for Empowering GUI Automation · October 9, 2024 · INTERSPEECH 2025
- Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents · October 7, 2024 · ICLR 2025 (Oral)
- From Grounding to Planning: Benchmarking Bottlenecks in Web Agents · September 3, 2024 · ECAI 2025
- SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents · January 17, 2024 · ACL 2024
- GPT-4V(ision) is a Generalist Web Agent, if Grounded · January 3, 2024 · ICML 2024
- CogAgent: A Visual Language Model for GUI Agents · December 14, 2023 · CVPR 2024 (Highlight)
- Mind2Web: Towards a Generalist Agent for the Web · June 9, 2023 · NeurIPS 2023 Datasets and Benchmarks Track
- GUI-C²: Coarse-to-Fine GUI Grounding via Difficulty-Aware Reinforcement Learning · May 29, 2026 · arXiv
- UI-Zoomer: Uncertainty-Driven Adaptive Zoom-In for GUI Grounding · April 15, 2026 · arXiv
- GUI-Perturbed: Domain Randomization Reveals Systematic Brittleness in GUI Grounding Models · April 15, 2026 · arXiv
- See, Point, Refine: Multi-Turn Approach to GUI Grounding with Visual Feedback · April 14, 2026 · arXiv