Towards GUI Agents: Vision-Language Diffusion Models for GUI Grounding
Shrinidhi Kumbhar, Haofu Liao, Srikar Appalaraju, Kunwar Yashraj Singh
- 🏛 Institutions
- Arizona State University, Amazon (AWS Agentic AI)
- 📅 Date
- March 27, 2026
- 📑 Publisher
- CVPR 2026
- 💻 Env
- General GUI
- 🔑 Keywords
TLDR
This paper adapts the discrete diffusion model LLaDA-V to GUI grounding and proposes a hybrid masking schedule for bounding-box prediction. Across web, desktop, and mobile benchmarks, the hybrid-masked model outperforms its linear-masked variant and remains competitive with autoregressive VLMs.
Related papers
- AdaZoom-GUI: Adaptive Zoom-based GUI Grounding with Instruction Refinement · March 18, 2026 · arXiv
- UI-Zoomer: Uncertainty-Driven Adaptive Zoom-In for GUI Grounding · April 15, 2026 · arXiv
- GUI-Perturbed: Domain Randomization Reveals Systematic Brittleness in GUI Grounding Models · April 15, 2026 · arXiv
- See, Point, Refine: Multi-Turn Approach to GUI Grounding with Visual Feedback · April 14, 2026 · arXiv
- What's Missing in Screen-to-Action? Towards a UI-in-the-Loop Paradigm for Multimodal GUI Reasoning · April 8, 2026 · Findings of ACL 2026
- Zoom to Essence: Trainless GUI Grounding by Inferring upon Interface Elements · March 15, 2026 · arXiv