Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V

Jianwei Yang, Hao Zhang, Feng Li, Xueyan Zou, Chunyuan Li, Jianfeng Gao

🏛 Institutions: Microsoft Research
📅 Date: October 17, 2023
📑 Publisher: arXiv
🔑 Keywords: visual prompting, Set-of-Mark, visual grounding, zero-shot, region marking

TLDR

Introduces Set-of-Mark prompting, where segmented image regions are overlaid with explicit marks before being passed to a multimodal model. The paper shows that simple region marking can unlock much stronger zero-shot grounding from GPT-4V without fine-tuning.
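
The marking step described above can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the segmentation masks are already available as binary grids, numbers regions in input order, and places each mark at the region pixel closest to the centroid, which is a simple stand-in for whatever placement rule the paper uses. The function names `mark_position`, `assign_marks`, and `build_prompt`, and the prompt wording, are hypothetical.

```python
from typing import Dict, List, Tuple

Mask = List[List[int]]  # binary mask: 1 inside the region, 0 outside


def mark_position(mask: Mask) -> Tuple[int, int]:
    """Return the (row, col) of the region pixel closest to the region centroid."""
    pixels = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    if not pixels:
        raise ValueError("mask contains no region pixels")
    centroid_r = sum(r for r, _ in pixels) / len(pixels)
    centroid_c = sum(c for _, c in pixels) / len(pixels)
    # Snap to a real region pixel so the mark never lands outside the region
    # (the raw centroid of a concave region can fall outside it).
    return min(
        pixels,
        key=lambda p: (p[0] - centroid_r) ** 2 + (p[1] - centroid_c) ** 2,
    )


def assign_marks(masks: List[Mask]) -> Dict[int, Tuple[int, int]]:
    """Assign numeric marks 1..N to regions in input order (an assumption)."""
    return {label: mark_position(m) for label, m in enumerate(masks, start=1)}


def build_prompt(question: str, marks: Dict[int, Tuple[int, int]]) -> str:
    """Compose a text prompt that refers to regions by mark number (hypothetical wording)."""
    labels = ", ".join(str(k) for k in sorted(marks))
    return f"The image has regions marked {labels}. {question} Answer with a mark number."


if __name__ == "__main__":
    # Two toy regions on a 5x5 image: a 3x3 block and a 2x2 block.
    region_a = [[1 if r < 3 and c < 3 else 0 for c in range(5)] for r in range(5)]
    region_b = [[1 if r >= 3 and c >= 3 else 0 for c in range(5)] for r in range(5)]
    marks = assign_marks([region_a, region_b])
    print(marks)
    print(build_prompt("Which region is the button?", marks))
```

In a full pipeline the returned positions would be used to draw the numbers onto the image before it is sent to the multimodal model; the drawing and model call are left out here because they depend on the chosen imaging library and API.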


Related papers (8)

CocoaBench: Evaluating Unified Digital Agents in the Wild
April 13, 2026 · arXiv

Are GUI Agents Focused Enough? Automated Distraction via Semantic-level UI Element Injection
April 9, 2026 · arXiv

ToolTok: Tool Tokenization for Efficient and Generalizable GUI Agents
January 30, 2026 · arXiv

How do Visual Attributes Influence Web Agents? A Comprehensive Evaluation of User Interface Design Factors
January 29, 2026 · arXiv

iSHIFT: Lightweight Slow-Fast GUI Agent with Adaptive Perception
December 26, 2025 · arXiv

Iris: Breaking GUI Complexity with Adaptive Focus and Self-Refining
December 13, 2024 · arXiv

Improved GUI Grounding via Iterative Narrowing
November 18, 2024 · arXiv

Visual Grounding for User Interfaces
June 16, 2024 · NAACL 2024 Industry Track