Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V
Jianwei Yang, Hao Zhang, Feng Li, Xueyan Zou, Chunyuan Li, Jianfeng Gao
- 🏛 Institutions: Microsoft Research
- 📅 Date: October 17, 2023
- 📑 Publisher: arXiv
TLDR
Introduces Set-of-Mark (SoM) prompting: an image is segmented into regions, each region is overlaid with an explicit mark such as a number or letter, and the marked image is passed to a multimodal model. The paper shows that this simple region marking substantially improves GPT-4V's zero-shot visual grounding without any fine-tuning.
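The marking step can be illustrated with a minimal sketch. This is not the paper's implementation: the `Region` type, the box-center label placement, the prompt wording, and all function names are assumptions made for illustration, and the regions are hard-coded where a real pipeline would take them from a segmentation model. Drawing the labels onto the image and calling the multimodal model are left out.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Region:
    """Hypothetical region: an axis-aligned box (x0, y0, x1, y1) in pixels."""
    x0: int
    y0: int
    x1: int
    y1: int

    def center(self) -> Tuple[int, int]:
        return ((self.x0 + self.x1) // 2, (self.y0 + self.y1) // 2)


def assign_marks(regions: List[Region]) -> Dict[int, Region]:
    """Give each region a numeric mark, starting at 1, in input order."""
    return {i: r for i, r in enumerate(regions, start=1)}


def mark_positions(marks: Dict[int, Region]) -> Dict[int, Tuple[int, int]]:
    """Where to draw each label; the box center is an assumed placement rule."""
    return {m: r.center() for m, r in marks.items()}


def build_prompt(question: str, marks: Dict[int, Region]) -> str:
    """Compose a text prompt that refers to the marks (wording is illustrative)."""
    ids = ", ".join(str(m) for m in sorted(marks))
    return (
        f"The image has regions labeled with numeric marks: {ids}. "
        f"{question} Answer with the mark number only."
    )


def resolve_answer(answer: str, marks: Dict[int, Region]) -> Region:
    """Map the model's textual answer (e.g. "2") back to the marked region."""
    return marks[int(answer.strip())]


# Stand-in regions; a real pipeline would obtain these from a segmenter.
regions = [Region(10, 10, 50, 40), Region(60, 20, 120, 100)]
marks = assign_marks(regions)
positions = mark_positions(marks)
prompt = build_prompt("Which region contains the dog?", marks)
```

Because the model answers with a mark rather than raw coordinates, grounding reduces to a lookup: `resolve_answer("2", marks)` returns the second region's box.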
Related papers
- CocoaBench: Evaluating Unified Digital Agents in the Wild · April 13, 2026 · arXiv
- Are GUI Agents Focused Enough? Automated Distraction via Semantic-level UI Element Injection · April 9, 2026 · arXiv
- ToolTok: Tool Tokenization for Efficient and Generalizable GUI Agents · January 30, 2026 · arXiv
- How do Visual Attributes Influence Web Agents? A Comprehensive Evaluation of User Interface Design Factors · January 29, 2026 · arXiv
- iSHIFT: Lightweight Slow-Fast GUI Agent with Adaptive Perception · December 26, 2025 · arXiv
- Iris: Breaking GUI Complexity with Adaptive Focus and Self-Refining · December 13, 2024 · arXiv