MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning
Haotian Zhang, Mingfei Gao, Zhe Gan, Philipp Dufter, Nina Wenzel, Forrest Huang, Dhruti Shah, Xianzhi Du, Bowen Zhang, Yanghao Li, Sam Dodge, Keen You, Zhen Yang, Aleksei Timofeev, Mingze Xu, Hong-You Chen, Jean-Philippe Fauconnier, Zhengfeng Lai, Haoxuan You, Zirui Wang, Afshin Dehghan, Peter Grasch, Yinfei Yang
- 🏛 Institutions: Apple
- 📅 Date: September 30, 2024
- 📑 Publisher: ICLR 2025 (Poster)
- 💻 Env
- 🔑 Keywords
This paper introduces MM1.5, a family of multimodal large language models (MLLMs) ranging from 1B to 30B parameters, including dense and mixture-of-experts variants. MM1.5 enhances capabilities in text-rich image understanding, visual referring and grounding, and multi-image reasoning. The authors employ a data-centric training approach, utilizing high-quality OCR data and synthetic captions for continual pre-training, alongside an optimized visual instruction-tuning data mixture for supervised fine-tuning. Specialized variants, MM1.5-Video and MM1.5-UI, are designed for video understanding and mobile UI comprehension, respectively. Extensive empirical studies provide insights into the training processes, offering guidance for future MLLM development.
- Iris: Breaking GUI Complexity with Adaptive Focus and Self-Refining · December 13, 2024 · arXiv
- Training Computer Use Agents to Assess the Usability of Graphical User Interfaces · April 28, 2026 · arXiv
- CocoaBench: Evaluating Unified Digital Agents in the Wild · April 13, 2026 · arXiv
- ClawGUI: A Unified Framework for Training, Evaluating, and Deploying GUI Agents · April 13, 2026 · arXiv
- MolmoWeb: Open Visual Web Agent and Open Data for the Open Web · April 9, 2026 · arXiv
- Are GUI Agents Focused Enough? Automated Distraction via Semantic-level UI Element Injection · April 9, 2026 · arXiv
- IntentScore: Intent-Conditioned Action Evaluation for Computer-Use Agents · April 6, 2026 · arXiv
- SecAgent: Efficient Mobile GUI Agent with Semantic Context · March 9, 2026 · arXiv
- Mobile-Agent-v3.5: Multi-platform Fundamental GUI Agents · February 15, 2026 · arXiv
- UI-Oceanus: Scaling GUI Agents with Synthetic Environmental Dynamics · February 11, 2026 · arXiv
- ToolTok: Tool Tokenization for Efficient and Generalizable GUI Agents · January 30, 2026 · arXiv
- How do Visual Attributes Influence Web Agents? A Comprehensive Evaluation of User Interface Design Factors · January 29, 2026 · arXiv