Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs
Keen You, Haotian Zhang, Eldon Schoop, Floris Weers, Amanda Swearngin, Jeffrey Nichols, Yinfei Yang, Zhe Gan
- 🏛 Institutions
- Apple
- 📅 Date
- April 8, 2024
- 📑 Publisher
- ECCV 2024 (Poster)
- 💻 Env
- Mobile
- 🔑 Keywords
TLDR
Ferret-UI is a multimodal LLM (MLLM) for mobile UI screens that pairs an "any resolution" screen-encoding scheme with curated training data spanning elementary UI tasks (such as widget referring and grounding) and advanced reasoning tasks. The paper also releases a benchmark covering those tasks and reports that Ferret-UI outperforms most open-source UI MLLMs and exceeds GPT-4V on all elementary UI tasks.
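To make the "any resolution" idea concrete, here is a minimal sketch of how such a screen encoding could work, assuming the screen is split into two aspect-ratio-aware sub-images that are encoded alongside a global view. The `encode_screen_anyres` function, the generic `encoder` callable, and the fixed `resolution` are illustrative assumptions, not the paper's actual implementation.

```python
from PIL import Image
import torch

def encode_screen_anyres(screen: Image.Image, encoder, resolution: int = 336) -> torch.Tensor:
    """Sketch of an 'any resolution' screen encoding: encode the full screen
    plus two aspect-ratio-aware sub-images. `encoder` is a stand-in for a
    CLIP-style vision tower mapping a fixed-size image to a
    [num_patches, dim] feature tensor (hypothetical API).
    """
    w, h = screen.size
    if h >= w:
        # Portrait screen: split into top and bottom halves so each
        # sub-image is closer to the encoder's square input shape.
        subs = [screen.crop((0, 0, w, h // 2)), screen.crop((0, h // 2, w, h))]
    else:
        # Landscape screen: split into left and right halves.
        subs = [screen.crop((0, 0, w // 2, h)), screen.crop((w // 2, 0, w, h))]

    # Global view plus sub-image views, each resized to the encoder input size.
    views = [screen] + subs
    feats = [encoder(v.resize((resolution, resolution))) for v in views]

    # Concatenate along the token axis; the LLM consumes these as visual tokens.
    return torch.cat(feats, dim=0)
```

The sub-image views preserve fine-grained detail (small icons, text) that a single downscaled global view would lose on elongated phone screens, which is the motivation the paper gives for the scheme.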
Related papers
- PSPA-Bench: A Personalized Benchmark for Smartphone GUI Agent · March 31, 2026 · arXiv
- SecAgent: Efficient Mobile GUI Agent with Semantic Context · March 9, 2026 · arXiv
- Turing Test on Screen: A Benchmark for Mobile GUI Agent Humanization · February 24, 2026 · arXiv
- AmbiBench: Benchmarking Mobile GUI Agents Beyond One-Shot Instructions in the Wild · February 12, 2026 · arXiv
- MemGUI-Bench: Benchmarking Memory of Mobile GUI Agents in Dynamic Environments · February 3, 2026 · arXiv
- SwipeGen: Bridging the Execution Gap in GUI Agents via Human-like Swipe Synthesis · January 26, 2026 · arXiv