Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs
Keen You, Haotian Zhang, Eldon Schoop, Floris Weers, Amanda Swearngin, Jeffrey Nichols, Yinfei Yang, Zhe Gan
- 🏛 Institutions
- Apple
- 📅 Date
- April 8, 2024
- 📑 Publisher
- ECCV 2024 (Poster)
- 💻 Env
- Mobile
- 🔑 Keywords
TLDR
Ferret-UI is a multimodal LLM (MLLM) for mobile UI screens that pairs an "any resolution" screen-encoding scheme with curated training data spanning elementary UI tasks (such as widget referring and grounding) and advanced reasoning tasks. The paper also releases a benchmark covering those tasks and reports that Ferret-UI outperforms most open-source UI MLLMs and exceeds GPT-4V on all elementary UI tasks.
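To make the "any resolution" idea concrete, here is a minimal sketch of how such a screen encoding could work, assuming the screen is split into two aspect-ratio-aware sub-images that are encoded alongside a global view. The `encode_screen_anyres` function, the generic `encoder` callable, and the fixed `resolution` are illustrative assumptions, not the paper's actual implementation.

```python
from PIL import Image
import torch

def encode_screen_anyres(screen: Image.Image, encoder, resolution: int = 336) -> torch.Tensor:
    """Sketch of an 'any resolution' screen encoding: encode the full screen
    plus two aspect-ratio-aware sub-images. `encoder` is a stand-in for a
    CLIP-style vision tower mapping a fixed-size image to a
    [num_patches, dim] feature tensor (hypothetical API).
    """
    w, h = screen.size
    if h >= w:
        # Portrait screen: split into top and bottom halves so each
        # sub-image is closer to the encoder's square input shape.
        subs = [screen.crop((0, 0, w, h // 2)), screen.crop((0, h // 2, w, h))]
    else:
        # Landscape screen: split into left and right halves.
        subs = [screen.crop((0, 0, w // 2, h)), screen.crop((w // 2, 0, w, h))]

    # Global view plus sub-image views, each resized to the encoder input size.
    views = [screen] + subs
    feats = [encoder(v.resize((resolution, resolution))) for v in views]

    # Concatenate along the token axis; the LLM consumes these as visual tokens.
    return torch.cat(feats, dim=0)
```

The sub-image views preserve fine-grained detail (small icons, text) that a single downscaled global view would lose on elongated phone screens, which is the motivation the paper gives for the scheme.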
Related papers
- PSPA-Bench: A Personalized Benchmark for Smartphone GUI Agent · March 31, 2026 · arXiv
- SecAgent: Efficient Mobile GUI Agent with Semantic Context · March 9, 2026 · arXiv
- Turing Test on Screen: A Benchmark for Mobile GUI Agent Humanization · February 24, 2026 · arXiv
- AmbiBench: Benchmarking Mobile GUI Agents Beyond One-Shot Instructions in the Wild · February 12, 2026 · arXiv
- MemGUI-Bench: Benchmarking Memory of Mobile GUI Agents in Dynamic Environments · February 3, 2026 · arXiv
- SwipeGen: Bridging the Execution Gap in GUI Agents via Human-like Swipe Synthesis · January 26, 2026 · arXiv