GUI Agents Papers

Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs

Keen You, Haotian Zhang, Eldon Schoop, Floris Weers, Amanda Swearngin, Jeffrey Nichols, Yinfei Yang, Zhe Gan

🏛 Institutions
Apple
📅 Date
April 8, 2024
📑 Publisher
ECCV 2024 (Poster)
💻 Env
Mobile
🔑 Keywords
TLDR

Ferret-UI is a mobile-screen MLLM that introduces an "any resolution" screen-encoding scheme together with curated training data for elementary UI tasks and advanced reasoning tasks. The paper also releases a benchmark covering these tasks and reports that Ferret-UI outperforms most open-source UI MLLMs and exceeds GPT-4V on all elementary UI tasks.
