GUI Agents Papers

Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs

Keen You, Haotian Zhang, Eldon Schoop, Floris Weers, Amanda Swearngin, Jeffrey Nichols, Yinfei Yang, Zhe Gan

🏛 Institutions
Apple
📅 Date
April 8, 2024
📑 Publisher
ECCV 2024 (Poster)
💻 Env
Mobile
🔑 Keywords
TLDR

Ferret-UI is a mobile-screen MLLM that adds an any-resolution screen encoding scheme together with curated training data for elementary UI tasks and advanced reasoning tasks. The paper also releases a benchmark covering those tasks and reports that Ferret-UI outperforms most open UI MLLMs and exceeds GPT-4V on all elementary UI tasks.
