Magma: A Foundation Model for Multimodal AI Agents
Jianwei Yang, Reuben Tan, Qianhui Wu, Ruijie Zheng, Baolin Peng, Yongyuan Liang, Yu Gu, Mu Cai, Seonghyeon Ye, Joel Jang, Yuquan Deng, Lars Liden, Jianfeng Gao
- 🏛 Institutions
- Microsoft Research, University of Maryland, University of Wisconsin-Madison, KAIST, University of Washington
- 📅 Date
- February 18, 2025
- 📑 Publisher
- CVPR 2025
TLDR
Magma is a multimodal foundation model for agentic tasks spanning digital and physical environments, not a GUI-specific paper. It is relevant here because it reports strong UI-navigation results and uses Set-of-Mark (SoM) and Trace-of-Mark (ToM) supervision, but its main contribution is a broader agentic model that covers robotics as well as GUI tasks.
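For context, Set-of-Mark supervision overlays numbered visual marks on candidate UI regions so the model can refer to actions by mark index. A minimal sketch of that annotation step, using Pillow and hypothetical box coordinates (not the paper's actual pipeline):

```python
# Minimal sketch of Set-of-Mark (SoM)-style annotation: draw a numbered
# box over each candidate UI region so a vision-language model can name
# an action target by its mark index. Box coordinates are made up here.
from PIL import Image, ImageDraw

def overlay_marks(image, boxes):
    """Draw each (x0, y0, x1, y1) box with a numeric label; return a copy."""
    annotated = image.copy()
    draw = ImageDraw.Draw(annotated)
    for idx, (x0, y0, x1, y1) in enumerate(boxes, start=1):
        draw.rectangle((x0, y0, x1, y1), outline="red", width=2)
        draw.text((x0 + 3, y0 + 3), str(idx), fill="red")
    return annotated

# Stand-in for a UI screenshot; a real agent would use a captured frame.
screenshot = Image.new("RGB", (320, 240), "white")
marked = overlay_marks(screenshot, [(20, 20, 120, 60), (140, 100, 300, 160)])
```

Trace-of-Mark extends this idea temporally, supervising the model to predict where marks move in future frames; the sketch above covers only the static marking step.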
Related papers
- Training Computer Use Agents to Assess the Usability of Graphical User Interfaces · April 28, 2026 · arXiv
- ClawGUI: A Unified Framework for Training, Evaluating, and Deploying GUI Agents · April 13, 2026 · arXiv
- MolmoWeb: Open Visual Web Agent and Open Data for the Open Web · April 9, 2026 · arXiv
- IntentScore: Intent-Conditioned Action Evaluation for Computer-Use Agents · April 6, 2026 · arXiv
- SecAgent: Efficient Mobile GUI Agent with Semantic Context · March 9, 2026 · arXiv
- Mobile-Agent-v3.5: Multi-platform Fundamental GUI Agents · February 15, 2026 · arXiv