CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents

Tianqi Xu, Linyao Chen, Dai-Jie Wu, Yanjun Chen, Zecheng Zhang, Xiang Yao, Zhiqiang Xie, Yongchao Chen, Shilong Liu, Bochen Qian, Anjie Yang, Zhaoxuan Jin, Jianbo Deng, Philip Torr, Bernard Ghanem, Guohao Li

🏛 Institutions: KAUST, Eigent.AI, CAMEL-AI.org, UTokyo, CMU, Stanford, Harvard, Tsinghua, SUSTech, Oxford, NU
📅 Date: July 1, 2024
📑 Publisher: Findings of ACL 2025
💻 Env: Desktop, Mobile
🔑 Keywords: benchmark, cross-environment tasks, graph-based evaluation, task generation, CRAB

TLDR

CRAB is a benchmark framework for multimodal agents that supports cross-environment tasks and graph-based fine-grained evaluation instead of single-platform end-state scoring. Its CRAB Benchmark-v0 release contains 120 desktop and mobile tasks, and the paper reports a best completion ratio of 38.01% from a single GPT-4o agent.
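To make the idea of graph-based scoring concrete, here is a minimal sketch of how a completion ratio could be computed over a task graph. It is an illustration under stated assumptions, not CRAB's actual evaluator: the function name `completion_ratio`, the node names, and the rule that a checkpoint only counts once all of its prerequisites have been completed are hypothetical choices made for this example.

```python
from graphlib import TopologicalSorter


def completion_ratio(graph, checks):
    """Fraction of checkpoint nodes completed in a task graph.

    graph  : dict mapping each node to the set of nodes it depends on
    checks : dict mapping each node to True if its checkpoint was verified

    Assumption (not taken from the paper): a node counts as completed only
    if its own check passed and every prerequisite node is also completed.
    """
    order = list(TopologicalSorter(graph).static_order())
    completed = set()
    for node in order:
        prerequisites = graph.get(node, ())
        if checks.get(node, False) and all(p in completed for p in prerequisites):
            completed.add(node)
    return len(completed) / len(order)


# Hypothetical cross-environment task: read a message on the phone,
# then open a terminal on the desktop, then run the requested command.
task_graph = {
    "read_message_on_phone": set(),
    "open_terminal_on_desktop": {"read_message_on_phone"},
    "run_command": {"open_terminal_on_desktop"},
}
observed = {
    "read_message_on_phone": True,
    "open_terminal_on_desktop": True,
    "run_command": False,
}
ratio = completion_ratio(task_graph, observed)
print(f"{ratio:.2%}")
```

Unlike a pass/fail check on the final state, a score of this kind gives partial credit to an agent that finishes the first two steps but fails the last one.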
