CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents
Tianqi Xu, Linyao Chen, Dai-Jie Wu, Yanjun Chen, Zecheng Zhang, Xiang Yao, Zhiqiang Xie, Yongchao Chen, Shilong Liu, Bochen Qian, Anjie Yang, Zhaoxuan Jin, Jianbo Deng, Philip Torr, Bernard Ghanem, Guohao Li
- 🏛 Institutions
- KAUST, Eigent.AI, CAMEL-AI.org, UTokyo, CMU, Stanford, Harvard, Tsinghua, SUSTech, Oxford, NU
- 📅 Date
- July 1, 2024
- 📑 Publisher
- Findings of ACL 2025
- 💻 Env
- Desktop, Mobile
- 🔑 Keywords
TLDR
CRAB is a benchmark framework for multimodal language model agents that supports cross-environment tasks and graph-based fine-grained evaluation, rather than single-platform, end-state-only scoring. Its CRAB Benchmark-v0 release contains 120 tasks across desktop and mobile environments, and the paper reports a best completion ratio of 38.01%, achieved by a single GPT-4o agent.
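The graph-based evaluation idea can be illustrated with a minimal sketch: a task is decomposed into a DAG of sub-goal checkers, and the completion ratio is the fraction of sub-goals satisfied in dependency order. All names and the environment-state dictionary below are illustrative assumptions, not CRAB's actual API.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical sketch of graph-based fine-grained evaluation:
# a task is a DAG of sub-goal checker nodes, and a node counts as
# completed only if its check passes AND all predecessors completed.

@dataclass
class Node:
    name: str
    check: Callable[[], bool]                 # True if the sub-goal is met
    predecessors: List[str] = field(default_factory=list)

def completion_ratio(nodes: List[Node]) -> float:
    """Fraction of DAG nodes completed, respecting dependency order.
    Nodes are assumed to be given in topological order."""
    done = set()
    for node in nodes:
        if all(p in done for p in node.predecessors) and node.check():
            done.add(node.name)
    return len(done) / len(nodes)

# Example: a 3-step task where the final step fails, giving 2/3
# partial credit instead of a binary end-state score of 0.
env = {"file_created": True, "file_renamed": True, "file_uploaded": False}
task = [
    Node("create", lambda: env["file_created"]),
    Node("rename", lambda: env["file_renamed"], ["create"]),
    Node("upload", lambda: env["file_uploaded"], ["rename"]),
]
print(round(completion_ratio(task), 2))  # 0.67
```

This is the key contrast with end-state scoring: a binary success metric would give this trajectory 0, while the graph decomposition credits the two completed sub-goals.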
Related papers
- PIRA-Bench: A Transition from Reactive GUI Agents to GUI-based Proactive Intent Recommendation Agents · March 9, 2026 · arXiv
- VenusBench-GD: A Comprehensive Multi-Platform GUI Benchmark for Diverse Grounding Tasks · December 18, 2025 · arXiv
- OS-Oracle: A Comprehensive Framework for Cross-Platform GUI Critic Models · December 18, 2025 · arXiv
- NaturalGAIA: Pushing the Frontiers of GUI Agents with a Challenging Benchmark and High-Quality Trajectory Dataset · August 2, 2025 · arXiv
- On the Robustness of GUI Grounding Models Against Image Attacks · April 7, 2025 · arXiv
- GUI-World: A Video Benchmark and Dataset for Multimodal GUI-oriented Understanding · June 16, 2024 · ICLR 2025 (Poster)