CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents
Tianqi Xu , Linyao Chen , Dai-Jie Wu , Yanjun Chen , Zecheng Zhang , Xiang Yao , Zhiqiang Xie , Yongchao Chen , Shilong Liu , Bochen Qian , Anjie Yang , Zhaoxuan Jin , Jianbo Deng , Philip Torr , Bernard Ghanem , Guohao Li
- 🏛 Institutions
- KAUST , Eigent.AI , CAMEL-AI.org , UTokyo , CMU , Stanford , Harvard , Tsinghua , SUSTech , Oxford , NU
- 📅 Date
- July 1, 2024
- 📑 Publisher
- Findings of ACL 2025
- 💻 Env
- Desktop Mobile
- 🔑 Keywords
TLDR
CRAB is a benchmark framework for multimodal agents that supports cross-environment tasks and graph-based fine-grained evaluation instead of single-platform end-state scoring. Its CRAB Benchmark-v0 release contains 120 desktop and mobile tasks, and the paper reports a best completion ratio of 38.01% from a single GPT-4o agent.
Related papers (24)
- PIRA-Bench: A Transition from Reactive GUI Agents to GUI-based Proactive Intent Recommendation AgentsMarch 9, 2026 · arXiv
- VenusBench-GD: A Comprehensive Multi-Platform GUI Benchmark for Diverse Grounding TasksDecember 18, 2025 · arXiv
- OS-Oracle: A Comprehensive Framework for Cross-Platform GUI Critic ModelsDecember 18, 2025 · arXiv
- NaturalGAIA: Pushing the Frontiers of GUI Agents with a Challenging Benchmark and High-Quality Trajectory DatasetAugust 2, 2025 · arXiv
- On the Robustness of GUI Grounding Models Against Image AttacksApril 7, 2025 · arXiv
- GUI-World: A Video Benchmark and Dataset for Multimodal GUI-oriented UnderstandingJune 16, 2024 · ICLR 2025 (Poster)
- SeeClick: Harnessing GUI Grounding for Advanced Visual GUI AgentsJanuary 17, 2024 · ACL 2024
- Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional FieldsJune 9, 2026 · arXiv
- Benchmarking Living-Screen-Native GUI Agents on Short-Video PlatformsJune 3, 2026 · arXiv
- AndroidDaily: A Verifiable Benchmark for Mobile GUI Agents on Real-World Closed-Source ApplicationsMay 26, 2026 · arXiv
- MobileGym: A Verifiable and Highly Parallel Simulation Platform for Mobile GUI Agent ResearchMay 25, 2026 · arXiv
- SimuWoB: Simulating Real-World Mobile Apps for Fast and Faithful GUI Agent BenchmarkingMay 24, 2026 · arXiv
- WindowsWorld: A Process-Centric Benchmark of Autonomous GUI Agents in Professional Cross-Application EnvironmentsApril 30, 2026 · arXiv
- The Blind Spot of Agent Safety: How Benign User Instructions Expose Critical Vulnerabilities in Computer-Use AgentsApril 12, 2026 · arXiv
- HealthAdminBench: Evaluating Computer-Use Agents on Healthcare Administration TasksApril 10, 2026 · arXiv
- CORA: Conformal Risk-Controlled Agents for Safeguarded Mobile GUI AutomationApril 10, 2026 · arXiv
- KnowU-Bench: Towards Interactive, Proactive, and Personalized Mobile Agent EvaluationApril 9, 2026 · arXiv
- Gym-Anything: Turn any Software into an Agent EnvironmentApril 7, 2026 · arXiv
- Don't Act Blindly: Robust GUI Automation via Action-Effect Verification and Self-CorrectionApril 7, 2026 · ACL 2026
- Proactive Agent Research Environment: Simulating Active Users to Evaluate Proactive AssistantsApril 1, 2026 · arXiv
- HippoCamp: Benchmarking Contextual Agents on Personal ComputersApril 1, 2026 · arXiv
- PSPA-Bench: A Personalized Benchmark for Smartphone GUI AgentMarch 31, 2026 · arXiv
- AndroTMem: From Interaction Trajectories to Anchored Memory in Long-Horizon GUI AgentsMarch 19, 2026 · arXiv
- GUI-CEval: A Hierarchical and Comprehensive Chinese Benchmark for Mobile GUI AgentsMarch 16, 2026 · CVPR 2026