CocoaBench: Evaluating Unified Digital Agents in the Wild
Shibo Hao, Zhining Zhang, Zhiqi Liang, Tianyang Liu, Yuheng Zha, Qiyue Gao, Jixuan Chen, Zilong Wang, Zhoujun Cheng, Haoxiang Zhang, Junli Wang, Hexi Jin, Boyuan Zheng, Kun Zhou, Yu Wang, Feng Yao, Licheng Liu, Yijiang Li, Zhifei Li, Zhengtao Han, Pracha Promthaw, Tommaso Cerruti, Xiaohan Fu, Ziqiao Ma, Jingbo Shang, Lianhui Qin, Julian McAuley, Eric P. Xing, Zhengzhong Liu, Rupesh Kumar Srivastava, Zhiting Hu
- 🏛 Institutions
- UC San Diego, MBZUAI, University of Michigan, UC Berkeley, ETH, University of Cambridge, Gray Swan AI
- 📅 Date
- April 13, 2026
- 📑 Publisher
- arXiv
- 💻 Env
- General GUI
- 🔑 Keywords
CocoaBench evaluates unified digital agents on long-horizon tasks that require flexible composition of vision, search, and coding. Each task is specified by an instruction and an automatic evaluation function, which enables reliable, scalable evaluation across agent infrastructures. The best evaluated system reaches only 45.1%, exposing gaps in reasoning, tool use, and visual grounding.
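The task format described above (an instruction paired with an automatic evaluation function) can be illustrated with a minimal Python sketch. This is not CocoaBench's actual code or API: the `Task` class, the `success_rate` helper, the toy tasks, and the toy agent are all hypothetical names introduced here, and treating the aggregate score as a simple pass rate is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Task:
    """Hypothetical task record: an instruction plus an automatic checker."""
    instruction: str
    evaluate: Callable[[str], bool]  # returns True if the agent's answer passes


def success_rate(tasks: Sequence[Task], agent: Callable[[str], str]) -> float:
    """Run the agent on every task and return the fraction that pass.

    The agent is treated as a black box mapping an instruction to a final
    answer, so the same tasks can score any agent infrastructure.
    """
    if not tasks:
        return 0.0
    passed = sum(task.evaluate(agent(task.instruction)) for task in tasks)
    return passed / len(tasks)


# Toy tasks, for illustration only (not taken from the benchmark).
toy_tasks = [
    Task("Compute 2 + 3 and reply with the number only.",
         lambda answer: answer.strip() == "5"),
    Task("Name the capital of France.",
         lambda answer: "paris" in answer.lower()),
]


def toy_agent(instruction: str) -> str:
    """Stand-in agent that always answers '5'."""
    return "5"


rate = success_rate(toy_tasks, toy_agent)
```

Because the checker only sees the final answer, the evaluation does not depend on how the agent reached it, which is one way an instruction-plus-function format can stay agnostic to the agent's tooling.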
- LongHorizonUI: A Unified Framework for Robust long-horizon Task Automation of GUI Agent · January 26, 2026 · ICLR 2026 (Poster)
- Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional Fields · June 9, 2026 · arXiv
- SimuWoB: Simulating Real-World Mobile Apps for Fast and Faithful GUI Agent Benchmarking · May 24, 2026 · arXiv
- HealthAdminBench: Evaluating Computer-Use Agents on Healthcare Administration Tasks · April 10, 2026 · arXiv
- ClawBench: Can AI Agents Complete Everyday Online Tasks? · April 9, 2026 · arXiv
- Gym-Anything: Turn any Software into an Agent Environment · April 7, 2026 · arXiv
- AndroTMem: From Interaction Trajectories to Anchored Memory in Long-Horizon GUI Agents · March 19, 2026 · arXiv
- OS-Marathon: Benchmarking Computer-Use Agents on Long-Horizon Repetitive Tasks · January 28, 2026 · arXiv
- MobileBench-OL: A Comprehensive Chinese Benchmark for Evaluating Mobile GUI Agents in Real-World Environment · January 28, 2026 · arXiv
- MobileWorld: Benchmarking Autonomous Mobile Agents in Agent-User Interactive and MCP-Augmented Environments · December 22, 2025 · arXiv
- AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks? · October 21, 2024 · EMNLP 2024 (Poster)
- MementoGUI: Learning Agentic Multimodal Memory Control for Long-Horizon GUI Agents · May 18, 2026 · arXiv
- AutoGUI-v2: A Comprehensive Multi-Modal GUI Functionality Understanding Benchmark · April 27, 2026 · arXiv
- GUI-Perturbed: Domain Randomization Reveals Systematic Brittleness in GUI Grounding Models · April 15, 2026 · arXiv
- Are GUI Agents Focused Enough? Automated Distraction via Semantic-level UI Element Injection · April 9, 2026 · arXiv
- What's Missing in Screen-to-Action? Towards a UI-in-the-Loop Paradigm for Multimodal GUI Reasoning · April 8, 2026 · Findings of ACL 2026
- GUIDE: Interpretable GUI Agent Evaluation via Hierarchical Diagnosis · April 6, 2026 · arXiv
- GUIDE: A Benchmark for Understanding and Assisting Users in Open-Ended GUI Tasks · March 26, 2026 · CVPR 2026
- See, Plan, Snap: Evaluating Multimodal GUI Agents in Scratch · February 11, 2026 · arXiv
- LPS-Bench: Benchmarking Safety Awareness of Computer-Use Agents in Long-Horizon Planning under Benign and Adversarial Scenarios · February 3, 2026 · arXiv
- ToolTok: Tool Tokenization for Efficient and Generalizable GUI Agents · January 30, 2026 · arXiv
- GUIGuard: Toward a General Framework for Privacy-Preserving GUI Agents · January 26, 2026 · arXiv
- iSHIFT: Lightweight Slow-Fast GUI Agent with Adaptive Perception · December 26, 2025 · arXiv
- DrawingBench: Evaluating Spatial Reasoning and UI Interaction Capabilities of Large Language Models through Mouse-Based Drawing Tasks · December 1, 2025 · AAAI 2026 TrustAgent Workshop