SCUBA: Salesforce Computer Use Benchmark
Yutong Dai , Krithika Ramakrishnan , Jing Gu , Matthew Fernandez , Yanqi Luo , Viraj Prabhu , Zhenyu Hu , Silvio Savarese , Caiming Xiong , Zeyuan Chen , Ran Xu
- 🏛 Institutions
- Salesforce AI Research
- 📅 Date
- September 30, 2025
- 📑 Publisher
- ICLR 2026 (Poster)
- 💻 Env
- General GUI
- 🔑 Keywords
TLDR
SCUBA is a benchmark for computer-use agents on Salesforce customer-relationship-management workflows, with 300 task instances derived from real user interviews across administrator, sales, and service personas. It runs in Salesforce sandbox environments with interpretable milestone evaluation and shows that enterprise tasks remain much harder than standard CUA benchmarks, especially for open models in zero-shot settings.
Related papers (24)
- AutoGUI-v2: A Comprehensive Multi-Modal GUI Functionality Understanding BenchmarkApril 27, 2026 · arXiv
- GUI-Perturbed: Domain Randomization Reveals Systematic Brittleness in GUI Grounding ModelsApril 15, 2026 · arXiv
- CocoaBench: Evaluating Unified Digital Agents in the WildApril 13, 2026 · arXiv
- What's Missing in Screen-to-Action? Towards a UI-in-the-Loop Paradigm for Multimodal GUI ReasoningApril 8, 2026 · Findings of ACL 2026
- GUIDE: Interpretable GUI Agent Evaluation via Hierarchical DiagnosisApril 6, 2026 · arXiv
- GUIDE: A Benchmark for Understanding and Assisting Users in Open-Ended GUI TasksMarch 26, 2026 · CVPR 2026
- See, Plan, Snap: Evaluating Multimodal GUI Agents in ScratchFebruary 11, 2026 · arXiv
- LPS-Bench: Benchmarking Safety Awareness of Computer-Use Agents in Long-Horizon Planning under Benign and Adversarial ScenariosFebruary 3, 2026 · arXiv
- LongHorizonUI: A Unified Framework for Robust long-horizon Task Automation of GUI AgentJanuary 26, 2026 · ICLR 2026 (Poster)
- GUIGuard: Toward a General Framework for Privacy-Preserving GUI AgentsJanuary 26, 2026 · arXiv
- DrawingBench: Evaluating Spatial Reasoning and UI Interaction Capabilities of Large Language Models through Mouse-Based Drawing TasksDecember 1, 2025 · AAAI 2026 TrustAgent Workshop
- MPR-GUI: Benchmarking and Enhancing Multilingual Perception and Reasoning in GUI AgentsNovember 30, 2025 · arXiv
- Beyond Clicking: A Step Towards Generalist GUI Grounding via Text DraggingNovember 7, 2025 · arXiv
- Scaling Computer‑Use Grounding via User Interface Decomposition and SynthesisMay 19, 2025 · NeurIPS 2025 Datasets and Benchmarks Track (Spotlight)
- UI-E2I-Synth: Advancing GUI Grounding with Large-Scale Instruction SynthesisApril 15, 2025 · Findings of ACL 2025
- Read Anywhere Pointed: Layout-aware GUI Screen Reading with Tree-of-Lens GroundingJune 27, 2024 · EMNLP 2024 (Poster)
- SheetCopilot: Bringing Software Productivity to the Next Level through Large Language ModelsMay 30, 2023 · NeurIPS 2023
- Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional FieldsJune 9, 2026 · arXiv
- Benchmarking Living-Screen-Native GUI Agents on Short-Video PlatformsJune 3, 2026 · arXiv
- AndroidDaily: A Verifiable Benchmark for Mobile GUI Agents on Real-World Closed-Source ApplicationsMay 26, 2026 · arXiv
- MobileGym: A Verifiable and Highly Parallel Simulation Platform for Mobile GUI Agent ResearchMay 25, 2026 · arXiv
- SimuWoB: Simulating Real-World Mobile Apps for Fast and Faithful GUI Agent BenchmarkingMay 24, 2026 · arXiv
- WindowsWorld: A Process-Centric Benchmark of Autonomous GUI Agents in Professional Cross-Application EnvironmentsApril 30, 2026 · arXiv
- Odysseys: Benchmarking Web Agents on Realistic Long Horizon TasksApril 27, 2026 · arXiv