WindowsWorld: A Process-Centric Benchmark of Autonomous GUI Agents in Professional Cross-Application Environments
Jinchao Li, Yunxin Li, Chenrui Zhao, Zhenran Xu, Baotian Hu, Min Zhang
- 🏛 Institutions
- HIT-Shenzhen
- 📅 Date
- April 30, 2026
- 📑 Publisher
- arXiv
- 💻 Env
- Desktop
- 🔑 Keywords
TLDR
WindowsWorld targets the gap that existing GUI benchmarks focus on isolated single-application tasks, presenting a process-centric suite of 181 cross-application desktop tasks (avg 5.0 sub-goals across 17 applications, 78% multi-application). Evaluated computer-use agents fall below 21% success on multi-application tasks, substantially trailing single-application performance and exposing weak workflow-level coordination.
Related papers
- GUI-360: A Comprehensive Dataset and Benchmark for Computer-Using AgentsNovember 6, 2025 · arXiv
- The Blind Spot of Agent Safety: How Benign User Instructions Expose Critical Vulnerabilities in Computer-Use AgentsApril 12, 2026 · arXiv
- HealthAdminBench: Evaluating Computer-Use Agents on Healthcare Administration TasksApril 10, 2026 · arXiv
- Gym-Anything: Turn any Software into an Agent EnvironmentApril 7, 2026 · arXiv
- HippoCamp: Benchmarking Contextual Agents on Personal ComputersApril 1, 2026 · arXiv
- PIRA-Bench: A Transition from Reactive GUI Agents to GUI-based Proactive Intent Recommendation AgentsMarch 9, 2026 · arXiv