Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional Fields
Liya Zhu, Jingzhe Ding, Jian Zhang, Jianbo Xue, Shihao Liang, Ge Zhang, Yi Zhu, Duju Zeng, Xiang Gao, Qingshui Gu, Mailun Gao, Huimin Che, Yan Zhao, Peiheng Zhou, Haojun Wang, Chaobo Xian, Lili Le, Chi Wu, Yiwei Liu, Shengda Long, Jiale Yang, Fangzhi Xu, Sijin Wu, Haodong Duan, Chao He, Zhaojian Li, Minchao Wang, Huan Zhou, Jiani Hou, Chuqian Yu, Weiran Shi, Hongwan Gao, Jiamin Chen, Guanhong Chen, Tingqin Luo, Kaiyuan Zhang, Zhixin Yao, Qing Hua, Yuhao Jiang, Jin Chen, Pu Chen, Zhenyu Hu, Xingyu Li, Zhengxuan Jiang, Meng Cao, Tianfeng Long, Haozhe Wang, Mingzhang Wang, Yichen Zhang, Yiming Dai, Chenchen Zhang, Jiaying Wang, Xinying Liu, Xingzu Liu, Lingling Zhang, Xinjie Chen, Yujia Qin, Wangchunshu Zhou, Zhiyong Wu, Yang Liu, Jiaheng Liu, Lei Zhang, Shen Yan, Wenhao Huang, Zaiyuan Wang, Xiaolong Chang
- 🏛 Institutions
- Unknown
- 📅 Date
- June 9, 2026
- 📑 Publisher
- arXiv
- 💻 Env
- Desktop
- 🔑 Keywords
Workflow-GYM is a benchmark for long-horizon computer-use tasks in professional software environments. It evaluates whether agents can complete domain-specific workflows through GUIs and reports that current state-of-the-art models still struggle with end-to-end professional tasks.
- HealthAdminBench: Evaluating Computer-Use Agents on Healthcare Administration Tasks · April 10, 2026 · arXiv
- Gym-Anything: Turn any Software into an Agent Environment · April 7, 2026 · arXiv
- OS-Marathon: Benchmarking Computer-Use Agents on Long-Horizon Repetitive Tasks · January 28, 2026 · arXiv
- SimuWoB: Simulating Real-World Mobile Apps for Fast and Faithful GUI Agent Benchmarking · May 24, 2026 · arXiv
- CocoaBench: Evaluating Unified Digital Agents in the Wild · April 13, 2026 · arXiv
- ClawBench: Can AI Agents Complete Everyday Online Tasks? · April 9, 2026 · arXiv
- AndroTMem: From Interaction Trajectories to Anchored Memory in Long-Horizon GUI Agents · March 19, 2026 · arXiv
- MobileBench-OL: A Comprehensive Chinese Benchmark for Evaluating Mobile GUI Agents in Real-World Environment · January 28, 2026 · arXiv
- LongHorizonUI: A Unified Framework for Robust long-horizon Task Automation of GUI Agent · January 26, 2026 · ICLR 2026 (Poster)
- MobileWorld: Benchmarking Autonomous Mobile Agents in Agent-User Interactive and MCP-Augmented Environments · December 22, 2025 · arXiv
- AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks? · October 21, 2024 · EMNLP 2024 (Poster)
- WindowsWorld: A Process-Centric Benchmark of Autonomous GUI Agents in Professional Cross-Application Environments · April 30, 2026 · arXiv
- The Blind Spot of Agent Safety: How Benign User Instructions Expose Critical Vulnerabilities in Computer-Use Agents · April 12, 2026 · arXiv
- GPA: Learning GUI Process Automation from Demonstrations · April 2, 2026 · arXiv
- HippoCamp: Benchmarking Contextual Agents on Personal Computers · April 1, 2026 · arXiv
- PIRA-Bench: A Transition from Reactive GUI Agents to GUI-based Proactive Intent Recommendation Agents · March 9, 2026 · arXiv
- OSExpert: Computer-Use Agents Learning Professional Skills via Exploration · March 9, 2026 · arXiv
- When Benign Inputs Lead to Severe Harms: Eliciting Unsafe Unintended Behaviors of Computer-Use Agents · February 9, 2026 · arXiv
- When Actions Go Off-Task: Detecting and Correcting Misaligned Actions in Computer-Use Agents · February 9, 2026 · arXiv
- EntWorld: A Holistic Environment and Benchmark for Verifiable Enterprise GUI Agents · January 25, 2026 · arXiv
- MirrorGuard: Toward Secure Computer-Use Agents via Simulation-to-Real Reasoning Correction · January 19, 2026 · arXiv
- ShowUI-π: Flow-based Generative Models as GUI Dexterous Hands · December 31, 2025 · arXiv
- VenusBench-GD: A Comprehensive Multi-Platform GUI Benchmark for Diverse Grounding Tasks · December 18, 2025 · arXiv
- OS-Oracle: A Comprehensive Framework for Cross-Platform GUI Critic Models · December 18, 2025 · arXiv