GUI Agents Papers

OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Jing Hua Toh, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, Yitao Liu, Yiheng Xu, Shuyan Zhou, Silvio Savarese, Caiming Xiong, Victor Zhong, Tao Yu

🏛 Institutions
HKU, CMU, Salesforce AI Research, University of Waterloo
📅 Date
April 11, 2024
📑 Publisher
NeurIPS 2024 Datasets and Benchmarks Track
💻 Env
Desktop, Web
🔑 Keywords
TLDR

OSWorld provides a real computer environment for multimodal agents that supports Ubuntu, Windows, and macOS, together with a benchmark of 369 open-ended tasks spanning real web apps, desktop apps, OS file I/O, and multi-application workflows. Its execution-based evaluation exposes a large gap between humans and current multimodal agents: the best reported model reaches 12.24% task success, versus 72.36% for humans.
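
To make "execution-based evaluation" concrete, here is a minimal sketch of the general idea: a task is judged by inspecting the state of the environment after the agent finishes, not by comparing its actions to a reference trajectory. This is not OSWorld's actual API. The `Task` class, the checker, both agents, and the file names are hypothetical, and a temporary directory stands in for a real computer environment.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, List


@dataclass
class Task:
    """Hypothetical task: an instruction, an initial-state setup, and a final-state checker."""
    instruction: str
    setup: Callable[[Path], None]   # prepares the workspace before the agent runs
    check: Callable[[Path], bool]   # inspects the workspace after the agent finishes


def create_draft(workspace: Path) -> None:
    (workspace / "report.txt").write_text("draft")


def renamed_report(workspace: Path) -> bool:
    # Success depends only on the resulting files, not on how the agent got there.
    return (workspace / "report_final.txt").exists() and not (workspace / "report.txt").exists()


def scripted_agent(instruction: str, workspace: Path) -> None:
    # Stand-in for a multimodal agent: it performs the rename directly.
    (workspace / "report.txt").rename(workspace / "report_final.txt")


def idle_agent(instruction: str, workspace: Path) -> None:
    # An agent that takes no action, so the checker should fail.
    pass


def success_rate(tasks: List[Task], agent: Callable[[str, Path], None]) -> float:
    """Fraction of tasks whose final-state check passes."""
    passed = 0
    for task in tasks:
        with tempfile.TemporaryDirectory() as tmp:
            workspace = Path(tmp)
            task.setup(workspace)
            agent(task.instruction, workspace)
            passed += int(task.check(workspace))
    return passed / len(tasks)


tasks = [Task("Rename report.txt to report_final.txt", create_draft, renamed_report)]
print(success_rate(tasks, scripted_agent))
print(success_rate(tasks, idle_agent))
```

Because the checker only looks at the final state, any sequence of actions that produces the right outcome counts as a success, which is what lets a benchmark of this kind score open-ended tasks.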
