OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Jing Hua Toh, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, Yitao Liu, Yiheng Xu, Shuyan Zhou, Silvio Savarese, Caiming Xiong, Victor Zhong, Tao Yu
- 🏛 Institutions
- HKU, CMU, Salesforce AI Research, University of Waterloo
- 📅 Date
- April 11, 2024
- 📑 Publisher
- NeurIPS 2024 Datasets and Benchmarks Track
- 💻 Env
- Desktop, Web
- 🔑 Keywords
TLDR
OSWorld provides a real computer environment and benchmark for open-ended tasks across Ubuntu, Windows, and macOS, with 369 tasks spanning real web apps, desktop apps, OS-level file I/O, and multi-application workflows. Its execution-based evaluation exposes a large gap between humans and current multimodal agents: the best reported model reaches 12.24% task success versus 72.36% for humans.
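The execution-based evaluation mentioned above scores a task by inspecting the final machine state rather than the agent's action trace. A minimal sketch of that idea (not OSWorld's actual API; the function and file names here are illustrative assumptions):

```python
import pathlib
import tempfile


def evaluate_task(result_path: pathlib.Path, expected: str) -> bool:
    """Execution-based check: the task succeeds only if the target
    file exists and holds the expected content, regardless of which
    sequence of clicks or keystrokes produced it."""
    return result_path.exists() and result_path.read_text().strip() == expected


# Simulate an agent that was asked to write "done" into report.txt.
workdir = pathlib.Path(tempfile.mkdtemp())
target = workdir / "report.txt"
target.write_text("done\n")

print(evaluate_task(target, "done"))   # state matches the goal
print(evaluate_task(target, "draft"))  # same actions, wrong outcome
```

Because the check is a script over the resulting state, any valid solution path counts as success, which is what lets OSWorld evaluate open-ended tasks at scale.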
Related papers
- OfficeBench: Benchmarking Language Agents across Multiple Applications for Office Automation · July 26, 2024 · arXiv
- VenusBench-GD: A Comprehensive Multi-Platform GUI Benchmark for Diverse Grounding Tasks · December 18, 2025 · arXiv
- OS-Oracle: A Comprehensive Framework for Cross-Platform GUI Critic Models · December 18, 2025 · arXiv
- Evolving in Tasks: Empowering the Multi-modality Large Language Model as the Computer Use Agent · August 6, 2025 · arXiv
- RiOSWorld: Benchmarking the Risk of Multimodal Computer-Use Agents · May 31, 2025 · NeurIPS 2025 (Poster)
- RedTeamCUA: Realistic Adversarial Testing of Computer-Use Agents in Hybrid Web-OS Environments · May 28, 2025 · ICLR 2026 (Oral)