OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Jing Hua Toh, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, Yitao Liu, Yiheng Xu, Shuyan Zhou, Silvio Savarese, Caiming Xiong, Victor Zhong, Tao Yu

🏛 Institutions: HKU, CMU, Salesforce AI Research, University of Waterloo
📅 Date: April 11, 2024
📑 Publisher: NeurIPS 2024 Datasets and Benchmarks Track
💻 Env: Desktop, Web
🔑 Keywords: benchmark, OSWorld, real computer environment, execution-based evaluation, multi-app workflows

TLDR

OSWorld provides a real computer environment and benchmark for open-ended tasks across Ubuntu, Windows, and macOS, with 369 tasks spanning real web apps, desktop apps, file I/O, and multi-application workflows. Its execution-based evaluation setup exposes a large gap between humans and current multimodal agents, with the best reported model reaching 12.24% task success versus 72.36% for humans.
