GUI Agents Papers

OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Jing Hua Toh, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, Yitao Liu, Yiheng Xu, Shuyan Zhou, Silvio Savarese, Caiming Xiong, Victor Zhong, Tao Yu

🏛 Institutions
HKU, CMU, Salesforce AI Research, University of Waterloo
📅 Date
April 11, 2024
📑 Publisher
NeurIPS 2024 Datasets and Benchmarks Track
💻 Env
Desktop, Web
🔑 Keywords
TLDR

OSWorld provides a real computer environment and benchmark for open-ended tasks across Ubuntu, Windows, and macOS, with 369 tasks spanning real web apps, desktop apps, file I/O, and multi-application workflows. Its execution-based evaluation setup exposes a large gap between humans and current multimodal agents, with the best reported model reaching 12.24% task success versus 72.36% for humans.

Related papers (24)