macOSWorld: A Multilingual Interactive Benchmark for GUI Agents

🏛 Institutions: Show Lab, NUS
📅 Date: June 4, 2025
📑 Publisher: NeurIPS 2025 (Poster)
💻 Env: Desktop
🔑 Keywords: benchmark, multilingual, safety, macOSWorld

TLDR

macOSWorld is the first interactive benchmark for GUI agents on macOS. It comprises 202 multilingual tasks across 30 applications, plus a dedicated safety subset that tests deception attacks. The evaluation shows large performance gaps between proprietary and open-source agents, substantial degradation in multilingual settings, and unresolved safety weaknesses, particularly on macOS-specific workflows.
