OfficeBench: Benchmarking Language Agents across Multiple Applications for Office Automation

Zilong Wang, Yuedong Cui, Li Zhong, Zimin Zhang, Da Yin, Bill Yuchen Lin, Jingbo Shang

🏛 Institutions: UC San Diego, UCLA, Allen Institute for AI
📅 Date: July 26, 2024
📑 Publisher: arXiv
💻 Env: Desktop
🔑 Keywords: benchmark, office automation, multi-application workflows, application switching, execution-based evaluation, OfficeBench

TLDR

OfficeBench is a benchmark for office automation tasks that require agents to plan across multiple applications, switch between them correctly, and ground their actions in a large combined action space. The paper reports a pass rate of only 47% for GPT-4 Omni and identifies redundancy, hallucination, and application-switching errors as core failure modes.
