ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows

Qiushi Sun, Zhoumianze Liu, Chang Ma, Zichen Ding, Fangzhi Xu, Zhangyue Yin, Haiteng Zhao, Zhenyu Wu, Kanzhi Cheng, Zhaoyang Liu, Jianing Wang, Qintong Li, Xiangru Tang, Tianbao Xie, Xiachong Feng, Xiang Li, Ben Kao, Wenhai Wang, Biqing Qi, Lingpeng Kong, Zhiyong Wu

🏛 Institutions: HKU, Shanghai AI Laboratory, Fudan, PKU, NJU, East China Normal University, Yale University
📅 Date: May 26, 2025
📑 Publisher: ICLR 2026 (Poster)
💻 Env: Desktop
🔑 Keywords: benchmark, environment, scientific workflows, scientific discovery, ScienceBoard

TLDR

ScienceBoard introduces a realistic scientific environment and a 169-task benchmark spanning six domains with integrated professional software and mixed-interface workflows. Evaluations with current multimodal agents reach only about 15% overall success, showing that autonomous scientific assistance remains far from reliable.
