Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows?

Ruisheng Cao, Fangyu Lei, Haoyuan Wu, Jixuan Chen, Yeqiao Fu, Hongcheng Gao, Xinzhuang Xiong, Hanchong Zhang, Yuchen Mao, Wenjing Hu, Tianbao Xie, Hongsheng Xu, Danyang Zhang, Sida Wang, Ruoxi Sun, Pengcheng Yin, Caiming Xiong, Ansong Ni, Qian Liu, Victor Zhong, Lu Chen, Kai Yu, Tao Yu

🏛 Institutions: HKU, SJTU, Google Cloud AI Research, Google DeepMind, Salesforce AI Research, Yale University, Sea AI Lab, University of Waterloo
📅 Date: July 15, 2024
📑 Publisher: NeurIPS 2024 Datasets and Benchmarks Track (Poster)
💻 Env: Desktop
🔑 Keywords: benchmark, dataset, enterprise data software, code and GUI, data workflows, Spider2-V

TLDR

Spider2-V is a benchmark for automating professional data science and engineering workflows that require both code generation and GUI control in enterprise software. It contains 494 real-world tasks across 20 applications, and its evaluation finds that current multimodal agents cannot yet reliably complete full workflows, perform fine-grained GUI actions, or operate in remote cloud-hosted workspaces.
