ScienceAgentBench

Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery

Ziru Chen*1, Shijie Chen*1, Yuting Ning1, Qianheng Zhang3, Boshi Wang1, Botao Yu1, Yifei Li1, Zeyi Liao1, Chen Wei3, Zitong Lu4, Vishal Dey1, Mingyi Xue5, Frazier N. Baker1,6, Benjamin Burns1, Daniel Adu-Ampratwum2, Xuhui Huang5, Xia Ning1,2,6, Song Gao3, Yu Su1, Huan Sun*1

1Department of Computer Science and Engineering, OSU 2College of Pharmacy, OSU
3Department of Geography, UW-Madison 4Department of Psychology, OSU
5Department of Chemistry, UW-Madison 6Department of Biomedical Informatics, OSU
* Correspondence to chen.8336@osu.edu, chen.10216@osu.edu, sun.397@osu.edu

Task demos of ScienceAgentBench built with OpenHands.

Bioinformatics: Train a cell counting model on the BBBC002 dataset.

Computational Chemistry: Train a multitask model on the Clintox dataset to predict a drug's toxicity and FDA approval status.

Geographical Information Science: Analyze and visualize Elk movement by estimating home ranges and assessing habitat preferences with spatial analysis techniques.

Psychology & Cognitive Neuroscience: Perform RRV analysis by cleaning the RSP signal and extracting its inhalation peaks & respiratory rate.

Overview

The advancement of large language models (LLMs) has spurred growing interest in developing LLM-based language agents to automate scientific discovery end-to-end, sparking both excitement and skepticism about the true capabilities of such agents. In this work, we argue that for an agent to fully automate scientific discovery, it must be able to complete all essential tasks in the workflow. Thus, we call for rigorous assessment of agents on individual tasks in a scientific workflow before making bold claims about end-to-end automation. To this end, we present ScienceAgentBench, a new benchmark for evaluating language agents for data-driven scientific discovery:

  • To ensure the scientific authenticity and real-world relevance of our benchmark, we extract 102 tasks from 44 peer-reviewed publications in four disciplines and engage nine subject matter experts to validate them.
  • We unify the target output for every task into a self-contained Python program file and employ an array of evaluation metrics to examine the generated programs, execution results, and costs.
  • Each task goes through multiple rounds of manual validation by annotators and subject matter experts to ensure its annotation quality and scientific plausibility. We also propose two effective strategies to mitigate data contamination concerns.



Overview of ScienceAgentBench. Top: Distribution of sub-tasks in ScienceAgentBench. Each task in our benchmark consists of one or more of these sub-tasks and requires successful completion of all sub-tasks to achieve the task goal. Bottom: Heterogeneous datasets involved: (a) a cell image in Bioinformatics, (b) a molecular activity visualization in Computational Chemistry, (c) a flooding risk map in Geographical Information Science, and (d) an EEG time series in Psychology and Cognitive Neuroscience.


Tasks in ScienceAgentBench

ScienceAgentBench aims to evaluate agents on essential tasks in a data-driven discovery workflow. Before automating the entire workflow end-to-end, we envision language agents first serving as science co-pilots that can write code to process, analyze, and visualize data. Similar to co-pilots for software development, we target scientist users who may know how to write such code but want to save hours of programming effort with language agents. Hence, we formulate each task as a code generation problem, whose output is easily verifiable and directly usable by a scientist without additional modification effort. Given a natural language instruction, a dataset, and optional expert-provided knowledge, an agent shall generate a program to complete the assigned task and save it as a Python source code file. Each instance in our benchmark contains four components (a minimal sketch of how these components fit together is shown after the list below):

  1. Task Instruction, which describes the goal of an essential task in data-driven discovery and its output requirements.
  2. Dataset Information, which contains the dataset's directory structure and a preview of its content.
  3. Expert-Provided Knowledge, which includes explanations for scientific terms, formulas to conduct analysis, and example usages of programming tools.
  4. Annotated Program, which is adapted from an open-source code repository released by a peer-reviewed scientific publication.
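
Below is a minimal, hypothetical sketch of how these four components could be represented and assembled into an agent prompt. The field names, file paths, and prompt wording are illustrative assumptions rather than the benchmark's released data schema; the task text paraphrases the Computational Chemistry demo above, and the annotated program is held out of the prompt because it serves as the reference solution.

# Illustrative sketch only: field names and paths are hypothetical and do
# not necessarily match the released ScienceAgentBench data format.
task_instance = {
    "task_instruction": (
        "Train a multitask model on the Clintox dataset to predict a drug's "
        "toxicity and FDA approval status. Save the test-set metrics to "
        "results/clintox_metrics.csv."
    ),
    "dataset_info": {
        "directory": "benchmark/datasets/clintox/",
        "preview": "smiles,FDA_APPROVED,CT_TOX\nCC(=O)Oc1ccccc1C(=O)O,1,0\n...",
    },
    "expert_knowledge": (
        "Clintox is a binary multitask classification dataset; "
        "ROC-AUC is a standard evaluation metric for it."
    ),
    # Reference solution adapted from the publication's repository;
    # used only for evaluation, never shown to the agent.
    "annotated_program": "benchmark/gold_programs/clintox_multitask.py",
}

def build_prompt(instance: dict) -> str:
    """Concatenate the agent-visible components into a single prompt."""
    return (
        f"Instruction:\n{instance['task_instruction']}\n\n"
        f"Dataset ({instance['dataset_info']['directory']}):\n"
        f"{instance['dataset_info']['preview']}\n\n"
        f"Domain knowledge:\n{instance['expert_knowledge']}\n\n"
        "Write a self-contained Python program that completes the task "
        "and save it as pred_program.py."
    )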

Example tasks in ScienceAgentBench.


Evaluation

We comprehensively evaluate each generated program with four metrics (a simplified sketch of the execution and scoring flow follows this list):

  • Valid Execution Rate (VER) checks if the program can execute without errors and save its output with the correct file name.
  • Success Rate (SR) examines whether a program's output meets the success criteria for each task goal, such as test set performance, prediction-answer matches, and visualization quality. To automatically check these criteria, we implement them as evaluation programs for each task during annotation.
  • CodeBERTScore (CBS) (Zhou et al., 2023) measures how closely the generated program resembles the annotated one with contextual embeddings and calculates the F1 metric for matched token embeddings.
  • API Cost (Cost) calculates the average cost (in USD) to complete one task in our benchmark, since it is important for language agents to control their cost and optimize their design for better practical utility (Kapoor et al., 2024).
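
Below is a minimal sketch of how VER and SR could be computed for one generated program. It assumes a hypothetical interface for the per-task evaluation script (taking the output path as its only argument and signaling success via its exit code); the released harness differs in details such as per-task environments, and computes CBS and Cost separately.

import subprocess
from pathlib import Path

def evaluate_program(program_path: str, expected_output: str,
                     eval_script: str, timeout: int = 900) -> dict:
    """Score one generated program on VER and SR (simplified sketch)."""
    result = {"valid_execution": False, "success": False}

    # Valid Execution Rate (VER): the program must run without errors
    # and save its output under the expected file name.
    try:
        proc = subprocess.run(["python", program_path],
                              capture_output=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return result
    if proc.returncode != 0 or not Path(expected_output).exists():
        return result
    result["valid_execution"] = True

    # Success Rate (SR): each task has an annotated evaluation program
    # implementing its success criteria (e.g., test-set performance or
    # figure similarity). We assume it exits with code 0 on success.
    eval_proc = subprocess.run(["python", eval_script, expected_output],
                               capture_output=True)
    result["success"] = (eval_proc.returncode == 0)
    return result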


Representative examples of task-specific success criteria in ScienceAgentBench. To keep the table concise, we omit output requirements in the task instructions and show the task goals.


Comparison with Existing Benchmarks

ScienceAgentBench differs from other benchmarks with a unique ensemble of research challenges:

  • Tasks in our benchmark require an agent to generate a standalone program file from scratch, in contrast to JSON API calls in TaskBench, abstract workflow descriptions in DiscoveryBench, or a few lines of code completion or edits in other benchmarks. To do so, an agent needs to have a deep understanding of the task, decompose it into classes and functions appropriately, and implement them.
  • Our benchmark adapts 44 peer-reviewed publications and covers a variety of real-world datasets in four different disciplines. Compared to ML-Bench and DiscoveryBench, ScienceAgentBench includes more heterogeneous datasets that have complex structures, such as cell images, chemical structure-activity relationships, and geographical maps with multiple layers.
  • ScienceAgentBench is also one of only two benchmarks that try to mitigate data contamination and agent shortcut issues, which helps establish valid evaluation.


Comparison of ScienceAgentBench to representative existing benchmarks.

Disclaimer

Our benchmark is constructed by adapting open-source code and data, and we respect their creators' ownership and intellectual property. In Appendix I of our paper, we have made our best effort to cite the original papers, list the repositories, and provide their licenses. Still, we acknowledge that two repositories (rasterio/rasterio and hackingmaterials/matminer) are copyrighted, and we believe their terms of use are compatible with our research purpose. We welcome requests from the original authors to modify or remove tasks related to those two repositories if needed.

Citation

@misc{chen2024scienceagentbenchrigorousassessmentlanguage,
  title={ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery}, 
  author={Ziru Chen and Shijie Chen and Yuting Ning and Qianheng Zhang and Boshi Wang and Botao Yu and Yifei Li and Zeyi Liao and Chen Wei and Zitong Lu and Vishal Dey and Mingyi Xue and Frazier N. Baker and Benjamin Burns and Daniel Adu-Ampratwum and Xuhui Huang and Xia Ning and Song Gao and Yu Su and Huan Sun},
  year={2024},
  eprint={2410.05080},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2410.05080}, 
}