AutoSDT

Scaling Data-Driven Discovery Tasks Toward Open Co-Scientists

Yifei Li1,†*, Hanane Nour Moussa1,†*, Ziru Chen1,†, Shijie Chen1,†, Botao Yu1,†, Mingyi Xue6,◇, Benjamin Burns1,†, Tzu-Yao Chiu3, Vishal Dey1,†, Zitong Lu3,†, Chen Wei5,◇, Qianheng Zhang5,◇, Tianyu Zhang3, Song Gao5,◇, Xuhui Huang6,◇, Xia Ning1,2,4,†, Nesreen K. Ahmed○, Ali Payani○, Huan Sun1,†

1 Department of Computer Science and Engineering, 2 College of Pharmacy, 3 Department of Psychology, 4 Department of Biomedical Informatics, 5 Department of Geography, 6 Department of Chemistry
○ Cisco Research ◇ University of Wisconsin–Madison
† The Ohio State University
* Correspondence to li.14042@osu.edu, moussa.45@osu.edu, sun.397@osu.edu

Overview

Despite long-standing efforts in accelerating scientific discovery with AI, building reliable AI co-scientists remains challenging due to the lack of high-quality data for training and evaluation. To address this data scarcity problem, we introduce AutoSDT—an automatic pipeline that collects high-quality coding tasks from real-world data-driven discovery workflows.

AutoSDT leverages the coding capabilities and parametric knowledge of large language models (LLMs) to search from diverse sources, identify ecologically valid scientific tasks, and synthesize both task instructions and code solutions automatically. Using this pipeline, we construct AutoSDT-5K, a dataset of 5,404 scientific coding tasks spanning four scientific disciplines and using 756 unique Python packages.

  • AutoSDT-5K is the largest and the only automatically collected open dataset for data-driven scientific discovery.
  • Models trained on AutoSDT-5K, named AutoSDT-Coder, achieve strong performance on two challenging discovery benchmarks.
  • AutoSDT-Coder-32B reaches GPT-4o-level performance on ScienceAgentBench with a success rate of 7.8%, doubling the performance of its base model.
  • It also improves the hypothesis matching score on DiscoveryBench by 17.4%, significantly narrowing the gap between open-weight models and proprietary ones.


AutoSDT Pipeline

Pipeline of AutoSDT. The AutoSDT pipeline consists of three main stages to automatically construct data-driven scientific coding tasks. (1) AutoSDT-Search crawls GitHub and Papers with Code using seed keywords and filters repositories relevant to scientific research. (2) AutoSDT-Select identifies Python files containing data-driven scientific code and extracts their dependency folders. (3) AutoSDT-Adapt adapts these programs for standalone execution and generates natural language instructions to form finalized tasks. Right: The resulting AutoSDT-5K dataset includes 5,404 ecologically valid scientific tasks, each paired with a task instruction and runnable code solution.
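
To make the three stages concrete, the sketch below shows how they could compose in Python. Every function name, dataclass field, and callable argument (crawlers, LLM-based filters, adapters) is a hypothetical stand-in; the actual prompts, heuristics, and implementation details are those described in the paper.

"""Illustrative sketch of the three AutoSDT stages (hypothetical names throughout)."""
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class CandidateRepo:
    url: str
    python_files: list[str]  # paths of .py files found in the repository


@dataclass
class Task:
    instruction: str         # LLM-generated natural-language task instruction
    program: str             # adapted, standalone-executable Python program
    dependencies: list[str]  # data files/folders the program needs at runtime


def autosdt_search(seed_keywords: Iterable[str],
                   crawl: Callable[[str], list[CandidateRepo]],
                   is_scientific: Callable[[CandidateRepo], bool]) -> list[CandidateRepo]:
    """Stage 1 (Search): crawl GitHub / Papers with Code with seed keywords
    and keep repositories judged relevant to scientific research."""
    found: list[CandidateRepo] = []
    for keyword in seed_keywords:
        found.extend(crawl(keyword))
    return [repo for repo in found if is_scientific(repo)]


def autosdt_select(repos: Iterable[CandidateRepo],
                   is_data_driven: Callable[[str], bool],
                   extract_dependencies: Callable[[str], list[str]]) -> list[tuple[str, list[str]]]:
    """Stage 2 (Select): keep Python files that perform data-driven analysis
    and record the dependency folders they read from."""
    selected: list[tuple[str, list[str]]] = []
    for repo in repos:
        for path in repo.python_files:
            if is_data_driven(path):
                selected.append((path, extract_dependencies(path)))
    return selected


def autosdt_adapt(selected: list[tuple[str, list[str]]],
                  make_standalone: Callable[[str], str],
                  write_instruction: Callable[[str], str]) -> list[Task]:
    """Stage 3 (Adapt): rewrite each program for standalone execution and
    generate a natural-language instruction, yielding finalized tasks."""
    tasks: list[Task] = []
    for path, deps in selected:
        program = make_standalone(path)
        tasks.append(Task(instruction=write_instruction(program),
                          program=program,
                          dependencies=deps))
    return tasks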

Dataset Composition Analysis

Subtask Distribution: AutoSDT-5K covers diverse workflow components from data preprocessing to advanced analytics.

Package Ecosystem: 756 unique packages including domain-specific tools like rdkit, geopandas, and scanpy.
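
As a rough illustration of how a package count like this can be obtained, the sketch below walks a folder of gold programs and collects top-level import names with Python's ast module. The directory path and the lack of name normalization (e.g., mapping import names to PyPI package names) are simplifying assumptions, not the paper's exact procedure.

import ast
from pathlib import Path

def imported_packages(source: str) -> set[str]:
    """Return the top-level package names imported by a Python source string."""
    packages: set[str] = set()
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                packages.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            packages.add(node.module.split(".")[0])
    return packages

def count_unique_packages(program_dir: str) -> int:
    """Count unique top-level packages across all programs under a folder."""
    all_packages: set[str] = set()
    for path in Path(program_dir).rglob("*.py"):
        try:
            all_packages |= imported_packages(path.read_text(encoding="utf-8"))
        except SyntaxError:
            continue  # skip files that do not parse
    return len(all_packages)

# Example (hypothetical directory layout):
# print(count_unique_packages("autosdt5k/gold_programs"))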

Table 1: Dataset Summary

Table 1: AutoSDT-5K summary statistics across four scientific domains, including task counts, average lines of code, and unique packages.


Model Performance on AutoSDT

Figure 1: AutoSDT-Coder vs other LLMs on ScienceAgentBench

Figure 1: AutoSDT-Coder-32B reaches GPT-4o-level performance on ScienceAgentBench and significantly improves verification accuracy compared to other open-weight and proprietary models.

Task Examples from AutoSDT

AutoSDT automatically collects high-quality coding tasks from real-world data-driven scientific discovery workflows. Each task is synthesized by a large language model and includes natural language instructions, dataset specifications, and a runnable Python solution, with the goal of benchmarking language agents on realistic, ecologically valid scientific tasks that require data analysis, modeling, and visualization. Concretely, each task consists of four components (a hypothetical record illustrating them appears at the end of this subsection):

  1. Task Instruction, which describes the goal of an essential task in data-driven discovery and its output requirements.
  2. Dataset Information, which contains the dataset's directory structure and a preview of its content.
  3. Expert-Provided Knowledge, which includes explanations for scientific terms, formulas to conduct analysis, and example usages of programming tools.
  4. Annotated Program, which is adapted from an open-source code repository accompanying a peer-reviewed scientific publication.

Example tasks in AutoSDT.
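
To make these components concrete, a single task instance can be pictured as a record like the one below. All field names, example values, and the repository URL are illustrative placeholders rather than the released data format.

# Hypothetical shape of one AutoSDT-5K task instance; every field name and
# value below is illustrative, not the released schema.
example_task = {
    "task_instruction": (
        "Fit a regression model on the provided measurement table and save "
        "the predictions to pred_results/predictions.csv."
    ),
    "dataset_info": {
        "directory": "benchmark/datasets/example_dataset/",
        "preview": "sample_id,feature_a,feature_b,label\nS1,0.42,1.37,0\n...",
    },
    "expert_knowledge": "Term definitions, analysis formulas, or tool usage hints.",
    "annotated_program": "solution.py",  # runnable program adapted from the source repository
    "source_repository": "https://github.com/example-org/example-repo",  # placeholder URL
}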


Evaluation

We comprehensively evaluate AutoSDT from both automatic and human perspectives:

  • Automatic Evaluation: We use success rate (SR), verification rate (VER), and hypothesis matching score (HMS), along with code-level similarity metrics including Exact Match (EM) and CodeBLEU, to assess whether the synthesized Python programs solve the assigned scientific tasks accurately and reliably; a sketch of how SR and VER aggregate over per-task outcomes follows this list.
  • Human Evaluation: We sample 256 tasks and invite domain experts to judge the quality of instructions and solutions. Results show that 93% of tasks are ecologically valid, and 92.2% of programs are functionally correct.
  • Downstream Performance: Models trained on AutoSDT-5K, dubbed AutoSDT-Coder, achieve strong performance on two challenging discovery benchmarks.
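
As referenced above, here is a minimal sketch of how the execution-based metrics aggregate over per-task outcomes. The per-task judgments themselves (and the GPT-judged HMS) follow the respective benchmarks; the record fields below are assumptions.

from dataclasses import dataclass

@dataclass
class TaskOutcome:
    executed: bool        # program ran to completion without errors
    output_correct: bool  # output judged correct for the task

def verification_rate(outcomes: list[TaskOutcome]) -> float:
    """VER: fraction of generated programs that execute without error (assumes a non-empty list)."""
    return sum(o.executed for o in outcomes) / len(outcomes)

def success_rate(outcomes: list[TaskOutcome]) -> float:
    """SR: fraction of programs that both execute and solve the task."""
    return sum(o.executed and o.output_correct for o in outcomes) / len(outcomes)

# Example: two of three programs run, and one of them also solves its task.
# outcomes = [TaskOutcome(True, True), TaskOutcome(True, False), TaskOutcome(False, False)]
# success_rate(outcomes) -> 0.33..., verification_rate(outcomes) -> 0.66...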
Table 2: Human evaluation of ecological validity and functional correctness

Table 2: Human evaluation confirms 93% ecological validity and 92.2% functional correctness across 256 sampled tasks.

Table 3: Model performance before and after fine-tuning on AutoSDT-5K

Table 3: Fine-tuning on AutoSDT-5K significantly improves the performance of Qwen2.5-based open-weight models on scientific discovery tasks.


Comparison with Existing Benchmarks

AutoSDT introduces a new paradigm for benchmark construction by leveraging large language models (LLMs) to automatically mine and adapt real-world scientific code. Compared to existing benchmarks, it offers distinct advantages across multiple dimensions:

  • Construction Method: Existing benchmarks like ScienceAgentBench and DiscoveryBench rely on manual task design. In contrast, AutoSDT is constructed automatically via LLM-driven code mining and adaptation.
  • Scale and Domain Diversity: AutoSDT-5K includes 5,404 tasks across four scientific domains and 756 unique Python packages, making it the largest ecologically valid dataset of its kind.
  • Ecological Validity: Tasks in AutoSDT are derived from real-world codebases used in scientific publications, ensuring high authenticity. Human evaluation confirms 93% of tasks are ecologically valid.
  • Downstream Model Performance: Fine-tuning on AutoSDT-5K enables open-weight models like AutoSDT-Coder-32B to match the performance of proprietary models such as GPT-4o on ScienceAgentBench and to significantly boost hypothesis matching on DiscoveryBench.
Table 4: AutoSDT-Coder vs other models on existing benchmarks

Table 4: Comparison of model performance on ScienceAgentBench and DiscoveryBench. AutoSDT-Coder-32B achieves the highest performance among open-weight models and narrows the gap with proprietary models.


Disclaimer

AutoSDT is constructed from publicly available scientific data, publications, and open-source code repositories. We have made every effort to cite original sources and respect the intellectual property of their creators. Details, including references and licenses, are provided in Appendix I of the AutoSDT paper. If any repository owner wishes to request removal or correction of relevant content, please contact us so that we can ensure proper attribution and compliance.

Citation

@misc{li2025autosdtscalingdatadriven,
  title={AutoSDT: Scaling Data-Driven Discovery Tasks Toward Open Co-Scientists}, 
  author={Yifei Li and Hanane Nour Moussa and Ziru Chen and Shijie Chen and Botao Yu and Mingyi Xue and Benjamin Burns and Tzu-Yao Chiu and Vishal Dey and Zitong Lu and Chen Wei and Qianheng Zhang and Tianyu Zhang and Song Gao and Xuhui Huang and Xia Ning and Nesreen K. Ahmed and Ali Payani and Huan Sun},
  year={2025},
  eprint={2506.08140},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2506.08140}, 
}