We comprehensively evaluate AutoSDT from automatic, human, and downstream perspectives:
- Automatic Evaluation: We use success rate (SR), verification rate (VER), and hypothesis matching score (HMS), along with code-level similarity metrics including Exact Match (EM) and CodeBLEU, to assess whether the synthesized Python programs solve the assigned scientific tasks accurately and reliably (a minimal sketch of the code-level metrics follows this list).
- Human Evaluation: We sample 256 tasks and invite domain experts to judge the quality of instructions and solutions. Results show that 93% of tasks are ecologically valid, and 92.2% of programs are functionally correct.
- Downstream Performance: Models trained on AutoSDT-5K, dubbed AutoSDT-Coder, achieve strong performance on two challenging discovery benchmarks.
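
To make the code-level evaluation concrete, below is a minimal sketch of how a success-rate and exact-match computation could look. The helper names (`run_program`, `success_rate`, `exact_match`) and the whitespace-normalized matching rule are illustrative assumptions rather than the paper's exact definitions; VER, HMS, and CodeBLEU need additional components (an execution verifier, a hypothesis-matching judge, and the CodeBLEU scorer) that are omitted here.

```python
import subprocess
from typing import List


def run_program(path: str, timeout: int = 300) -> bool:
    """Return True if the program exits without error (a proxy for execution success)."""
    try:
        result = subprocess.run(
            ["python", path], capture_output=True, timeout=timeout
        )
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False


def success_rate(program_paths: List[str]) -> float:
    """Fraction of synthesized programs that run to completion."""
    outcomes = [run_program(p) for p in program_paths]
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


def exact_match(predictions: List[str], references: List[str]) -> float:
    """Fraction of predicted programs identical to the reference after whitespace normalization."""
    matches = [
        " ".join(p.split()) == " ".join(r.split())
        for p, r in zip(predictions, references)
    ]
    return sum(matches) / len(matches) if matches else 0.0
```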