We comprehensively evaluate AutoSDT from automatic, human, and downstream perspectives:
- Automatic Evaluation: We use success rate (SR), verification rate (VER), and hypothesis matching score (HMS), along with code-level similarity metrics including Exact Match (EM) and CodeBLEU, to assess whether the synthesized Python programs solve the assigned scientific tasks accurately and reliably (a minimal sketch of the code-level metrics follows this list).
- Human Evaluation: We sample 256 tasks and invite domain experts to judge the quality of instructions and solutions. Results show that 93% of tasks are ecologically valid, and 92.2% of programs are functionally correct.
- Downstream Performance: Models trained on AutoSDT-5K, dubbed AutoSDT-Coder, achieve strong performance on two challenging discovery benchmarks.
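
To make the code-level evaluation concrete, below is a minimal sketch of how a success-rate and exact-match computation could look. The helper names (`run_program`, `success_rate`, `exact_match`) and the whitespace-normalized matching rule are illustrative assumptions rather than the paper's exact definitions; VER, HMS, and CodeBLEU need additional components (an execution verifier, a hypothesis-matching judge, and the CodeBLEU scorer) that are omitted here.

```python
import subprocess
from typing import List


def run_program(path: str, timeout: int = 300) -> bool:
    """Return True if the program exits without error (a proxy for execution success)."""
    try:
        result = subprocess.run(
            ["python", path], capture_output=True, timeout=timeout
        )
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False


def success_rate(program_paths: List[str]) -> float:
    """Fraction of synthesized programs that run to completion."""
    outcomes = [run_program(p) for p in program_paths]
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


def exact_match(predictions: List[str], references: List[str]) -> float:
    """Fraction of predicted programs identical to the reference after whitespace normalization."""
    matches = [
        " ".join(p.split()) == " ".join(r.split())
        for p, r in zip(predictions, references)
    ]
    return sum(matches) / len(matches) if matches else 0.0
```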