Abstract: To enhance large language models (LLMs) for chemistry problem solving, several LLM-based agents augmented with tools have been proposed, such as ChemCrow and Coscientist. However, their evaluations are narrow in scope, leaving a large gap in understanding the benefits of tools across diverse chemistry tasks. To bridge this gap, we develop ChemAgent, an enhanced chemistry agent over ChemCrow, and conduct a comprehensive evaluation of its performance on both specialized chemistry tasks and general chemistry questions. Surprisingly, ChemAgent does not consistently outperform its base LLMs without tools. Our error analysis with a chemistry expert suggests that for specialized chemistry tasks, such as synthesis prediction, we should augment agents with specialized tools; however, for general chemistry questions like those in exams, agents' ability to reason correctly with chemistry knowledge matters more, and tool augmentation does not always help.
To explore and enhance the capabilities of agents in diverse and complex chemistry scenarios, we introduce ChemAgent, an advanced language agent designed for a wide range of chemistry tasks. It is equipped with a comprehensive set of 29 tools, including general tools (PythonREPL, WebSearch, etc.), molecule tools (name converters, molecular property predictors, etc.), and reaction tools (ForwardSynthesis, Retrosynthesis).
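As a rough sketch of how such a categorized tool set might be wired up, consider the snippet below. The `Tool` record, the toy `name_to_smiles` converter, and the `dispatch` helper are illustrative assumptions for this sketch, not the paper's actual implementation.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Tool:
    name: str
    category: str              # "general", "molecule", or "reaction"
    description: str           # surfaced to the LLM so it can pick the right tool
    run: Callable[[str], str]  # the underlying implementation

def name_to_smiles(name: str) -> str:
    """Toy stand-in for a real name-to-SMILES converter."""
    lookup = {"aspirin": "CC(=O)OC1=CC=CC=C1C(=O)O"}  # illustrative only
    return lookup.get(name.lower(), "UNKNOWN")

TOOLS = [
    Tool("Name2SMILES", "molecule",
         "Convert a chemical name to its SMILES string.", name_to_smiles),
    # ... the remaining general, molecule, and reaction tools register the same way
]

def dispatch(tool_name: str, argument: str) -> str:
    """Route a tool call emitted by the agent to the matching implementation."""
    tool = next(t for t in TOOLS if t.name == tool_name)
    return tool.run(argument)

print(dispatch("Name2SMILES", "aspirin"))  # -> CC(=O)OC1=CC=CC=C1C(=O)O
```

Keeping each tool's description in the registry lets the agent's prompt list every available tool, so the LLM can decide which one to call at each step.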
The datasets we use are listed in the table below.
The following table shows the overall performance on the specialized tasks (the SMolInstruct dataset) and the general questions (the MMLU-Chemistry and GPQA-Chemistry datasets).
To understand the errors that ChemAgent makes, we engage a chemistry expert to analyze the problem-solving process of every sample where ChemAgent (GPT) fails and to identify the errors, which are categorized into three types: reasoning error, grounding error, and tool error. The following bar charts show the error distribution; a toy sketch of how such a distribution is tallied appears after the takeaways below.
Main takeaways:
(1) The proposed ChemAgent consistently outperforms ChemCrow, a pioneering chemistry agent.
(2) Compared to the base LLMs without tools, ChemAgent, with the help of its dedicated tools, achieves much better performance on the specialized tasks in SMolInstruct. However, it surprisingly underperforms the base LLMs on general questions from MMLU-Chemistry and GPQA-Chemistry.
(3) The error analysis reveals that, on general chemistry questions, ChemAgent makes many reasoning errors, and its underperformance is primarily due to subtle mistakes at intermediate stages of its problem-solving process, such as wrong reasoning steps and information oversight. Future research could improve LLM-based agents for chemistry by optimizing cognitive load and enhancing reasoning and information verification abilities.
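As a toy illustration of how the per-category error distribution from the expert analysis can be tallied, here is a minimal sketch; the annotation labels below are invented for demonstration and are not the paper's data.

```python
from collections import Counter

# Hypothetical expert labels, one per failed sample; the real labels come
# from the chemistry expert's analysis described above.
annotations = ["reasoning", "reasoning", "grounding", "tool", "reasoning"]

counts = Counter(annotations)
total = sum(counts.values())
for category in ("reasoning", "grounding", "tool"):
    share = 100 * counts[category] / total
    print(f"{category:>9} errors: {counts[category]} ({share:.0f}%)")
```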
Please check out our paper for more details.
If you find our paper or related resources valuable to your research or applications, we kindly ask that you cite our work. Please feel free to contact us with any inquiries.
@article{yu2024chemagent,
  title={Tooling or Not Tooling? The Impact of Tools on Language Agents for Chemistry Problem Solving},
  author={Botao Yu and Frazier N. Baker and Ziru Chen and Garrett Herb and Boyu Gou and Daniel Adu-Ampratwum and Xia Ning and Huan Sun},
  journal={arXiv preprint arXiv:2411.07228},
  year={2024}
}