SFT dataset details

by cyzhh - opened Feb 28, 2024

Feb 28, 2024

In Figure 2 (Main Pipeline of Model Training), ChemData is described as 7M Q&A pairs from chemical corpora, but in Appendix C (hyperparameters), ChemData is listed with 70.2 million entries. Could you clarify which number is correct?

di-zhang-fdu

AI4Chem org Feb 29, 2024

Thanks for your comment!
ChemData and ChemLLM are still works in progress.
There may be some typos in our preprint.
The correct size information should be in Table 4: Statics of Instruction Datasets.
This number may change in the future depending on our data processing strategy settings.

cyzhh

Feb 29, 2024

Thanks for your help!
Also, I want to ask why you test on GSM8K and use the GSM8K training set. It shouldn't influence performance on chemistry, should it?

di-zhang-fdu

AI4Chem org Feb 29, 2024

There are tasks for property prediction and arithmetic computation on molecular structures, and we want to explore how chemical computing tasks influence the model's mathematical ability.
We only use GSM8K as an evaluation benchmark; it is not included in our training dataset.

cyzhh

Feb 29, 2024

I find it very interesting that a 7-million-entry chemical SFT dataset can improve an already high GSM8K score by 6 points. I hope the reasons for this can be analyzed later.

di-zhang-fdu

AI4Chem org Feb 29, 2024

Some training data was collected or crawled from Hugging Face and other internet sources with limited analysis.
We're still working on improving and cleaning it to produce more insights into LLM applications in chemistry.
Because of this, our training data volume, analytics, and evaluation results may all be affected by future changes in our model training or data processing strategies.
Thanks!

cyzhh

Feb 29, 2024

I'm very interested in your work. Thanks!

di-zhang-fdu

AI4Chem org Feb 29, 2024

Thanks for your insightful comments!

cyzhh changed discussion status to closed Mar 1, 2024
