---
license: cc-by-4.0
datasets:
- MS-ML/synth1_2x4.7M
- MS-ML/synth2_2x4.8M
metrics:
- accuracy
library_name: transformers
tags:
- mass-spectrometry
- GC-EI-MS
- Transformer
- molecular-structure-reconstruction
- compound-identification
---

The SpecTUS model pretrained on the combined synth1_2x4.7M and synth2_2x4.8M datasets for 448k steps.

The model is a Transformer-based neural network trained to elucidate molecular structures from GC-EI-MS spectra.
The model was pretrained on a large dataset of 17.2M synthetic training spectra generated from two identical sets of 8.6M
compounds using the [NEIMS] and [RASSP] models.

We mainly aimed to give the model an understanding of the chemical space of small molecules. The training was
conducted with a batch size of 128 for 448,000 steps, allowing the model to process each of the 17.2 million spectra approximately three times.
The entire pretraining process, including control evaluations every 16,000 steps, took 58 hours on a single Nvidia H100 GPU.

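The "approximately three times" figure follows directly from the batch size and step count; a quick back-of-the-envelope check:

```python
# Sanity check of the epoch count: samples seen = batch size x optimizer steps,
# and epochs = samples seen / dataset size.
batch_size = 128
steps = 448_000
dataset_size = 17_200_000  # 17.2M synthetic spectra

samples_seen = batch_size * steps      # 57,344,000 spectra processed in total
epochs = samples_seen / dataset_size   # ~3.3 passes over the training data
print(f"{epochs:.2f} epochs")
```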
During pretraining, the percentage of correctly reconstructed structures increased steadily but remained relatively low at the
end of the stage: 38% for RASSP-generated spectra, 29% for NEIMS-generated spectra, and 3% for NIST spectra. However, 96% of
the generated SMILES strings (RASSP, NEIMS) were valid canonical molecules, with 91% (RASSP), 78% (NEIMS), and 14% (NIST) having
correct molecular formulas, though possibly incorrect structures. These results suggest that during the pretraining phase, the model
successfully learned molecular structure rules and the relationship between atomic weight and m/z values, forming a good foundation
for subsequent finetuning.

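The validity and molecular-formula metrics above can be checked with standard cheminformatics tooling; a minimal sketch using RDKit (not part of this repository, shown only for illustration):

```python
# Illustration of the two pretraining metrics reported above:
# SMILES validity and molecular-formula agreement. Requires RDKit.
from rdkit import Chem
from rdkit.Chem.rdMolDescriptors import CalcMolFormula

def is_valid_smiles(smiles: str) -> bool:
    """A predicted SMILES string counts as valid if RDKit can parse it."""
    return Chem.MolFromSmiles(smiles) is not None

def same_formula(predicted: str, reference: str) -> bool:
    """Check whether two SMILES share a molecular formula
    (the structures may still differ)."""
    pred = Chem.MolFromSmiles(predicted)
    ref = Chem.MolFromSmiles(reference)
    if pred is None or ref is None:
        return False
    return CalcMolFormula(pred) == CalcMolFormula(ref)

# Ethanol vs. dimethyl ether: same formula (C2H6O), different structure.
print(is_valid_smiles("CCO"))       # True
print(same_formula("CCO", "COC"))   # True
print(same_formula("CCO", "CCC"))   # False (C2H6O vs. C3H8)
```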
We suggest finetuning the model further on experimental data (NIST, Wiley) to reach the performance reported in our [preprint]. However, we cannot
make the final model available, since it was finetuned on a proprietary dataset (NIST). If you have purchased the NIST GC-EI-MS license, you can
either finetune the model yourself using the code in [our GitHub repository] or contact us with proof of the license and we will share the final
model with you. The code we used for data processing, finetuning, evaluation, model comparison, and more can also be found in [our GitHub repository] (TODO).

Our [preprint] (TODO) provides more information about the task background, the final finetuned model, and the experiments.

[NEIMS]: https://github.com/brain-research/deep-molecular-massspec
[RASSP]: https://github.com/thejonaslab/rassp-public
[our GitHub repository]: https://github.com/hejjack/SpecTUS/
[preprint]: !TODO!