---
license: cc-by-4.0
datasets:
- MS-ML/synth2_2x4.7M
metrics:
- accuracy
library_name: transformers
tags:
- mass-spectrometry
- GC-EI-MS
- Transformer
- molecular-structure-elucidation
- compound-identification
---

The SpecTUS model pretrained on synth2_2x4.7M for 2x112k steps.

SpecTUS is a Transformer-based neural network trained to elucidate molecular structures from GC-EI-MS spectra.
It was pretrained on 9.4M synthetic spectra, generated by running the [NEIMS] and [RASSP] models on two identical sets of 4.7M compounds.

The main goal of pretraining was to give the model an understanding of the chemical space of small molecules.
Training ran with a batch size of 128 for 224,000 steps, so the model processed each of the 9.4 million spectra approximately three times.
The entire pretraining process, including control evaluations every 16,000 steps, took 33 hours on a single NVIDIA H100 GPU.
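
The "approximately three times" figure follows directly from the numbers quoted above. The short check below is only a sanity calculation, not part of the training code: the variable names are ours, the dataset size uses the rounded 9.4M figure, and the throughput is an average over the whole 33 hours, evaluations included.

```python
# Figures quoted in the text; the dataset size is the rounded 9.4M value.
batch_size = 128
steps = 224_000             # 2 x 112k steps
dataset_size = 9_400_000    # 2 x 4.7M synthetic spectra
eval_interval = 16_000      # steps between control evaluations
wall_clock_hours = 33

spectra_seen = batch_size * steps           # spectra processed in total
epochs = spectra_seen / dataset_size        # passes over the dataset
n_evals = steps // eval_interval            # number of control evaluations
avg_throughput = spectra_seen / (wall_clock_hours * 3600)  # spectra per second

print(f"spectra processed:   {spectra_seen:,}")
print(f"passes over dataset: {epochs:.2f}")
print(f"control evaluations: {n_evals}")
print(f"average throughput:  {avg_throughput:.0f} spectra/s")
```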

During pretraining, the percentage of correctly reconstructed validation spectra increased steadily but remained relatively low at the end: 27%
for RASSP-generated spectra, 13% for NEIMS-generated spectra, and 2% for NIST spectra. However, 94% of the generated SMILES
strings (RASSP, NEIMS) were valid canonical molecules, and 83% (RASSP), 65% (NEIMS), and 11% (NIST) had the correct molecular formula.
These results suggest that during pretraining the model learned the rules of molecular structure and the relationship between atomic
weight and m/z values, which forms a good foundation for subsequent finetuning.
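
To illustrate the relationship between atomic weight and m/z mentioned above: in EI spectra the molecular ion is usually singly charged, so its m/z is approximately the nominal mass of the molecule (the peak itself can be weak or absent). The sketch below is not taken from the SpecTUS code. The helper names `parse_formula`, `nominal_mass`, and `same_formula` are hypothetical, the element table is a small illustrative subset, and the parser assumes flat formulas without brackets or charges.

```python
import re

# Nominal (integer) masses of the most abundant isotopes.
# Illustrative subset only; extend as needed.
NOMINAL_MASS = {
    "H": 1, "C": 12, "N": 14, "O": 16, "F": 19,
    "Si": 28, "P": 31, "S": 32, "Cl": 35, "Br": 79,
}

def parse_formula(formula):
    """Turn a flat formula such as 'C2H6O' into {'C': 2, 'H': 6, 'O': 1}."""
    counts = {}
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + (int(count) if count else 1)
    return counts

def nominal_mass(formula):
    """Sum of nominal atomic masses, i.e. the expected m/z of a singly charged molecular ion."""
    return sum(NOMINAL_MASS[el] * n for el, n in parse_formula(formula).items())

def same_formula(predicted, reference):
    """Compare two formulas by element counts, ignoring element order."""
    return parse_formula(predicted) == parse_formula(reference)

print(nominal_mass("C6H6"))             # benzene
print(nominal_mass("C2H6O"))            # ethanol
print(same_formula("C2H6O", "OC2H6"))   # same composition, different order
```

A prediction with the correct molecular formula therefore already agrees with the reference on the molecular-ion mass, even when the structure itself is wrong.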

We recommend finetuning the model further on experimental data (NIST, Wiley) to reach the performance reported in our [preprint]. We cannot
make the final model available because it was finetuned on a proprietary dataset (NIST), but you can finetune it yourself if you have purchased a license.
The full code we used for data processing, finetuning, evaluation, model comparison, and more can be found in [our GitHub repository] (TODO).

Our [preprint] (TODO) provides more information about the task background, the final finetuned model, and the experiments.

[NEIMS]: https://github.com/brain-research/deep-molecular-massspec
[RASSP]: https://github.com/thejonaslab/rassp-public
[our GitHub repository]: !TODO!
[preprint]: !TODO!