---
license: cc-by-4.0
datasets:
- MS-ML/synth1_2x4.7M
- MS-ML/synth2_2x4.8M
metrics:
- accuracy
library_name: transformers
tags:
- mass-spectrometry
- GC-EI-MS
- Transformer
- molecular-structure-reconstruction
- compound-identification
---

The SpecTUS model pretrained on the combined synth1_2x4.7M and synth2_2x4.8M datasets for 448k steps.

The model is a Transformer-based neural network trained to elucidate molecular structures from GC-EI-MS spectra.
The model was pretrained on a large dataset of 17.2M synthetic training spectra generated from two identical sets of 8.6M
compounds using the [NEIMS] and [RASSP] models.

Our main aim in pretraining was to give the model an understanding of the chemical space of small molecules. The training was
conducted with a batch size of 128 for 448,000 steps, so the model processed each of the 17.2 million spectra approximately three times.
The entire pretraining process, including control evaluations every 16,000 steps, took 58 hours on a single NVIDIA H100 GPU.

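The "approximately three times" figure follows directly from the numbers quoted above. The short check below is only an arithmetic sketch, not part of the training code; the variable names are illustrative, and the throughput is an average that includes the time spent on control evaluations.

```python
# Figures quoted in the pretraining description.
batch_size = 128
total_steps = 448_000
num_spectra = 17_200_000
eval_interval = 16_000
wall_clock_hours = 58

# Total number of spectra the model was shown during pretraining.
spectra_seen = batch_size * total_steps

# How many times, on average, each training spectrum was processed.
passes_over_data = spectra_seen / num_spectra

# Number of control evaluations and average wall-clock throughput.
num_evaluations = total_steps // eval_interval
steps_per_second = total_steps / (wall_clock_hours * 3600)

print(f"spectra processed: {spectra_seen:,}")
print(f"passes over the training set: {passes_over_data:.2f}")
print(f"control evaluations: {num_evaluations}")
print(f"average throughput: {steps_per_second:.2f} steps/s")
```

The result is roughly 3.3 passes over the training set, consistent with the rounded figure given in the text.
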
During pretraining, the percentage of correctly reconstructed structures increased steadily, but it remained relatively low at the
end of the stage: 38% for RASSP-generated spectra, 29% for NEIMS-generated spectra, and 3% for NIST spectra. However, 96% of
the SMILES strings generated for the synthetic spectra (RASSP, NEIMS) were valid canonical molecules, and 91% (RASSP), 78% (NEIMS), and 14% (NIST) of
the predictions had the correct molecular formula, though possibly an incorrect structure. These results suggest that during the pretraining phase, the model
successfully learned molecular structure rules and the relationship between atomic weight and m/z values, forming a good foundation
for subsequent finetuning.

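To make the relationship between atomic weight and m/z concrete: in EI-MS the molecular ion usually carries a single positive charge, so its m/z is essentially the mass of the molecule. The sketch below is an illustration only and is not taken from the SpecTUS code; the helper names (`parse_formula`, `molecular_ion_mz`) and the small isotope table are assumptions made for this example, and the parser handles only simple formulas without brackets or charges.

```python
import re

# Mass number and monoisotopic mass (Da) of the most abundant isotope.
# Deliberately small table; extend it for other elements.
ISOTOPES = {
    "H": (1, 1.00782503),
    "C": (12, 12.0),
    "N": (14, 14.0030740),
    "O": (16, 15.9949146),
    "F": (19, 18.9984032),
    "S": (32, 31.9720707),
    "Cl": (35, 34.9688527),
}
ELECTRON_MASS = 0.00054858  # Da


def parse_formula(formula):
    """Turn a simple formula such as 'C6H6' into {'C': 6, 'H': 6}."""
    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + int(number or 1)
    return counts


def molecular_ion_mz(formula):
    """Return (nominal, monoisotopic) m/z of the singly charged ion M+."""
    counts = parse_formula(formula)
    nominal = sum(ISOTOPES[el][0] * n for el, n in counts.items())
    exact = sum(ISOTOPES[el][1] * n for el, n in counts.items())
    # z = 1, so m/z equals the mass minus one removed electron.
    return nominal, exact - ELECTRON_MASS


nominal_mz, exact_mz = molecular_ion_mz("C6H6")  # benzene
print(nominal_mz, round(exact_mz, 4))
```

Two molecules with the same formula give the same molecular-ion m/z under this calculation, which is why a prediction can have the correct formula and still be the wrong structure.
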
We suggest finetuning the model further on experimental data (NIST, Wiley) to reach the performance reported in our [preprint]. We cannot
make the final model available because it was finetuned on a proprietary dataset (NIST). If you have purchased the NIST GC-EI-MS license, you can
either finetune the model yourself using the code in [our GitHub repository] or contact us with proof of the license, and we will share the final
model with you. The code we used for data processing, finetuning, evaluation, model comparison, and more can also be found in [our GitHub repository].

Our [preprint] provides more information about the task background, the final finetuned model, and the experiments.

How to cite:
```text
@misc{hájek2025spectusspectraltranslatorunknown,
      title={SpecTUS: Spectral Translator for Unknown Structures annotation from EI-MS spectra},
      author={Adam Hájek and Helge Hecht and Elliott J. Price and Aleš Křenek},
      year={2025},
      eprint={2502.05114},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2502.05114},
}
```

[NEIMS]: https://github.com/brain-research/deep-molecular-massspec
[RASSP]: https://github.com/thejonaslab/rassp-public
[our GitHub repository]: https://github.com/hejjack/SpecTUS/
[preprint]: https://arxiv.org/abs/2502.05114