hajekad committed · Commit 6266e17 · verified · 1 Parent(s): eabf59b

Update README.md

Files changed (1): README.md +17 -14
README.md CHANGED
@@ -1,7 +1,8 @@
 ---
 license: cc-by-4.0
 datasets:
-- MS-ML/synth2_2x4.7M
+- MS-ML/synth1_2x4.7M
+- MS-ML/synth2_2x4.8M
 metrics:
 - accuracy
 library_name: transformers
@@ -9,33 +10,35 @@ tags:
 - mass-spectrometry
 - GC-EI-MS
 - Transformer
-- molecular-structure-elucidation
+- molecular-structure-reconstruction
 - compound-identification
 ---
 
-The SpecTUS model pretrained on synth2_2x4.7M for 2x112k steps.
+The SpecTUS model pretrained on synth1_2x4.7M and synth2_2x4.8M combined for 448k steps.
 
 The model is a Transformer-based neural network trained to elucidate molecular structures from GC-EI-MS spectra.
-The model was pretrained on a large dataset of 9.4M synthetic spectra generated from two identical sets of 4.7M
+The model was pretrained on a large dataset of 17.2M synthetic training spectra generated from two identical sets of 8.6M
 compounds using the [NEIMS] and [RASSP] models.
 
 We mainly aimed to give the model an understanding of the chemical space of small molecules. The training was
-conducted with a batch size of 128 for 224,000 steps, allowing the model to process each of the 9.4 million spectra approximately three times.
-The entire pretraining process, including control evaluations every 16,000 steps, took 33 hours on a single Nvidia H100 GPU.
+conducted with a batch size of 128 for 448,000 steps, allowing the model to process each of the 17.2 million spectra approximately three times.
+The entire pretraining process, including control evaluations every 16,000 steps, took 58 hours on a single Nvidia H100 GPU.
 
-During pretraining, the percentage of correctly reconstructed validation spectra steadily increased, but remained relatively low at the end: 27%
-for RASSP-generated spectra, 13% for NEIMS-generated spectra, and 2% for NIST spectra. However, 94% of the generated SMILES
-strings (RASSP, NEIMS) were valid canonical molecules, with 83% (RASSP), 65% (NEIMS), and 11% (NIST) having correct molecular formulas.
-These results suggest that during the pretraining phase, the model successfully learned molecular structure rules and the relationship between atomic
-weight and m/z values, forming a good foundation for subsequent finetuning.
+During pretraining, the percentage of correctly reconstructed structures increased steadily but remained relatively low at the
+end of the stage: 38% for RASSP-generated spectra, 29% for NEIMS-generated spectra, and 3% for NIST spectra. However, 96% of
+the generated SMILES strings (RASSP, NEIMS) were valid canonical molecules, with 91% (RASSP), 78% (NEIMS), and 14% (NIST) having
+correct molecular formulas, though possibly incorrect structures. These results suggest that during the pretraining phase, the model
+successfully learned molecular structure rules and the relationship between atomic weight and m/z values, forming a good foundation
+for subsequent finetuning.
 
 We suggest finetuning the model further on experimental data (NIST, Wiley) to reach the performance reported in our [preprint]. We cannot
-make the final model available, since it was finetuned on a proprietary dataset (NIST), but you can fine-tune it yourself if you have purchased the license.
-The full code we used for the data processing, finetuning, evaluation, model comparison, and more can be found in [our GitHub repository] (TODO).
+make the final model available, since it was finetuned on a proprietary dataset (NIST). If you have purchased the NIST GC-EI-MS license, you can
+either fine-tune the model yourself using the code in [our GitHub repository] or contact us with proof of the license and we will share the final
+model with you. The code we used for data processing, finetuning, evaluation, model comparison, and more can also be found in [our GitHub repository].
 
 Our [preprint] (TODO) provides more information about the task background, the final finetuned model, and the experiments.
 
 [NEIMS]: https://github.com/brain-research/deep-molecular-massspec
 [RASSP]: https://github.com/thejonaslab/rassp-public
-[our GitHub repository]: !TODO!
+[our GitHub repository]: https://github.com/hejjack/SpecTUS/
 [preprint]: !TODO!
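The formula-match figures in the README count predictions whose elemental composition equals the reference, regardless of connectivity. The actual evaluation presumably derives formulas from SMILES with a cheminformatics toolkit; the helper below (`formula_match`, a hypothetical illustration, not the authors' code) sketches the comparison directly on Hill-notation formula strings:

```python
import re
from collections import Counter

def parse_formula(formula: str) -> Counter:
    """Parse a molecular formula such as 'C8H10N4O2' into element counts."""
    counts = Counter()
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        if element:  # skip empty matches produced by the optional groups
            counts[element] += int(number) if number else 1
    return counts

def formula_match(predicted: str, reference: str) -> bool:
    """True when two formulas describe the same elemental composition."""
    return parse_formula(predicted) == parse_formula(reference)

# Two structural isomers share a formula, so this metric scores them as a match;
# it only checks composition, not the predicted structure itself.
print(formula_match("C8H10N4O2", "C8H10N4O2"))
print(formula_match("C8H10N4O2", "C8H11N4O2"))
```

Averaging `formula_match` over a validation set gives the kind of composition-level accuracy reported above (91% RASSP, 78% NEIMS, 14% NIST), which is deliberately looser than exact structure reconstruction.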