hajekad committed on
Commit 7d1b679 · verified · 1 Parent(s): b333ffd

Update README.md

Files changed (1)
  1. README.md +34 -1
README.md CHANGED
@@ -1,5 +1,38 @@
 ---
 license: cc-by-4.0
+datasets:
+- MS-ML/synth2_2x4.7M
+metrics:
+- accuracy
+library_name: transformers
+tags:
+- mass-spectrometry
+- GC-EI-MS
+- Transformer
+- molecular-structure-elucidation
+- compound-identification
 ---
 
-A SpecTUS model pretrained on synth2_2x4.7M for 2x112k steps.
+A SpecTUS model pretrained on synth2_2x4.7M for 2x112k steps.
+
+The model is a Transformer-based neural network trained to elucidate molecular structures from GC-EI-MS spectra.
+It was pretrained on a large dataset of 9.4M synthetic spectra, generated by running both the NEIMS [1] and
+RASSP [2] models on the same set of 4.7M compounds.
+
+The main aim of pretraining was to give the model an understanding of the chemical space of small molecules. Training was
+conducted with a batch size of 128 for 224,000 steps, so the model processed each of the 9.4 million spectra approximately three times.
+The entire pretraining process, including control evaluations every 16,000 steps, took 33 hours on a single Nvidia H100 GPU.
+
+During pretraining, the percentage of correctly reconstructed validation spectra steadily increased, but remained relatively low at the end
+of the stage: 27% for RASSP-generated spectra, 13% for NEIMS-generated spectra, and 2% for NIST spectra. However, 94% of the generated SMILES
+strings (RASSP, NEIMS) were valid canonical molecules, with 83% (RASSP), 65% (NEIMS), and 11% (NIST) having correct molecular formulas.
+These results suggest that during the pretraining phase, the model successfully learned molecular structure rules and the relationship between atomic
+weight and m/z values, forming a good foundation for subsequent finetuning.
+
+
+The full code we used for data processing, finetuning, evaluation, model comparison, and more can be found in our GitHub repository [3].
+
+
+[1]: https://github.com/brain-research/deep-molecular-massspec
+[2]: https://github.com/thejonaslab/rassp-public
+[3]: !! <TODO>
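
The training figures quoted in the README text above (batch size, step count, dataset size, evaluation interval, wall-clock time) can be cross-checked with a short calculation. This is only a sanity-check sketch, not code from the SpecTUS repository. The variable names are made up for illustration, and it assumes that "2x112k steps" means 224,000 optimizer steps in total and that the 33 hours include the control evaluations, as the README states.

```python
# Figures taken from the README text; names are illustrative only.
BATCH_SIZE = 128
TOTAL_STEPS = 224_000        # assumed to equal 2 x 112k steps
DATASET_SIZE = 9_400_000     # 2 x 4.7M synthetic spectra (NEIMS + RASSP)
EVAL_INTERVAL = 16_000       # control evaluation every 16,000 steps
WALL_CLOCK_HOURS = 33        # total time, evaluations included

# Number of training examples drawn over the whole run.
samples_seen = BATCH_SIZE * TOTAL_STEPS

# How many times, on average, each spectrum was processed.
passes_over_data = samples_seen / DATASET_SIZE

# Number of evaluation points if one falls on every multiple of EVAL_INTERVAL.
num_evaluations = TOTAL_STEPS // EVAL_INTERVAL

# Average wall-clock time per step; an upper bound on the pure training
# step time, because evaluation time is included in the 33 hours.
seconds_per_step = WALL_CLOCK_HOURS * 3600 / TOTAL_STEPS

print(f"samples seen:      {samples_seen:,}")
print(f"passes over data:  {passes_over_data:.2f}")
print(f"evaluations:       {num_evaluations}")
print(f"seconds per step:  {seconds_per_step:.2f}")
```

The result of about 3.05 passes agrees with the README's "approximately three times", and the step count divides evenly into 14 evaluation intervals.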