Allanatrix committed on
Commit
0fb70ce
·
verified ·
1 Parent(s): 09b1a69

Document GeMS source dataset

Files changed (1)
  1. README.md +2 -2
README.md CHANGED
@@ -14,7 +14,7 @@ tags:
  - retrieval
  - chemistry
  datasets:
- - GeMS
+ - roman-bushuiev/GeMS
  metrics:
  - hit@k
  - cosine_similarity
@@ -48,7 +48,7 @@ A useful production-facing record should include the predicted candidate structu
 
  ## Training Data
 
- The foundation encoder was trained on the GeMS MS/MS corpus prepared into deterministic Parquet shard windows. The phase-1 pretraining campaign processed approximately `201M` spectra across three scale windows. The training stack used self-supervised spectral objectives with light evaluation during training and deeper holdout evaluation at phase boundaries.
+ The foundation encoder was derived from the public GeMS MS/MS dataset released by the DreaMS authors at `roman-bushuiev/GeMS` on Hugging Face: https://huggingface.co/datasets/roman-bushuiev/GeMS. The GeMS data were processed into deterministic Parquet shard windows for the NexaMass training pipeline. The phase-1 pretraining campaign processed approximately `201M` spectra across three scale windows. The training stack used self-supervised spectral objectives with light evaluation during training and deeper holdout evaluation at phase boundaries.
 
  The molecular structure alignment stage used a corrected labeled surface where RDKit Morgan targets were present and enforced. Broad foundation pretraining metrics should be interpreted as SSL representation evidence. Molecular grounding should be interpreted through the corrected structure-alignment run rather than through unlabeled foundation shards.
 
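
The metadata touched by this diff lists `hit@k` and `cosine_similarity` as metrics, but the commit does not show how they are computed. The following is a minimal sketch, assuming a retrieval setup in which candidate embeddings are ranked by cosine similarity to a query embedding and a query counts as a hit when its true candidate lands in the top `k`. The function names and the toy vectors are hypothetical and are not taken from the NexaMass code.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def hit_at_k(queries, candidates, true_indices, k):
    """Fraction of queries whose true candidate ranks in the top k by cosine similarity."""
    hits = 0
    for query, true_idx in zip(queries, true_indices):
        scores = [cosine_similarity(query, c) for c in candidates]
        # Candidate indices sorted from most to least similar to the query.
        ranked = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
        if true_idx in ranked[:k]:
            hits += 1
    return hits / len(queries) if queries else 0.0


# Toy data: two query embeddings scored against three candidate embeddings.
candidates = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
queries = [[0.9, 0.1], [0.2, 0.8]]
true_indices = [0, 2]

print(hit_at_k(queries, candidates, true_indices, k=1))  # 0.5
print(hit_at_k(queries, candidates, true_indices, k=2))  # 1.0
```

In this toy case the first query retrieves its true candidate at rank 1, while the second finds it only at rank 2, so hit@1 is 0.5 and hit@2 is 1.0.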