Document GeMS source dataset

README.md

@@ -14,7 +14,7 @@ tags:
 - retrieval
 - chemistry
 datasets:
-- GeMS
+- roman-bushuiev/GeMS
 metrics:
 - hit@k
 - cosine_similarity
@@ -48,7 +48,7 @@ A useful production-facing record should include the predicted candidate structu
 ## Training Data
-The foundation encoder was trained on the GeMS MS/MS corpus prepared into deterministic Parquet shard windows. The phase-1 pretraining campaign processed approximately `201M` spectra across three scale windows. The training stack used self-supervised spectral objectives with light evaluation during training and deeper holdout evaluation at phase boundaries.
+The foundation encoder was derived from the public GeMS MS/MS dataset released by the DreaMS authors at `roman-bushuiev/GeMS` on Hugging Face: https://huggingface.co/datasets/roman-bushuiev/GeMS. The GeMS data were processed into deterministic Parquet shard windows for the NexaMass training pipeline. The phase-1 pretraining campaign processed approximately `201M` spectra across three scale windows. The training stack used self-supervised spectral objectives with light evaluation during training and deeper holdout evaluation at phase boundaries.
 The molecular structure alignment stage used a corrected labeled surface where RDKit Morgan targets were present and enforced. Broad foundation pretraining metrics should be interpreted as SSL representation evidence. Molecular grounding should be interpreted through the corrected structure-alignment run rather than through unlabeled foundation shards.
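
The README text mentions "deterministic Parquet shard windows" without saying how they are built. One possible reading is sketched below: shard file names are sorted and then split into fixed-size groups, so the same inputs always give the same windows. This is an illustration only, not the NexaMass pipeline; the function name `shard_windows`, the `window_size` parameter, and the `part-*.parquet` file names are all hypothetical.

```python
from typing import List, Sequence


def shard_windows(shard_names: Sequence[str], window_size: int) -> List[List[str]]:
    """Group shard file names into fixed-size windows in a reproducible order.

    Hypothetical helper: the real pipeline's windowing rule is not documented here.
    """
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    # Sorting removes any dependence on filesystem or listing order.
    ordered = sorted(shard_names)
    # The last window may be shorter than window_size.
    return [ordered[i:i + window_size] for i in range(0, len(ordered), window_size)]


# Hypothetical shard names, deliberately listed out of order.
shards = [
    "part-00003.parquet",
    "part-00001.parquet",
    "part-00002.parquet",
    "part-00000.parquet",
    "part-00004.parquet",
]
windows = shard_windows(shards, window_size=2)
for index, window in enumerate(windows):
    print(index, window)
```

Because the grouping depends only on the sorted names and the window size, rerunning it on the same shard set in any listing order gives identical windows, which is the property "deterministic" suggests.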