Document GeMS source dataset
README.md
CHANGED
@@ -14,7 +14,7 @@ tags:
 - retrieval
 - chemistry
 datasets:
-- GeMS
+- roman-bushuiev/GeMS
 metrics:
 - hit@k
 - cosine_similarity
@@ -48,7 +48,7 @@ A useful production-facing record should include the predicted candidate structu
 
 ## Training Data
 
-The foundation encoder was
+The foundation encoder was derived from the public GeMS MS/MS dataset released by the DreaMS authors at `roman-bushuiev/GeMS` on Hugging Face: https://huggingface.co/datasets/roman-bushuiev/GeMS. The GeMS data were processed into deterministic Parquet shard windows for the NexaMass training pipeline. The phase-1 pretraining campaign processed approximately `201M` spectra across three scale windows. The training stack used self-supervised spectral objectives with light evaluation during training and deeper holdout evaluation at phase boundaries.
 
 The molecular structure alignment stage used a corrected labeled surface where RDKit Morgan targets were present and enforced. Broad foundation pretraining metrics should be interpreted as SSL representation evidence. Molecular grounding should be interpreted through the corrected structure-alignment run rather than through unlabeled foundation shards.
 
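Reviewer note: the card metadata lists `hit@k` and `cosine_similarity` as metrics but the changed sections do not define them. The sketch below shows one common reading, in which candidate structures are ranked by cosine similarity to a predicted vector and a hit is counted when the correct candidate lands in the top `k`. The function names, the plain-list vector representation, and the tie-breaking rule are assumptions made for illustration; nothing here is taken from the NexaMass evaluation code.

```python
import math
from typing import Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two equal-length vectors.

    Returns 0.0 when either vector is all zeros (an assumed convention).
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def hit_at_k(query: Vector, candidates: Sequence[Vector], true_index: int, k: int) -> bool:
    """True if the correct candidate is among the k most similar to the query.

    Ties are broken by candidate order, because Python's sort is stable.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    scores = [cosine_similarity(query, c) for c in candidates]
    ranked = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return true_index in ranked[:k]


def mean_hit_at_k(examples: Sequence[Tuple[Vector, Sequence[Vector], int]], k: int) -> float:
    """Fraction of (query, candidates, true_index) examples that score a hit at k."""
    if not examples:
        raise ValueError("examples must not be empty")
    hits = [hit_at_k(q, cands, t, k) for q, cands, t in examples]
    return sum(hits) / len(hits)
```

For example, `hit_at_k([1.0, 0.0, 1.0], [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [1.0, 1.0, 1.0]], true_index=2, k=2)` counts as a hit because the third candidate is the second most similar to the query, while the same call with `k=1` does not. If the card reports these metrics, it may be worth stating which vectors are compared (for instance, predicted versus Morgan fingerprint targets) and how ties are handled.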