daslab-testing/CloverLM

Text Generation

low-precision-training

mansaripo committed on Mar 19

Commit 389961d · verified · 1 Parent(s): 812abc2

Update README.md

Files changed (1)

README.md +2 -2


@@ -78,7 +78,7 @@ OPT-175B baselines from the [BigScience evaluation repository](https://github.co
 |---|---|---:|---:|
 | Wikitext | bits per byte ↓ | 0.723 | — |
 | LAMBADA (OpenAI) | acc ↑ | 61.1 | **76.2** |
-| NQ-Open | exact match ↑ | 7.8 | **14.6** |
+| NQ | exact match ↑ | 7.8 | **14.6** |
 ### MMLU (590k checkpoint)
@@ -180,7 +180,7 @@ The model uses 264 weight tensors totaling ~4.14 B parameters.
 - **English only**: The TokenMonster vocabulary and ClimbMix training data are English-centric.
 - **No instruction tuning**: This is a base pretrained model, not fine-tuned for instruction following or chat.
 - **Contamination risk**: ClimbMix optimizes mixture weights against benchmark scores, and the upstream datasets (Nemotron-CC, SmolLM-Corpus) do not investigate benchmark contamination. Strong results should be interpreted with caution.
-- **Generative benchmarks**: The model is notably weaker on open-ended generation tasks (LAMBADA, NQ-Open) compared to the 175B baselines, reflecting the scale gap on tasks that require deeper knowledge recall.
+- **Generative benchmarks**: The model is notably weaker on open-ended generation tasks (LAMBADA, NQ) compared to the 175B baselines, reflecting the scale gap on tasks that require deeper knowledge recall.
 ## Citation