mansaripo committed · Commit 9d9ad39 · verified · 1 parent: efc655d

Upload README.md with huggingface_hub

Files changed (1): README.md (+195, -0)
---
language:
- en
license: mit
library_name: transformers
tags:
- causal-lm
- quartet-ii
- nvfp4
- low-precision-training
- pretrained
datasets:
- nvidia/ClimbMix
pipeline_tag: text-generation
---

# CloverLM

CloverLM is a **4-billion-parameter** dense decoder-only language model pretrained entirely in **native NVFP4** precision using the [Quartet II](https://github.com/IST-DASLab/Quartet-II) algorithm.
Trained on the [ClimbMix](https://arxiv.org/abs/2504.13161) data mixture for approximately **310 billion tokens** on 8 NVIDIA B300 GPUs in roughly 8 days, CloverLM reaches zero-shot accuracy competitive with OPT-175B on a standard evaluation suite, at a fraction of the cost.

## Model Details

| Property | Value |
|---|---|
| **Parameters** | ~4.06 B (29 blocks, 28 attention heads, d_head = 128) |
| **Hidden dimension** | 3,584 |
| **GQA ratio** | 4 (7 KV heads) |
| **Context length** | 1,024 tokens |
| **Vocabulary** | 32,000 ([TokenMonster](https://github.com/alasdairforsythe/tokenmonster), `englishcode-32000-strict-nocapcode-v1`) |
| **Normalization** | RMSNorm (post-attention, post-MLP) |
| **Activation** | Squared ReLU |
| **Position encoding** | Rotary (RoPE) |
| **Weight tying** | Yes (embedding = output projection) |
| **Precision** | Quartet II NVFP4 linear layers; embeddings and norms in BF16 |
| **Attention** | Configurable: PyTorch SDPA, Flash Attention 2/3/4 |
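The GQA ratio in the table means each of the 7 KV heads is shared by four of the 28 query heads. A minimal, purely illustrative sketch of that head grouping (not the model's actual implementation):

```python
# Illustrative GQA head mapping: 28 query heads share 7 KV heads (ratio 4).
N_QUERY_HEADS = 28
N_KV_HEADS = 7
GROUP_SIZE = N_QUERY_HEADS // N_KV_HEADS  # 4 query heads per KV head

def kv_head_for(query_head: int) -> int:
    """Return the KV head index a given query head attends with."""
    return query_head // GROUP_SIZE

# Query heads 0-3 use KV head 0, heads 4-7 use KV head 1, and so on.
mapping = [kv_head_for(h) for h in range(N_QUERY_HEADS)]
```

Sharing KV heads this way shrinks the KV cache by 4x relative to full multi-head attention at the same query-head count.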
## Training

| Property | Value |
|---|---|
| **Data** | [ClimbMix](https://arxiv.org/abs/2504.13161) (from Nemotron-CC + SmolLM-Corpus), ~305 B tokens |
| **Tokenizer** | [TokenMonster](https://huggingface.co/gvlassis/tokenmonster/resolve/main/englishcode-32000-strict-nocapcode-v1-eot%3D14199.vocab) (ungreedy subword, not BPE) |
| **Sampled tokens** | ~309.3 B (590k steps) |
| **Optimizer** | Adam, peak LR 3×10⁻³ |
| **Hardware** | 1 × 8-GPU NVIDIA B300 SXM6 node |
| **Wall-clock time** | ~8 days |
| **Throughput** | ~50–54k tokens/s/GPU |
| **Quantization** | Quartet II native NVFP4 training ([Panferov et al., 2026](https://arxiv.org/abs/2601.22813)) |
| **Estimated cost** | $4,600–$10,700 depending on spot vs. on-demand pricing ([Verda](https://verda.com/b300)) |

## Evaluation Results

All evaluations are zero-shot using the [EleutherAI lm-eval harness](https://github.com/EleutherAI/lm-evaluation-harness) v0.4.11.
The model is loaded via a custom `CloverLMHFLM` wrapper in BF16 with Quartet II kernels.

### Compact Zero-Shot Suite

| Task | Metric | CloverLM (590k) | OPT-175B | GPT-3 175B |
|---|---|---:|---:|---:|
| ARC-Challenge | acc | **46.3** | 41.2 | — |
| ARC-Challenge | acc_mutual_info | 50.9 | — | **51.4** |
| ARC-Easy | acc | **80.0** | 75.1 | — |
| ARC-Easy | acc_mutual_info | **72.4** | — | 68.8 |
| HellaSwag | acc_norm | 71.7 | **78.3** | **78.9** |
| PIQA | acc_norm | 80.6 | **81.2** | 81.0 |
| **Avg (OPT-style)** | | **69.6** | 69.0 | — |
| **Avg (GPT-3-style)** | | 68.9 | — | **70.0** |

**OPT-style average** = mean(ARC-C `acc`, ARC-E `acc`, HellaSwag `acc_norm`, PIQA `acc_norm`).
**GPT-3-style average** = mean(ARC-C `acc_mutual_info`, ARC-E `acc_mutual_info`, HellaSwag `acc_norm`, PIQA `acc_norm`).
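The two averages can be reproduced from the rounded per-task entries above; since the reported averages are presumably taken over unrounded scores, the last decimal may differ slightly. A quick check:

```python
# Recompute the two headline averages from the rounded CloverLM scores above.
scores = {"arc_c_acc": 46.3, "arc_e_acc": 80.0,
          "arc_c_mi": 50.9, "arc_e_mi": 72.4,
          "hellaswag": 71.7, "piqa": 80.6}

opt_style = (scores["arc_c_acc"] + scores["arc_e_acc"]
             + scores["hellaswag"] + scores["piqa"]) / 4   # 69.65 from rounded inputs; reported as 69.6
gpt3_style = (scores["arc_c_mi"] + scores["arc_e_mi"]
              + scores["hellaswag"] + scores["piqa"]) / 4  # 68.9, matching the table
```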
OPT-175B baselines are from the [BigScience evaluation repository](https://github.com/bigscience-workshop/bigscience/blob/master/evaluation/results/tr11/opt/bslmeval.json).

### Extended Benchmarks (590k checkpoint)

| Task | Metric | CloverLM | GPT-3 175B |
|---|---|---:|---:|
| Wikitext | bits per byte ↓ | 0.723 | — |
| LAMBADA (OpenAI) | acc ↑ | 61.1 | **76.2** |
| NQ-Open | exact match ↑ | 7.8 | **14.6** |
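Bits per byte relates to byte-level perplexity via ppl = 2^bpb, so the Wikitext score above corresponds to a byte-level perplexity of roughly 1.65:

```python
# Convert bits per byte (bpb) to byte-level perplexity: ppl = 2 ** bpb.
bpb = 0.723            # Wikitext bits per byte from the table above
byte_ppl = 2.0 ** bpb  # about 1.65
```

Note this is per byte, not per word; comparing against word-level perplexities would additionally require the average bytes per word of the corpus.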
### MMLU (590k checkpoint)

| Category | 0-shot | Few-shot |
|---|---:|---:|
| Humanities | 35.4 | 35.7 |
| Social Sciences | 42.1 | 47.1 |
| STEM | 37.2 | 39.0 |
| Other | 45.2 | 49.1 |
| **Overall** | 39.4 | **41.9** |
| *OPT-175B* | — | *31.8* |
| *GPT-3 175B* | — | *43.9* |

Few-shot MMLU accuracy (41.9%) substantially exceeds OPT-175B (31.8%) and approaches GPT-3 175B (43.9%).

### Full lm-eval Output (Quartet II kernels)

```
|     Tasks      |Version|Filter|n-shot|    Metric     |   |Value |   |Stderr|
|----------------|------:|------|-----:|---------------|---|-----:|---|-----:|
|arc_challenge_mi|      1|none  |     0|acc            |↑  |0.4625|±  |0.0146|
|                |       |none  |     0|acc_mutual_info|↑  |0.5094|±  |0.0146|
|                |       |none  |     0|acc_norm       |↑  |0.4923|±  |0.0146|
|arc_easy_mi     |      1|none  |     0|acc            |↑  |0.7997|±  |0.0082|
|                |       |none  |     0|acc_mutual_info|↑  |0.7239|±  |0.0092|
|                |       |none  |     0|acc_norm       |↑  |0.7731|±  |0.0086|
|hellaswag       |      1|none  |     0|acc            |↑  |0.5392|±  |0.0050|
|                |       |none  |     0|acc_norm       |↑  |0.7167|±  |0.0045|
|piqa            |      1|none  |     0|acc            |↑  |0.7922|±  |0.0095|
|                |       |none  |     0|acc_norm       |↑  |0.8058|±  |0.0092|
```
## Usage

### Quick Start

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "daslab-testing/CloverLM",
    trust_remote_code=True,
    torch_dtype="bfloat16",
)
tokenizer = AutoTokenizer.from_pretrained(
    "daslab-testing/CloverLM",
    trust_remote_code=True,
)

input_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids
output = model.generate(input_ids.to(model.device), max_new_tokens=32)
print(tokenizer.decode(output[0]))
```
### Running Evaluations

See the [`lm_eval/`](lm_eval/) directory for the full evaluation setup.

```bash
cd lm_eval
uv sync
source .venv/bin/activate

accelerate launch eval.py \
    --model cloverlm \
    --model_args "pretrained=daslab-testing/CloverLM,dtype=bfloat16,quartet_2_impl=quartet2,attn_backend=pytorch" \
    --tasks "arc_easy_mi,arc_challenge_mi,hellaswag,piqa" \
    --num_fewshot 0 \
    --include_path ./ \
    --trust_remote_code \
    --confirm_run_unsafe_code \
    --batch_size auto
```

Use `quartet_2_impl=pseudoquant` on non-Blackwell GPUs (Triton-based FP4 emulation).
Attention backend options: `pytorch` (default), `flash2`, `flash3`, `flash4`.

### Dependencies

- Python ≥ 3.11
- PyTorch 2.10+ with CUDA 13.0
- `transformers ≥ 5.3.0`
- `tokenmonster ≥ 1.1.12`
- [Quartet II kernels](https://github.com/IST-DASLab/Quartet-II) (for native FP4; `pseudoquant` mode works without them)
## Architecture Details

CloverLM is a decoder-only Transformer loosely following the OLMo2 design.
Each block applies multi-head self-attention (with grouped-query attention at ratio 4) followed by a squared-ReLU MLP, both with post-sublayer RMSNorm and residual connections.
Query and key projections use RoPE and are sphere-normalized before scaling.
All dense linear layers (the Q, K, V, and O projections and the MLP layers) use Quartet II NVFP4 quantization during both training and inference.
Embeddings, layer norms, and the output head remain in BF16.
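A scalar-level sketch of the two nonlinearity/normalization choices named above (illustrative only; the learned RMSNorm scale and real tensor shapes are omitted):

```python
import math

def squared_relu(xs):
    """Squared ReLU: max(0, x) ** 2, applied elementwise in the MLP."""
    return [max(0.0, x) ** 2 for x in xs]

def rms_norm(xs, eps=1e-6):
    """RMSNorm without the learned scale: x / sqrt(mean(x^2) + eps)."""
    rms = math.sqrt(sum(x * x for x in xs) / len(xs) + eps)
    return [x / rms for x in xs]

# In a post-sublayer-norm block, the sublayer output is normalized before
# rejoining the residual stream, e.g. residual + rms_norm(mlp_out),
# rather than the norm-first (pre-norm) ordering.
```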
The model uses 264 weight tensors totaling ~4.14 B parameters.
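As a rough cross-check, the parameter count can be tallied by hand from the Model Details table. The MLP width is not stated in this card, so a 4x expansion is assumed here; the small gap to the reported total would come from norms and any difference in the actual MLP width:

```python
# Back-of-the-envelope parameter tally from the Model Details table.
# ASSUMPTION: 4x MLP expansion with two weight matrices (width not stated above).
d_model, n_layers = 3584, 29
n_heads, n_kv_heads, d_head = 28, 7, 128
vocab = 32000

attn = (d_model * n_heads * d_head           # Q projection
        + 2 * d_model * n_kv_heads * d_head  # K and V (GQA, 7 heads)
        + n_heads * d_head * d_model)        # output projection
mlp = 2 * d_model * (4 * d_model)            # up + down projections (assumed 4x)
embed = vocab * d_model                      # tied with the output head

total = n_layers * (attn + mlp) + embed      # ~4.03 B, near the reported ~4.06 B
```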
## Limitations

- **Short context**: Trained with a 1,024-token context window. Performance on long-context or open-ended generation tasks may be limited.
- **English only**: The TokenMonster vocabulary and ClimbMix training data are English-centric.
- **No instruction tuning**: This is a base pretrained model, not fine-tuned for instruction following or chat.
- **Contamination risk**: ClimbMix optimizes mixture weights against benchmark scores, and the upstream datasets (Nemotron-CC, SmolLM-Corpus) do not investigate benchmark contamination. Strong results should be interpreted with caution.
- **Generative benchmarks**: The model is notably weaker on open-ended generation tasks (LAMBADA, NQ-Open) than the 175B baselines, reflecting the scale gap on tasks that require deeper knowledge recall.

## Citation

```bibtex
@article{cloverlm2026,
  title  = {Speedrunning GPT3: Pretraining an OPT-175B-Quality Model Cheaply
            by Leveraging Native NVFP4},
  author = {Erik Schultheis and Matin Ansaripour and Andrei Panferov and
            Georgios Vlassis and Dan Alistarh},
  year   = {2026},
}
```