---
language:
- en
license: mit
library_name: transformers
tags:
- causal-lm
- quartet-ii
- nvfp4
- low-precision-training
- pretrained
datasets:
- nvidia/ClimbMix
pipeline_tag: text-generation
---

# CloverLM

CloverLM is a **4-billion-parameter** dense decoder-only language model pretrained entirely in **native NVFP4** precision using the [Quartet II](https://github.com/IST-DASLab/Quartet-II) algorithm.
Trained on the [ClimbMix](https://arxiv.org/abs/2504.13161) data mixture for approximately **310 billion tokens** on 8 NVIDIA B300 GPUs in roughly 8 days, CloverLM reaches zero-shot accuracy competitive with OPT-175B on a standard evaluation suite, at a fraction of the cost.

## Model Details

| Property | Value |
|---|---|
| **Parameters** | ~4.06 B (29 blocks, 28 attention heads, d_head=128) |
| **Hidden dimension** | 3,584 |
| **GQA ratio** | 4 (7 KV heads) |
| **Context length** | 1,024 tokens |
| **Vocabulary** | 32,000 ([TokenMonster](https://github.com/alasdairforsythe/tokenmonster), `englishcode-32000-strict-nocapcode-v1`) |
| **Normalization** | RMSNorm (post-attention, post-MLP) |
| **Activation** | Squared ReLU |
| **Position encoding** | Rotary (RoPE) |
| **Weight tying** | Yes (embedding = output projection) |
| **Precision** | Quartet II NVFP4 linear layers; embeddings, norms in BF16 |
| **Attention** | Configurable: PyTorch SDPA, Flash Attention 2/3/4 |
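
As a sanity check, the shape figures above roughly reproduce the stated parameter count. The MLP width is not listed in this card, so the sketch below *assumes* `d_ff = 4 * d_model = 14336` (a common choice); RMSNorm weights and any different `d_ff` account for the small gap to ~4.06 B.

```python
# Back-of-envelope parameter count from the table above.
# ASSUMPTION: d_ff = 4 * d_model is not stated in the card.
d_model, n_blocks, n_heads, d_head = 3584, 29, 28, 128
n_kv_heads, vocab, d_ff = 7, 32_000, 4 * 3584

attn = (d_model * n_heads * d_head           # Q projection
        + 2 * d_model * n_kv_heads * d_head  # K, V (GQA, 7 KV heads)
        + n_heads * d_head * d_model)        # output projection
mlp = 2 * d_model * d_ff                     # up + down projections
embed = vocab * d_model                      # tied with the output head

total = n_blocks * (attn + mlp) + embed
print(f"{total / 1e9:.2f} B parameters")     # → 4.03 B
```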

## Training

| Property | Value |
|---|---|
| **Data** | [ClimbMix](https://arxiv.org/abs/2504.13161) (from Nemotron-CC + SmolLM-Corpus), ~305 B tokens |
| **Tokenizer** | [TokenMonster](https://huggingface.co/gvlassis/tokenmonster/resolve/main/englishcode-32000-strict-nocapcode-v1-eot%3D14199.vocab) (ungreedy subword, not BPE) |
| **Sampled tokens** | ~309.3 B (590k steps) |
| **Optimizer** | Adam, peak LR 3×10⁻³ |
| **Hardware** | 1 × 8-GPU NVIDIA B300 SXM6 node |
| **Wall-clock time** | ~8 days |
| **Throughput** | ~50–54k tokens/s/GPU |
| **Quantization** | Quartet II native NVFP4 training ([Panferov et al., 2026](https://arxiv.org/abs/2601.22813)) |
| **Estimated cost** | $4,600–$10,700 depending on spot vs. on-demand pricing ([Verda](https://verda.com/b300)) |
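
The wall-clock, throughput, and token-count figures in the table can be cross-checked with simple arithmetic; given the "~" rounding on all three numbers, they are mutually consistent.

```python
# Cross-check: wall-clock days = tokens / (rate * GPUs * seconds/day).
tokens = 309.3e9                       # sampled tokens from the table
gpus, sec_per_day = 8, 86_400
days_at = {rate: tokens / (rate * gpus * sec_per_day)
           for rate in (50_000, 54_000)}   # tokens/s/GPU range from the table
for rate, days in days_at.items():
    print(f"{rate // 1000}k tok/s/GPU -> {days:.1f} days")
# Yields roughly 8.3-8.9 days, in line with the quoted ~8-day run.
```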

## Evaluation Results

All evaluations are zero-shot using the [EleutherAI lm-eval harness](https://github.com/EleutherAI/lm-evaluation-harness) v0.4.11.
The model is loaded via a custom `CloverLMHFLM` wrapper in BF16 with Quartet II kernels.

### Compact Zero-Shot Suite

| Task | Metric | CloverLM (590k) | OPT-175B | GPT-3 175B |
|---|---|---:|---:|---:|
| ARC-Challenge | acc | **46.3** | 41.2 | — |
| ARC-Challenge | acc_mutual_info | 50.9 | — | **51.4** |
| ARC-Easy | acc | **80.0** | 75.1 | — |
| ARC-Easy | acc_mutual_info | **72.4** | — | 68.8 |
| HellaSwag | acc_norm | 71.7 | **78.3** | **78.9** |
| PIQA | acc_norm | 80.6 | **81.2** | 81.0 |
| **Avg (OPT-style)** | | **69.6** | 69.0 | — |
| **Avg (GPT-3-style)** | | 68.9 | — | **70.0** |

**OPT-style average** = mean(ARC-C `acc`, ARC-E `acc`, HellaSwag `acc_norm`, PIQA `acc_norm`).
**GPT-3-style average** = mean(ARC-C `acc_mutual_info`, ARC-E `acc_mutual_info`, HellaSwag `acc_norm`, PIQA `acc_norm`).
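
Both averages are straightforward to reproduce from the CloverLM column above:

```python
# Recompute the two reported averages from the table.
opt_style  = (46.3 + 80.0 + 71.7 + 80.6) / 4  # ARC-C acc, ARC-E acc, HellaSwag, PIQA
gpt3_style = (50.9 + 72.4 + 71.7 + 80.6) / 4  # swaps in the acc_mutual_info ARC numbers
print(opt_style, gpt3_style)  # matches the 69.6 / 68.9 row above up to rounding
```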

OPT-175B baselines from the [BigScience evaluation repository](https://github.com/bigscience-workshop/bigscience/blob/master/evaluation/results/tr11/opt/bslmeval.json).

### Extended Benchmarks (590k checkpoint)

| Task | Metric | CloverLM | GPT-3 175B |
|---|---|---:|---:|
| Wikitext | bits per byte ↓ | 0.723 | — |
| LAMBADA (OpenAI) | acc ↑ | 61.1 | **76.2** |
| NQ | exact match ↑ | 7.8 | **14.6** |

### MMLU (590k checkpoint)

| Category | 0-shot | Few-shot |
|---|---:|---:|
| Humanities | 35.4 | 35.7 |
| Social Sciences | 42.1 | 47.1 |
| STEM | 37.2 | 39.0 |
| Other | 45.2 | 49.1 |
| **Overall** | 39.4 | **41.9** |
| *OPT-175B* | — | *31.8* |
| *GPT-3 175B* | — | *43.9* |

Few-shot MMLU accuracy (41.9%) substantially exceeds OPT-175B (31.8%) and approaches GPT-3 175B (43.9%).

### Full lm-eval Output (Quartet II kernels)

```
|     Tasks      |Version|Filter|n-shot|    Metric     |   |Value |   |Stderr|
|----------------|------:|------|-----:|---------------|---|-----:|---|-----:|
|arc_challenge_mi|      1|none  |     0|acc            |↑  |0.4625|±  |0.0146|
|                |       |none  |     0|acc_mutual_info|↑  |0.5094|±  |0.0146|
|                |       |none  |     0|acc_norm       |↑  |0.4923|±  |0.0146|
|arc_easy_mi     |      1|none  |     0|acc            |↑  |0.7997|±  |0.0082|
|                |       |none  |     0|acc_mutual_info|↑  |0.7239|±  |0.0092|
|                |       |none  |     0|acc_norm       |↑  |0.7731|±  |0.0086|
|hellaswag       |      1|none  |     0|acc            |↑  |0.5392|±  |0.0050|
|                |       |none  |     0|acc_norm       |↑  |0.7167|±  |0.0045|
|piqa            |      1|none  |     0|acc            |↑  |0.7922|±  |0.0095|
|                |       |none  |     0|acc_norm       |↑  |0.8058|±  |0.0092|
```

## Usage

### Quick Start

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "daslab-testing/CloverLM",
    trust_remote_code=True,
    dtype="bfloat16",
    quartet_2_impl="pseudoquant",  # FP4 emulation; use "quartet2" for native NVFP4 kernels (Blackwell GPUs)
).to("cuda")  # or "cpu" for CPU-only usage

tokenizer = AutoTokenizer.from_pretrained(
    "daslab-testing/CloverLM",
    trust_remote_code=True,
)

input_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids
output = model.generate(input_ids.to(model.device), max_new_tokens=32)
print(tokenizer.decode(output[0]))
```
Note that `quartet_2_impl="quartet2"` only supports inputs with `(micro_batch_size * seq_length) % 128 == 0`.
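
When batching with the native kernels, sequences therefore need right-padding up to a multiple of 128. A minimal sketch of the length arithmetic (the helper name is illustrative, not part of the released code):

```python
# Hypothetical helper: round a sequence length up to the next multiple of 128
# so that micro_batch_size * seq_length satisfies the quartet2 constraint.
def pad_to_multiple(n: int, multiple: int = 128) -> int:
    """Return the smallest length >= n that is divisible by `multiple`."""
    return n + (-n) % multiple

print(pad_to_multiple(7))    # → 128
print(pad_to_multiple(128))  # → 128
print(pad_to_multiple(129))  # → 256
```

In practice you would pad `input_ids` (and the attention mask) to this length with the tokenizer's pad token before calling the model.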

### Running Evaluations

See the [`lm_eval/`](lm_eval/) directory for the full evaluation setup.

```bash
cd lm_eval
uv sync
source .venv/bin/activate

accelerate launch eval.py \
    --model cloverlm \
    --model_args "pretrained=daslab-testing/CloverLM,dtype=bfloat16,quartet_2_impl=quartet2,attn_backend=pytorch" \
    --tasks "arc_easy_mi,arc_challenge_mi,hellaswag,piqa" \
    --num_fewshot 0 \
    --include_path ./ \
    --trust_remote_code \
    --confirm_run_unsafe_code \
    --batch_size auto
```

Use `quartet_2_impl=pseudoquant` on non-Blackwell GPUs (uses Triton-based FP4 emulation).
Attention backend options: `pytorch` (default), `flash2`, `flash3`, `flash4`.

### Serving with vLLM

CloverLM can be served using [vLLM](https://github.com/vllm-project/vllm) with a custom Quartet II quantization plugin. See [`vllm_plugin/SERVING.md`](vllm_plugin/SERVING.md) for full setup instructions.

### Dependencies

- Python ≥ 3.11
- PyTorch 2.10+ with CUDA 13.0
- `transformers ≥ 5.3.0`
- `tokenmonster ≥ 1.1.12`
- [Quartet II kernels](https://github.com/IST-DASLab/Quartet-II)

## Architecture Details

CloverLM is a decoder-only Transformer loosely following the OLMo2 design.
Each block applies multi-head self-attention (with grouped-query attention at ratio 4) followed by a squared-ReLU MLP, both with post-sublayer RMSNorm and residual connections.
Query and key projections use RoPE and are sphere-normalized before scaling.
All dense linear layers (Q, K, V, O projections and MLP layers) use Quartet II NVFP4 quantization during both training and inference.
Embeddings, layer norms, and the output head remain in BF16.

The model uses 264 weight tensors totaling ~4.14 B parameters.
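
The sublayer ordering described above (RMSNorm applied to each sublayer's *output* before the residual add, plus a squared-ReLU MLP) can be sketched in plain PyTorch. This is an illustrative simplification, not the released implementation: it omits RoPE, GQA, query/key normalization, causal masking, and the NVFP4-quantized linears, and all class names are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    def __init__(self, d: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(d))
        self.eps = eps

    def forward(self, x):
        return self.weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

class SquaredReLUMLP(nn.Module):
    def __init__(self, d: int, d_ff: int):
        super().__init__()
        self.up = nn.Linear(d, d_ff, bias=False)
        self.down = nn.Linear(d_ff, d, bias=False)

    def forward(self, x):
        return self.down(F.relu(self.up(x)).pow(2))  # squared ReLU activation

class Block(nn.Module):
    """Post-sublayer-norm block: x = x + norm(sublayer(x)), OLMo2-style."""
    def __init__(self, d: int, n_heads: int, d_ff: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, n_heads, bias=False, batch_first=True)
        self.attn_norm = RMSNorm(d)
        self.mlp = SquaredReLUMLP(d, d_ff)
        self.mlp_norm = RMSNorm(d)

    def forward(self, x):
        a, _ = self.attn(x, x, x, need_weights=False)
        x = x + self.attn_norm(a)       # norm on sublayer output, then residual
        x = x + self.mlp_norm(self.mlp(x))
        return x
```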

## Limitations

- **Short context**: Trained with a 1,024-token context window. Performance on long-context or open-ended generation tasks may be limited.
- **English only**: The TokenMonster vocabulary and ClimbMix training data are English-centric.
- **No instruction tuning**: This is a base pretrained model, not fine-tuned for instruction following or chat.
- **Contamination risk**: ClimbMix optimizes mixture weights against benchmark scores, and the upstream datasets (Nemotron-CC, SmolLM-Corpus) do not investigate benchmark contamination. Strong results should be interpreted with caution.
- **Generative benchmarks**: The model is notably weaker on open-ended generation tasks (LAMBADA, NQ) compared to the 175B baselines, reflecting the scale gap on tasks that require deeper knowledge recall.

## Citation

```bibtex
@article{cloverlm2026,
  title   = {Speedrunning GPT3: Pretraining an OPT-175B-Quality Model Cheaply
             by Leveraging Native NVFP4},
  author  = {Erik Schultheis and Georgios Vlassis and Matin Ansaripour and
             Andrei Panferov and Dan Alistarh},
  year    = {2026},
}
```