---
language:
- en
license: mit
library_name: transformers
tags:
- causal-lm
- quartet-ii
- nvfp4
- low-precision-training
- pretrained
datasets:
- nvidia/ClimbMix
pipeline_tag: text-generation
---

# CloverLM

CloverLM is a **4-billion-parameter** dense decoder-only language model pretrained entirely in **native NVFP4** precision using the [Quartet II](https://github.com/IST-DASLab/Quartet-II) algorithm.
Trained on the [ClimbMix](https://arxiv.org/abs/2504.13161) data mixture for approximately **310 billion tokens** on 8 NVIDIA B300 GPUs in roughly 8 days, CloverLM reaches zero-shot accuracy competitive with OPT-175B on a standard evaluation suite, at a fraction of the cost.

## Model Details

| Property | Value |
|---|---|
| **Parameters** | ~4.06 B (29 blocks, 28 attention heads, d_head=128) |
| **Hidden dimension** | 3,584 |
| **GQA ratio** | 4 (7 KV heads) |
| **Context length** | 1,024 tokens |
| **Vocabulary** | 32,000 ([TokenMonster](https://github.com/alasdairforsythe/tokenmonster), `englishcode-32000-strict-nocapcode-v1`) |
| **Normalization** | RMSNorm (post-attention, post-MLP) |
| **Activation** | Squared ReLU |
| **Position encoding** | Rotary (RoPE) |
| **Weight tying** | Yes (embedding = output projection) |
| **Precision** | Quartet II NVFP4 linear layers; embeddings, norms in BF16 |
| **Attention** | Configurable: PyTorch SDPA, Flash Attention 2/3/4 |
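
As a sanity check, the shape figures above roughly reproduce the stated parameter count. The MLP width is not listed in this card, so the sketch below *assumes* `d_ff = 4 * d_model = 14336` (a common choice); RMSNorm weights and any different `d_ff` account for the small gap to ~4.06 B.

```python
# Back-of-envelope parameter count from the table above.
# ASSUMPTION: d_ff = 4 * d_model is not stated in the card.
d_model, n_blocks, n_heads, d_head = 3584, 29, 28, 128
n_kv_heads, vocab, d_ff = 7, 32_000, 4 * 3584

attn = (d_model * n_heads * d_head           # Q projection
        + 2 * d_model * n_kv_heads * d_head  # K, V (GQA, 7 KV heads)
        + n_heads * d_head * d_model)        # output projection
mlp = 2 * d_model * d_ff                     # up + down projections
embed = vocab * d_model                      # tied with the output head

total = n_blocks * (attn + mlp) + embed
print(f"{total / 1e9:.2f} B parameters")     # → 4.03 B
```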

## Training

| Property | Value |
|---|---|
| **Data** | [ClimbMix](https://arxiv.org/abs/2504.13161) (from Nemotron-CC + SmolLM-Corpus), ~305 B tokens |
| **Tokenizer** | [TokenMonster](https://huggingface.co/gvlassis/tokenmonster/resolve/main/englishcode-32000-strict-nocapcode-v1-eot%3D14199.vocab) (ungreedy subword, not BPE) |
| **Sampled tokens** | ~309.3 B (590k steps) |
| **Optimizer** | Adam, peak LR 3×10⁻³ |
| **Hardware** | 1 × 8-GPU NVIDIA B300 SXM6 node |
| **Wall-clock time** | ~8 days |
| **Throughput** | ~50–54k tokens/s/GPU |
| **Quantization** | Quartet II native NVFP4 training ([Panferov et al., 2026](https://arxiv.org/abs/2601.22813)) |
| **Estimated cost** | $4,600–$10,700 depending on spot vs. on-demand pricing ([Verda](https://verda.com/b300)) |
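
The wall-clock, throughput, and token-count figures in the table can be cross-checked with simple arithmetic; given the "~" rounding on all three numbers, they are mutually consistent.

```python
# Cross-check: wall-clock days = tokens / (rate * GPUs * seconds/day).
tokens = 309.3e9                       # sampled tokens from the table
gpus, sec_per_day = 8, 86_400
days_at = {rate: tokens / (rate * gpus * sec_per_day)
           for rate in (50_000, 54_000)}   # tokens/s/GPU range from the table
for rate, days in days_at.items():
    print(f"{rate // 1000}k tok/s/GPU -> {days:.1f} days")
# Yields roughly 8.3-8.9 days, in line with the quoted ~8-day run.
```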

## Evaluation Results

All evaluations are zero-shot using the [EleutherAI lm-eval harness](https://github.com/EleutherAI/lm-evaluation-harness) v0.4.11.
The model is loaded via a custom `CloverLMHFLM` wrapper in BF16 with Quartet II kernels.

### Compact Zero-Shot Suite

| Task | Metric | CloverLM (590k) | OPT-175B | GPT-3 175B |
|---|---|---:|---:|---:|
| ARC-Challenge | acc | **46.3** | 41.2 | — |
| ARC-Challenge | acc_mutual_info | 50.9 | — | **51.4** |
| ARC-Easy | acc | **80.0** | 75.1 | — |
| ARC-Easy | acc_mutual_info | **72.4** | — | 68.8 |
| HellaSwag | acc_norm | 71.7 | **78.3** | **78.9** |
| PIQA | acc_norm | 80.6 | **81.2** | 81.0 |
| **Avg (OPT-style)** | | **69.6** | 69.0 | — |
| **Avg (GPT-3-style)** | | 68.9 | — | **70.0** |

**OPT-style average** = mean(ARC-C `acc`, ARC-E `acc`, HellaSwag `acc_norm`, PIQA `acc_norm`).
**GPT-3-style average** = mean(ARC-C `acc_mutual_info`, ARC-E `acc_mutual_info`, HellaSwag `acc_norm`, PIQA `acc_norm`).
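
Both averages are straightforward to reproduce from the CloverLM column above:

```python
# Recompute the two reported averages from the table.
opt_style  = (46.3 + 80.0 + 71.7 + 80.6) / 4  # ARC-C acc, ARC-E acc, HellaSwag, PIQA
gpt3_style = (50.9 + 72.4 + 71.7 + 80.6) / 4  # swaps in the acc_mutual_info ARC numbers
print(opt_style, gpt3_style)  # matches the 69.6 / 68.9 row above up to rounding
```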

OPT-175B baselines from the [BigScience evaluation repository](https://github.com/bigscience-workshop/bigscience/blob/master/evaluation/results/tr11/opt/bslmeval.json).

### Extended Benchmarks (590k checkpoint)

| Task | Metric | CloverLM | GPT-3 175B |
|---|---|---:|---:|
| Wikitext | bits per byte ↓ | 0.723 | — |
| LAMBADA (OpenAI) | acc ↑ | 61.1 | **76.2** |
| NQ | exact match ↑ | 7.8 | **14.6** |

### MMLU (590k checkpoint)

| Category | 0-shot | Few-shot |
|---|---:|---:|
| Humanities | 35.4 | 35.7 |
| Social Sciences | 42.1 | 47.1 |
| STEM | 37.2 | 39.0 |
| Other | 45.2 | 49.1 |
| **Overall** | 39.4 | **41.9** |
| *OPT-175B* | — | *31.8* |
| *GPT-3 175B* | — | *43.9* |

Few-shot MMLU accuracy (41.9%) substantially exceeds OPT-175B (31.8%) and approaches GPT-3 175B (43.9%).

### Full lm-eval Output (Quartet II kernels)

```
|     Tasks      |Version|Filter|n-shot|    Metric     |   |Value |   |Stderr|
|----------------|------:|------|-----:|---------------|---|-----:|---|-----:|
|arc_challenge_mi|      1|none  |     0|acc            |↑  |0.4625|±  |0.0146|
|                |       |none  |     0|acc_mutual_info|↑  |0.5094|±  |0.0146|
|                |       |none  |     0|acc_norm       |↑  |0.4923|±  |0.0146|
|arc_easy_mi     |      1|none  |     0|acc            |↑  |0.7997|±  |0.0082|
|                |       |none  |     0|acc_mutual_info|↑  |0.7239|±  |0.0092|
|                |       |none  |     0|acc_norm       |↑  |0.7731|±  |0.0086|
|hellaswag       |      1|none  |     0|acc            |↑  |0.5392|±  |0.0050|
|                |       |none  |     0|acc_norm       |↑  |0.7167|±  |0.0045|
|piqa            |      1|none  |     0|acc            |↑  |0.7922|±  |0.0095|
|                |       |none  |     0|acc_norm       |↑  |0.8058|±  |0.0092|
```

## Usage

### Quick Start

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "daslab-testing/CloverLM",
    trust_remote_code=True,
    dtype="bfloat16",
    quartet_2_impl="pseudoquant",  # FP4 emulation; use "quartet2" for native NVFP4 kernels (Blackwell GPUs)
).to("cuda")  # or "cpu" for CPU-only usage

tokenizer = AutoTokenizer.from_pretrained(
    "daslab-testing/CloverLM",
    trust_remote_code=True,
)

input_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids
output = model.generate(input_ids.to(model.device), max_new_tokens=32)
print(tokenizer.decode(output[0]))
```
Note that `quartet_2_impl="quartet2"` only supports inputs with `(micro_batch_size * seq_length) % 128 == 0`.
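
When batching with the native kernels, sequences therefore need right-padding up to a multiple of 128. A minimal sketch of the length arithmetic (the helper name is illustrative, not part of the released code):

```python
# Hypothetical helper: round a sequence length up to the next multiple of 128
# so that micro_batch_size * seq_length satisfies the quartet2 constraint.
def pad_to_multiple(n: int, multiple: int = 128) -> int:
    """Return the smallest length >= n that is divisible by `multiple`."""
    return n + (-n) % multiple

print(pad_to_multiple(7))    # → 128
print(pad_to_multiple(128))  # → 128
print(pad_to_multiple(129))  # → 256
```

In practice you would pad `input_ids` (and the attention mask) to this length with the tokenizer's pad token before calling the model.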

### Running Evaluations

See the [`lm_eval/`](lm_eval/) directory for the full evaluation setup.

```bash
cd lm_eval
uv sync
source .venv/bin/activate

accelerate launch eval.py \
    --model cloverlm \
    --model_args "pretrained=daslab-testing/CloverLM,dtype=bfloat16,quartet_2_impl=quartet2,attn_backend=pytorch" \
    --tasks "arc_easy_mi,arc_challenge_mi,hellaswag,piqa" \
    --num_fewshot 0 \
    --include_path ./ \
    --trust_remote_code \
    --confirm_run_unsafe_code \
    --batch_size auto
```

Use `quartet_2_impl=pseudoquant` on non-Blackwell GPUs (uses Triton-based FP4 emulation).
Attention backend options: `pytorch` (default), `flash2`, `flash3`, `flash4`.

### Serving with vLLM

CloverLM can be served using [vLLM](https://github.com/vllm-project/vllm) with a custom Quartet II quantization plugin. See [`vllm_plugin/SERVING.md`](vllm_plugin/SERVING.md) for full setup instructions.

### Dependencies

- Python ≥ 3.11
- PyTorch 2.10+ with CUDA 13.0
- `transformers ≥ 5.3.0`
- `tokenmonster ≥ 1.1.12`
- [Quartet II kernels](https://github.com/IST-DASLab/Quartet-II)

## Architecture Details

CloverLM is a decoder-only Transformer loosely following the OLMo2 design.
Each block applies multi-head self-attention (with grouped-query attention at ratio 4) followed by a squared-ReLU MLP, both with post-sublayer RMSNorm and residual connections.
Query and key projections use RoPE and are sphere-normalized before scaling.
All dense linear layers (Q, K, V, O projections and MLP layers) use Quartet II NVFP4 quantization during both training and inference.
Embeddings, layer norms, and the output head remain in BF16.

The model uses 264 weight tensors totaling ~4.14 B parameters.
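
The sublayer ordering described above (RMSNorm applied to each sublayer's *output* before the residual add, plus a squared-ReLU MLP) can be sketched in plain PyTorch. This is an illustrative simplification, not the released implementation: it omits RoPE, GQA, query/key normalization, causal masking, and the NVFP4-quantized linears, and all class names are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    def __init__(self, d: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(d))
        self.eps = eps

    def forward(self, x):
        return self.weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

class SquaredReLUMLP(nn.Module):
    def __init__(self, d: int, d_ff: int):
        super().__init__()
        self.up = nn.Linear(d, d_ff, bias=False)
        self.down = nn.Linear(d_ff, d, bias=False)

    def forward(self, x):
        return self.down(F.relu(self.up(x)).pow(2))  # squared ReLU activation

class Block(nn.Module):
    """Post-sublayer-norm block: x = x + norm(sublayer(x)), OLMo2-style."""
    def __init__(self, d: int, n_heads: int, d_ff: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, n_heads, bias=False, batch_first=True)
        self.attn_norm = RMSNorm(d)
        self.mlp = SquaredReLUMLP(d, d_ff)
        self.mlp_norm = RMSNorm(d)

    def forward(self, x):
        a, _ = self.attn(x, x, x, need_weights=False)
        x = x + self.attn_norm(a)       # norm on sublayer output, then residual
        x = x + self.mlp_norm(self.mlp(x))
        return x
```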

## Limitations

- **Short context**: Trained with a 1,024-token context window. Performance on long-context or open-ended generation tasks may be limited.
- **English only**: The TokenMonster vocabulary and ClimbMix training data are English-centric.
- **No instruction tuning**: This is a base pretrained model, not fine-tuned for instruction following or chat.
- **Contamination risk**: ClimbMix optimizes mixture weights against benchmark scores, and the upstream datasets (Nemotron-CC, SmolLM-Corpus) do not investigate benchmark contamination. Strong results should be interpreted with caution.
- **Generative benchmarks**: The model is notably weaker on open-ended generation tasks (LAMBADA, NQ) compared to the 175B baselines, reflecting the scale gap on tasks that require deeper knowledge recall.

## Citation

```bibtex
@article{cloverlm2026,
  title   = {Speedrunning GPT3: Pretraining an OPT-175B-Quality Model Cheaply
             by Leveraging Native NVFP4},
  author  = {Erik Schultheis and Georgios Vlassis and Matin Ansaripour and
             Andrei Panferov and Dan Alistarh},
  year    = {2026},
}
```