---
language:
- en
license: mit
library_name: transformers
tags:
- causal-lm
- quartet-ii
- nvfp4
- low-precision-training
- pretrained
datasets:
- nvidia/ClimbMix
pipeline_tag: text-generation
---

# CloverLM

CloverLM is a **4-billion-parameter** dense decoder-only language model pretrained entirely in **native NVFP4** precision using the [Quartet II](https://github.com/IST-DASLab/Quartet-II) algorithm.
Trained on the [ClimbMix](https://arxiv.org/abs/2504.13161) data mixture for approximately **310 billion tokens** on 8 NVIDIA B300 GPUs in roughly 8 days, CloverLM reaches zero-shot accuracy competitive with OPT-175B on a standard evaluation suite, at a fraction of the cost.

## Model Details

| Property | Value |
|---|---|
| **Parameters** | ~4.06 B (29 blocks, 28 attention heads, d_head=128) |
| **Hidden dimension** | 3,584 |
| **GQA ratio** | 4 (7 KV heads) |
| **Context length** | 1,024 tokens |
| **Vocabulary** | 32,000 ([TokenMonster](https://github.com/alasdairforsythe/tokenmonster), `englishcode-32000-strict-nocapcode-v1`) |
| **Normalization** | RMSNorm (post-attention, post-MLP) |
| **Activation** | Squared ReLU |
| **Position encoding** | Rotary (RoPE) |
| **Weight tying** | Yes (embedding = output projection) |
| **Precision** | Quartet II NVFP4 linear layers; embeddings, norms in BF16 |
| **Attention** | Configurable: PyTorch SDPA, Flash Attention 2/3/4 |

## Training

| Property | Value |
|---|---|
| **Data** | [ClimbMix](https://arxiv.org/abs/2504.13161) (from Nemotron-CC + SmolLM-Corpus), ~305 B tokens |
| **Tokenizer** | [TokenMonster](https://huggingface.co/gvlassis/tokenmonster/resolve/main/englishcode-32000-strict-nocapcode-v1-eot%3D14199.vocab) (ungreedy subword, not BPE) |
| **Sampled tokens** | ~309.3 B (590k steps) |
| **Optimizer** | Adam, peak LR 3×10⁻³ |
| **Hardware** | 1 × 8-GPU NVIDIA B300 SXM6 node |
| **Wall-clock time** | ~8 days |
| **Throughput** | ~50–54k tokens/s/GPU |
| **Quantization** | Quartet II native NVFP4 training ([Panferov et al., 2026](https://arxiv.org/abs/2601.22813)) |
| **Estimated cost** | $4,600–$10,700 depending on spot vs. on-demand pricing ([Verda](https://verda.com/b300)) |

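
The throughput, token count, and cost figures above can be cross-checked against each other. The snippet below is standalone arithmetic (not part of the training code) that rederives the implied wall-clock time and GPU-hours; the per-GPU-hour rate at the end is inferred here, not stated in the card.

```python
# Cross-check of the training figures above (standalone arithmetic).
tokens = 309.3e9                      # sampled tokens
gpus = 8                              # one 8-GPU B300 node

for tps in (50_000, 54_000):          # reported per-GPU throughput range
    seconds = tokens / (tps * gpus)
    gpu_hours = gpus * seconds / 3600
    print(f"{tps // 1000}k tok/s/GPU -> {seconds / 86_400:.1f} days, {gpu_hours:,.0f} GPU-hours")

# ~1,600-1,700 GPU-hours in total; the $4,600-$10,700 estimate then implies
# roughly $2.7-$6.7 per B300 GPU-hour (spot vs. on-demand).
```
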
## Evaluation Results

Evaluations use the [EleutherAI lm-eval harness](https://github.com/EleutherAI/lm-evaluation-harness) v0.4.11 and are zero-shot unless noted otherwise.
The model is loaded via a custom `CloverLMHFLM` wrapper in BF16 with Quartet II kernels.

### Compact Zero-Shot Suite

| Task | Metric | CloverLM (590k) | OPT-175B | GPT-3 175B |
|---|---|---:|---:|---:|
| ARC-Challenge | acc | **46.3** | 41.2 | – |
| ARC-Challenge | acc_mutual_info | 50.9 | – | **51.4** |
| ARC-Easy | acc | **80.0** | 75.1 | – |
| ARC-Easy | acc_mutual_info | **72.4** | – | 68.8 |
| HellaSwag | acc_norm | 71.7 | **78.3** | **78.9** |
| PIQA | acc_norm | 80.6 | **81.2** | 81.0 |
| **Avg (OPT-style)** | | **69.6** | 69.0 | – |
| **Avg (GPT-3-style)** | | 68.9 | – | **70.0** |

**OPT-style average** = mean(ARC-C `acc`, ARC-E `acc`, HellaSwag `acc_norm`, PIQA `acc_norm`).
**GPT-3-style average** = mean(ARC-C `acc_mutual_info`, ARC-E `acc_mutual_info`, HellaSwag `acc_norm`, PIQA `acc_norm`).

OPT-175B baselines are taken from the [BigScience evaluation repository](https://github.com/bigscience-workshop/bigscience/blob/master/evaluation/results/tr11/opt/bslmeval.json).

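
For reference, both suite averages can be recomputed directly from the per-task accuracies in the full lm-eval output further below (a small standalone snippet, not part of the evaluation setup):

```python
# Recompute the two suite averages from the lm-eval accuracies reported below.
scores = {
    ("arc_challenge", "acc"): 0.4625,
    ("arc_challenge", "acc_mutual_info"): 0.5094,
    ("arc_easy", "acc"): 0.7997,
    ("arc_easy", "acc_mutual_info"): 0.7239,
    ("hellaswag", "acc_norm"): 0.7167,
    ("piqa", "acc_norm"): 0.8058,
}

def suite_avg(arc_metric):
    keys = [("arc_challenge", arc_metric), ("arc_easy", arc_metric),
            ("hellaswag", "acc_norm"), ("piqa", "acc_norm")]
    return 100 * sum(scores[k] for k in keys) / len(keys)

print(f"OPT-style average:   {suite_avg('acc'):.1f}")              # 69.6
print(f"GPT-3-style average: {suite_avg('acc_mutual_info'):.1f}")  # 68.9
```
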
### Extended Benchmarks (590k checkpoint)

| Task | Metric | CloverLM | GPT-3 175B |
|---|---|---:|---:|
| Wikitext | bits per byte ↓ | 0.723 | – |
| LAMBADA (OpenAI) | acc ↑ | 61.1 | **76.2** |
| NQ-Open | exact match ↑ | 7.8 | **14.6** |

### MMLU (590k checkpoint)

| Category | 0-shot | Few-shot |
|---|---:|---:|
| Humanities | 35.4 | 35.7 |
| Social Sciences | 42.1 | 47.1 |
| STEM | 37.2 | 39.0 |
| Other | 45.2 | 49.1 |
| **Overall** | 39.4 | **41.9** |
| *OPT-175B* | – | *31.8* |
| *GPT-3 175B* | – | *43.9* |

Few-shot MMLU accuracy (41.9%) substantially exceeds OPT-175B (31.8%) and approaches GPT-3 175B (43.9%).

### Full lm-eval Output (Quartet II kernels)

```
|     Tasks      |Version|Filter|n-shot|    Metric     |   |Value |   |Stderr|
|----------------|------:|------|-----:|---------------|---|-----:|---|-----:|
|arc_challenge_mi|      1|none  |     0|acc            |↑  |0.4625|±  |0.0146|
|                |       |none  |     0|acc_mutual_info|↑  |0.5094|±  |0.0146|
|                |       |none  |     0|acc_norm       |↑  |0.4923|±  |0.0146|
|arc_easy_mi     |      1|none  |     0|acc            |↑  |0.7997|±  |0.0082|
|                |       |none  |     0|acc_mutual_info|↑  |0.7239|±  |0.0092|
|                |       |none  |     0|acc_norm       |↑  |0.7731|±  |0.0086|
|hellaswag       |      1|none  |     0|acc            |↑  |0.5392|±  |0.0050|
|                |       |none  |     0|acc_norm       |↑  |0.7167|±  |0.0045|
|piqa            |      1|none  |     0|acc            |↑  |0.7922|±  |0.0095|
|                |       |none  |     0|acc_norm       |↑  |0.8058|±  |0.0092|
```

## Usage

### Quick Start

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "daslab-testing/CloverLM",
    trust_remote_code=True,
    torch_dtype="bfloat16",
)
tokenizer = AutoTokenizer.from_pretrained(
    "daslab-testing/CloverLM",
    trust_remote_code=True,
)

input_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids
output = model.generate(input_ids.to(model.device), max_new_tokens=32)
print(tokenizer.decode(output[0]))
```

### Running Evaluations

See the [`lm_eval/`](lm_eval/) directory for the full evaluation setup.

```bash
cd lm_eval
uv sync
source .venv/bin/activate

accelerate launch eval.py \
    --model cloverlm \
    --model_args "pretrained=daslab-testing/CloverLM,dtype=bfloat16,quartet_2_impl=quartet2,attn_backend=pytorch" \
    --tasks "arc_easy_mi,arc_challenge_mi,hellaswag,piqa" \
    --num_fewshot 0 \
    --include_path ./ \
    --trust_remote_code \
    --confirm_run_unsafe_code \
    --batch_size auto
```

Use `quartet_2_impl=pseudoquant` on non-Blackwell GPUs (uses Triton-based FP4 emulation).
Attention backend options: `pytorch` (default), `flash2`, `flash3`, `flash4`.

### Dependencies

- Python ≥ 3.11
- PyTorch 2.10+ with CUDA 13.0
- `transformers ≥ 5.3.0`
- `tokenmonster ≥ 1.1.12`
- [Quartet II kernels](https://github.com/IST-DASLab/Quartet-II) (for native FP4; `pseudoquant` mode works without them)

## Architecture Details

CloverLM is a decoder-only Transformer loosely following the OLMo2 design.
Each block applies multi-head self-attention (with grouped-query attention at ratio 4) followed by a squared-ReLU MLP, both with post-sublayer RMSNorm and residual connections.
Query and key projections use RoPE and are sphere-normalized before scaling.
All dense linear layers (Q, K, V, O projections and MLP layers) use Quartet II NVFP4 quantization during both training and inference.
Embeddings, layer norms, and the output head remain in BF16.

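
For illustration, here is a minimal PyTorch sketch of one such block. It mirrors the description above (GQA at ratio 4, normalized and RoPE-rotated queries/keys, squared-ReLU MLP, post-sublayer RMSNorm) but uses plain full-precision linears and PyTorch SDPA rather than the Quartet II NVFP4 layers; the MLP width, the normalize-then-rotate order, and all names are illustrative assumptions, not taken from the released modeling code.

```python
# Illustrative single-block sketch of the layout described above.
# Not the released modeling code: plain full-precision linears (no NVFP4),
# and the MLP width (4 * d_model) is an assumption.
import torch
import torch.nn as nn
import torch.nn.functional as F

D_MODEL, N_HEADS, N_KV_HEADS, D_HEAD = 3584, 28, 7, 128

def rope(x):
    """Rotary position embedding over (batch, heads, seq, d_head)."""
    seq, d = x.shape[-2], x.shape[-1]
    inv_freq = 1.0 / (10000.0 ** (torch.arange(0, d, 2, device=x.device) / d))
    ang = torch.arange(seq, device=x.device)[:, None] * inv_freq[None, :]
    cos, sin = ang.cos(), ang.sin()
    x1, x2 = x[..., : d // 2], x[..., d // 2 :]
    return torch.cat([x1 * cos - x2 * sin, x2 * cos + x1 * sin], dim=-1)

class Block(nn.Module):
    def __init__(self):
        super().__init__()
        self.q = nn.Linear(D_MODEL, N_HEADS * D_HEAD, bias=False)
        self.k = nn.Linear(D_MODEL, N_KV_HEADS * D_HEAD, bias=False)
        self.v = nn.Linear(D_MODEL, N_KV_HEADS * D_HEAD, bias=False)
        self.o = nn.Linear(N_HEADS * D_HEAD, D_MODEL, bias=False)
        self.up = nn.Linear(D_MODEL, 4 * D_MODEL, bias=False)    # width assumed
        self.down = nn.Linear(4 * D_MODEL, D_MODEL, bias=False)
        self.attn_norm = nn.RMSNorm(D_MODEL)  # applied to the sublayer output (post-norm)
        self.mlp_norm = nn.RMSNorm(D_MODEL)

    def forward(self, x):
        b, t, _ = x.shape
        q = self.q(x).view(b, t, N_HEADS, D_HEAD).transpose(1, 2)
        k = self.k(x).view(b, t, N_KV_HEADS, D_HEAD).transpose(1, 2)
        v = self.v(x).view(b, t, N_KV_HEADS, D_HEAD).transpose(1, 2)
        # Unit-normalize queries/keys along d_head ("sphere" normalization), then apply RoPE.
        q, k = rope(F.normalize(q, dim=-1)), rope(F.normalize(k, dim=-1))
        # Grouped-query attention: each KV head serves N_HEADS // N_KV_HEADS query heads.
        k = k.repeat_interleave(N_HEADS // N_KV_HEADS, dim=1)
        v = v.repeat_interleave(N_HEADS // N_KV_HEADS, dim=1)
        a = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        a = a.transpose(1, 2).reshape(b, t, N_HEADS * D_HEAD)
        x = x + self.attn_norm(self.o(a))                          # post-attention RMSNorm
        x = x + self.mlp_norm(self.down(F.relu(self.up(x)) ** 2))  # squared-ReLU MLP, post-MLP RMSNorm
        return x

x = torch.randn(2, 16, D_MODEL)
print(Block()(x).shape)  # torch.Size([2, 16, 3584])
```
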
The model uses 264 weight tensors totaling ~4.14 B parameters.

## Limitations

- **Short context**: Trained with a 1,024-token context window. Performance on long-context or open-ended generation tasks may be limited.
- **English only**: The TokenMonster vocabulary and ClimbMix training data are English-centric.
- **No instruction tuning**: This is a base pretrained model, not fine-tuned for instruction following or chat.
- **Contamination risk**: ClimbMix optimizes mixture weights against benchmark scores, and the upstream datasets (Nemotron-CC, SmolLM-Corpus) do not investigate benchmark contamination. Strong results should be interpreted with caution.
- **Generative benchmarks**: The model is notably weaker on open-ended generation tasks (LAMBADA, NQ-Open) compared to the 175B baselines, reflecting the scale gap on tasks that require deeper knowledge recall.

## Citation

```bibtex
@article{cloverlm2026,
  title  = {Speedrunning GPT3: Pretraining an OPT-175B-Quality Model Cheaply
            by Leveraging Native NVFP4},
  author = {Erik Schultheis and Matin Ansaripour and Andrei Panferov and
            Georgios Vlassis and Dan Alistarh},
  year   = {2026},
}
```