GDN-2 370M (FineWeb-Edu 100B)
A pure-recurrent linear-attention language model trained from scratch on FineWeb-Edu. Architecture: Gated DeltaNet 2 (GDN-2), the recurrence of Gated DeltaNet with two channel-wise gates.
| Field | Value |
|---|---|
| Status (latest) | 🟡 In-progress pretraining; see the Live training status section below |
| Architecture | GDN-2 (pure recurrent, no sliding-window attention) |
| Parameters | 370 M |
| Training data | FineWeb-Edu sample/100BT (≈ 100 B English tokens, education-focused web text) |
| Tokenizer | TinyLlama v1.1 (vocab = 32 000) |
| Context length | 4 096 (training) |
| Hardware | 8 × NVIDIA H200 143 GB (DDP, distributed data parallel) |
| License | Apache-2.0 |
| Trained by | LLM-OS-Models · code at gyunggyung/long-gdn |
This repository publishes the checkpoints produced by the campaign described in
docs/LMR_FULL_GUIDE_KO.md.
A new checkpoint is uploaded roughly every 5 B trained tokens.
1. What is GDN-2?
GDN-2 (Gated DeltaNet 2) is a pure-recurrent token mixer: there is no softmax attention, no sliding-window attention, and no Transformer block in the critical path. Every layer is a learned linear-recurrent state update.
Compared to its predecessor Gated DeltaNet (KDA), GDN-2 replaces the single scalar write/erase gate with two channel-wise gates:
- $b_t \in \mathbb{R}^{d_k}$: channel-wise erase gate (replaces KDA's scalar $\beta_t$)
- $w_t \in \mathbb{R}^{d_v}$: channel-wise write gate (new in GDN-2)
- $g_t$: output silu-gate (same as Gated DeltaNet)
Setting $b_t = \beta_t\mathbf{1}$ and $w_t = \beta_t\mathbf{1}$ recovers KDA exactly, so GDN-2 is a strict generalisation.
Why this matters for long-context: the recurrent state $S_t$ is $O(d_k \cdot d_v)$ per head, which is constant in sequence length. Training and inference scale linearly with tokens, not quadratically like softmax attention.
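The reduction to the scalar-gated case and the fixed state size can be illustrated with a small NumPy sketch. This card does not spell out the GDN-2 update equation, so the form below (a delta rule whose erase term is scaled per key channel by $b_t$ and whose write term is scaled per value channel by $w_t$, with a scalar decay `alpha`) is an assumption made for illustration. The function names are hypothetical and this is not the upstream Triton kernel.

```python
import numpy as np

def delta_step_scalar(S, k, v, beta, alpha=1.0):
    """Scalar-gated delta rule: S <- alpha*S (I - beta k k^T) + beta v k^T."""
    S = alpha * S
    return S - beta * np.outer(S @ k, k) + beta * np.outer(v, k)

def delta_step_channelwise(S, k, v, b, w, alpha=1.0):
    """ASSUMED channel-wise variant (illustration only).

    b has shape (d_k,) and scales the erase term per key channel;
    w has shape (d_v,) and scales the write term per value channel.
    """
    S = alpha * S
    return S - np.outer(S @ k, b * k) + np.outer(w * v, k)

rng = np.random.default_rng(0)
d_k, d_v, n_tokens = 4, 3, 6
S_scalar = np.zeros((d_v, d_k))
S_channel = np.zeros((d_v, d_k))

for _ in range(n_tokens):
    k = rng.standard_normal(d_k)
    k /= np.linalg.norm(k)          # unit-norm key
    v = rng.standard_normal(d_v)
    beta = rng.uniform()
    S_scalar = delta_step_scalar(S_scalar, k, v, beta, alpha=0.9)
    # Both channel-wise gates set to beta * 1 should match the scalar rule.
    S_channel = delta_step_channelwise(
        S_channel, k, v, beta * np.ones(d_k), beta * np.ones(d_v), alpha=0.9
    )

reduces_to_scalar = np.allclose(S_scalar, S_channel)
state_shape = S_channel.shape       # stays (d_v, d_k) however many tokens are seen
print(reduces_to_scalar, state_shape)
```

Under this assumed form, the state keeps the shape `(d_v, d_k)` for any number of tokens, which is the property the paragraph above relies on.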
2. Model configuration
name = "gdn2_370M"
block_size = 4096 # training context length
vocab_size = 32000 # TinyLlama tokenizer
n_layer = 16
n_head = 8
n_embd = 1024
head_dim = 128
intermediate_size = 2048 # LLaMAMLP expansion
gdn2_per_layer = 1 # 1 = pure recurrent, no SWA fallback
local_window = 2048 # unused when gdn2_per_layer=1
rotary_percentage = 1.0
norm = FusedRMSNorm (eps=1e-5)
mlp = LLaMAMLP
parallel_residual = False
mamba_init = True
The recurrent state per head is $d_k \times d_v = 128 \times 128 = 16{,}384$ floats. Across 8 heads and 16 layers this is 2.1 M recurrent state floats, designed to match Mamba-370M's recurrent-state budget.
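The state-size arithmetic can be reproduced directly from the config values. The memory figures at the end assume 4 bytes (fp32) or 2 bytes (bf16) per state float; the precision the kernels actually use for the state is not stated in this card.

```python
# Values taken from the gdn2_370M config above.
n_layer, n_head, head_dim = 16, 8, 128

state_per_head = head_dim * head_dim               # d_k x d_v
state_total = state_per_head * n_head * n_layer    # all heads, all layers

# Per-sequence memory under two ASSUMED storage precisions.
state_mib_fp32 = state_total * 4 / 2**20
state_mib_bf16 = state_total * 2 / 2**20

print(f"{state_per_head:,} floats per head")
print(f"{state_total:,} floats in total")
print(f"{state_mib_fp32:.1f} MiB (fp32), {state_mib_bf16:.1f} MiB (bf16)")
```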
3. Training recipe
| Hyperparameter | Value |
|---|---|
| Corpus | FineWeb-Edu sample/100BT |
| Target tokens | 100 000 000 000 (100 B) |
| Optimizer | AdamW, β = (0.9, 0.95), weight_decay = 0.1 |
| Gradient clip | 1.0 |
| Learning rate | 4 × 10⁻⁴ (peak), cosine schedule |
| Warmup | 1 × 10⁹ tokens |
| Micro-batch per GPU | 8 sequences × 4 096 tokens |
| Gradient accumulation | 16 |
| Data-parallel workers | 8 |
| Global batch | 1 024 sequences = 4 194 304 tokens / step |
| Save interval | every 1 200 steps ≈ 5 B tokens |
| Eval interval | every 960 steps ≈ 4 B tokens |
| Eval iterations | 15 batches × 4 seq lengths (4 K / 8 K / 12 K / 16 K) |
| Eval token budget | ≈ 1.97 M tokens per validation pass |
Measured throughput on 8 × H200: 72.7 K tokens / sec / GPU (≈ 580 K tokens / sec aggregate). Wall-clock estimate end-to-end: ≈ 41 hours.
The exact launch script is checked in at
off/GatedDeltaNet-2/scripts/pretrain_gdn2_370m_fineweb_edu_100bt.sh.
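The batch and interval figures in the table follow from four numbers. The sketch below recomputes them; the variable names are illustrative and are not taken from the launch script.

```python
# Values from the training recipe table.
micro_batch = 8        # sequences per GPU per micro-step
grad_accum = 16        # gradient accumulation steps
workers = 8            # data-parallel GPUs
block_size = 4096      # tokens per sequence

global_batch_sequences = micro_batch * grad_accum * workers
tokens_per_step = global_batch_sequences * block_size

def tokens_at(step: int) -> int:
    """Total tokens consumed after `step` optimizer steps."""
    return step * tokens_per_step

save_tokens = tokens_at(1_200)     # save interval
eval_tokens = tokens_at(960)       # eval interval
final_tokens = tokens_at(24_000)   # final milestone

print(f"{global_batch_sequences} sequences, {tokens_per_step:,} tokens per step")
print(f"save every {save_tokens / 1e9:.2f} B, eval every {eval_tokens / 1e9:.2f} B")
print(f"24 000 steps = {final_tokens / 1e9:.2f} B tokens")
```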
4. Live training status
This model is mid-training. New checkpoints appear here every ~5 B tokens.
The latest live status is in
docs/OVERNIGHT_LIVE_STATUS_KO.md.
| Milestone | Step | Tokens | Status |
|---|---|---|---|
| First val pass (sanity, after infinite-loop fix) | 960 | 4.0 B | ✅ val_loss 2.85 / 2.83 / 2.83 / 2.84 (4 K/8 K/12 K/16 K), 96.7 s |
| First checkpoint + HF upload | 1 200 | 5.0 B | ✅ 2026-07-04 03:17 KST |
| Second checkpoint | 2 400 | 10 B | ⏳ pending |
| Mid-training | 6 000 | 25 B | ⏳ pending |
| Late-training | 12 000 | 50 B | ⏳ pending |
| Final | 24 000 | 100 B | ⏳ target 2026-07-05 ~05:00 KST |
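For readers who prefer perplexity, the first validation losses can be converted as below. This assumes the reported `val_loss` is the mean per-token cross-entropy in nats, which is the usual convention but is not confirmed in this card.

```python
import math

# val_loss at step 960, keyed by evaluation sequence length (from the table above).
val_loss = {"4K": 2.85, "8K": 2.83, "12K": 2.83, "16K": 2.84}

# ASSUMPTION: loss is mean cross-entropy in nats per token, so perplexity = exp(loss).
perplexity = {length: math.exp(loss) for length, loss in val_loss.items()}

for length, ppl in perplexity.items():
    print(f"{length:>3}: loss {val_loss[length]:.2f} -> perplexity {ppl:.1f}")
```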
Checkpoint naming gotcha (will be cleaned up post-run):
the milestone file checkpoint-1B-model-ckpt.pth actually contains the 5 B-token
state. The "1B" suffix is the milestone index (first 5 B milestone), not the
token count. Subsequent milestones will be named checkpoint-2B-…, checkpoint-3B-…,
etc. The README will be updated to clarify after the run completes.
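Until the files are renamed, a small helper can translate a milestone filename into its token count. This is a minimal sketch: the function name is hypothetical, and it only parses the `checkpoint-<N>B-` prefix described above, making no assumption about the rest of the filename.

```python
import re

STEPS_PER_MILESTONE = 1_200      # save interval from the training recipe
TOKENS_PER_STEP = 4_194_304      # global batch in tokens

def milestone_tokens(filename: str) -> int:
    """Return the trained-token count for a milestone checkpoint filename.

    The number before "B" is the milestone index, not billions of tokens.
    """
    match = re.match(r"checkpoint-(\d+)B-", filename)
    if match is None:
        raise ValueError(f"not a milestone checkpoint name: {filename!r}")
    index = int(match.group(1))
    return index * STEPS_PER_MILESTONE * TOKENS_PER_STEP

first = milestone_tokens("checkpoint-1B-model-ckpt.pth")
print(f"{first:,} tokens (about {first / 1e9:.1f} B)")
```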
5. How to load
The checkpoint is a raw PyTorch state dict in the layout used by
lit_gpt.model.GPT configured with gdn2_370M. The repo also mirrors the
training code (the lit_gpt/ package from off/GatedDeltaNet-2/).
import torch
from lit_gpt.config import config_from_name
from lit_gpt.model import GPT
ckpt = torch.load("checkpoint-1B-model-ckpt.pth", map_location="cpu")
# the checkpoint may wrap the weights under a top-level "model" key
state = ckpt["model"] if "model" in ckpt else ckpt
cfg = config_from_name("gdn2_370M")
model = GPT(cfg)
model.load_state_dict(state, strict=True)
model.eval()
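If `strict=True` loading fails because of unexpected key prefixes, one possible cause is that the weights were saved from a wrapped module: PyTorch's `DistributedDataParallel` prefixes parameter names with `module.` and `torch.compile` prefixes them with `_orig_mod.`. Whether this checkpoint carries such prefixes is not stated here, so treat the helper below as an optional workaround. The function name and the example keys are hypothetical.

```python
def strip_wrapper_prefixes(state, prefixes=("_orig_mod.", "module.")):
    """Return a copy of a state dict with known wrapper prefixes removed from keys."""
    cleaned = {}
    for key, value in state.items():
        stripped = True
        while stripped:                      # handle nested wrappers in any order
            stripped = False
            for prefix in prefixes:
                if key.startswith(prefix):
                    key = key[len(prefix):]
                    stripped = True
        cleaned[key] = value
    return cleaned

# Hypothetical keys, with placeholder values instead of tensors.
example = {"module._orig_mod.lm_head.weight": 1, "transformer.wte.weight": 2}
cleaned_example = strip_wrapper_prefixes(example)
print(sorted(cleaned_example))
```

With the loading snippet above, this would be applied as `state = strip_wrapper_prefixes(state)` before calling `model.load_state_dict`.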
To run a quick continuation / generation, see the
off/GatedDeltaNet-2/
subproject; the same lit_gpt package is used for both training and inference.
6. Intended use
This model is released for research purposes only.
Appropriate uses:
- Studying the GDN-2 recurrence and comparing against other linear / recurrent architectures (Mamba, RWKV, Gated DeltaNet, RetNet, Lightning Attention, …).
- Long-context retrieval and associative-recall experiments where the $O(N)$ training cost matters.
- Component-level ablations (gate design, head count, recurrent-state size).
Inappropriate uses:
- Production deployment. The model is small (370 M), still mid-training, and has not been instruction-tuned.
- Downstream safety-critical tasks.
- Anything requiring benchmark numbers we have not yet published. Wait for post-training evaluation.
7. Limitations (as of latest checkpoint)
- Mid-training. Loss is still decreasing; downstream metrics will move.
- Scale. 370 M parameters and a 4 K training context are small by modern standards. We chose this scale deliberately to match Mamba-370M and to fit a 36-hour campaign budget.
- No instruction tuning. Outputs are raw next-token completions.
- English-only training data (FineWeb-Edu is English educational web text).
- No benchmark numbers yet. HellaSwag / ARC / MMLU / RULER will be run on the final 100 B checkpoint and added here.
8. Evaluation plan (post-training)
Once the 100 B-token checkpoint lands we will run:
| Suite | Length | Source |
|---|---|---|
| HellaSwag, ARC-e, ARC-c, PIQA | standard | lm-evaluation-harness |
| MMLU (5-shot) | standard | lm-evaluation-harness |
| RULER (niah, mqar, ct, cwe) | 4 K / 8 K / 16 K | custom loader |
| LongBench (retrieval subset) | up to 32 K | custom loader |
| BABILong (qa1–qa5) | up to 32 K | custom loader |
Results will be appended to this card and to
docs/LMR_PUBLIC_BENCHMARK_SUMMARY_KO.md.
9. Citation
The GDN-2 architecture was introduced by NVIDIA in 2026. Please cite the upstream GDN-2 paper for the architecture itself.
For this specific checkpoint:
@misc{gdn2_370m_fineweb_edu_100b,
title = {GDN-2 370M trained on FineWeb-Edu 100B tokens},
author = {LLM-OS-Models},
year = {2026},
url = {https://huggingface.co/LLM-OS-Models/gdn2-370m-fineweb-edu-100b},
note = {Work in progress; checkpoints published every 5B tokens}
}
10. Acknowledgements
- The GDN-2 architecture and Triton kernels are from the Gated DeltaNet 2 authors (NVIDIA). This repo only trains their architecture.
- Training data: HuggingFaceFW/fineweb-edu (sample/100BT slice).
- Compute: 8 × NVIDIA H200 143 GB.
- Tracking + live status infrastructure: the long-gdn campaign harness.