GDN-2 370M (FineWeb-Edu 100B)

A pure-recurrent linear-attention language model trained from scratch on FineWeb-Edu. Architecture: Gated DeltaNet 2 (GDN-2) — the recurrence of Gated DeltaNet with two channel-wise gates.


Status (latest)	🟡 Pretraining in progress; see section 4, Live training status
Architecture	GDN-2 (pure recurrent, no sliding-window attention)
Parameters	370 M
Training data	FineWeb-Edu sample/100BT (≈100 B English tokens, web text filtered for educational quality)
Tokenizer	TinyLlama v1.1 (vocab = 32 000)
Context length	4 096 (training)
Hardware	8 × NVIDIA H200 143 GB (data parallel across all 8 GPUs)
License	Apache-2.0
Trained by	LLM-OS-Models · code at gyunggyung/long-gdn

This repository publishes the checkpoints produced by the campaign described in docs/LMR_FULL_GUIDE_KO.md. A new checkpoint is uploaded roughly every 5 B trained tokens.

1. What is GDN-2?

GDN-2 (Gated DeltaNet 2) is a pure-recurrent token mixer: there is no softmax attention, no sliding-window attention, and no attention-based Transformer block in the critical path. The token mixer in every layer is a learned linear-recurrent state update (each layer pairs it with a LLaMA-style MLP, see section 2).

Compared to its predecessor Gated DeltaNet (KDA), GDN-2 replaces the single scalar write/erase gate $\beta_t$ with two channel-wise gates, $b_t$ and $w_t$:

$S_t \;=\; \bigl(I - k_t (b_t \odot k_t)^{\!\top}\bigr)\,\mathrm{Diag}(\exp(g_t))\,S_{t-1} \;+\; k_t (w_t \odot v_t)^{\!\top}$

$b_t \in \mathbb{R}^{d_k}$: channel-wise erase gate (replaces KDA's scalar $\beta_t$)
$w_t \in \mathbb{R}^{d_v}$: channel-wise write gate (new in GDN-2)
$g_t \in \mathbb{R}^{d_k}$: channel-wise log-decay (forget) gate, applied to the previous state as $\mathrm{Diag}(\exp(g_t))$

Setting $b_t = \beta_t\mathbf{1}$ and $w_t = \beta_t\mathbf{1}$ recovers KDA exactly, so GDN-2 is a strict generalisation.
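The reduction to the scalar-gate case can be checked numerically. The following is a minimal dense NumPy sketch of one state update for a single head, written directly from the equation above. It is not the Triton kernel used in training; the function names `gdn2_step` and `scalar_beta_step` and the tiny dimensions are illustrative assumptions.

```python
import numpy as np

def gdn2_step(S, k, v, g, b, w):
    """One GDN-2 state update for a single head (dense reference, not the kernel).

    S: (d_k, d_v) previous state; k, g, b: (d_k,); v, w: (d_v,).
    """
    d_k = k.shape[0]
    erase = np.eye(d_k) - np.outer(k, b * k)      # I - k (b ⊙ k)^T
    decayed = np.exp(g)[:, None] * S              # Diag(exp(g)) S
    return erase @ decayed + np.outer(k, w * v)   # + k (w ⊙ v)^T

def scalar_beta_step(S, k, v, g, beta):
    """The same update with one scalar beta used for both erase and write."""
    d_k = k.shape[0]
    erase = np.eye(d_k) - beta * np.outer(k, k)
    return erase @ (np.exp(g)[:, None] * S) + beta * np.outer(k, v)

rng = np.random.default_rng(0)
d_k, d_v = 4, 3                      # toy sizes; the model uses 128 x 128
S = rng.standard_normal((d_k, d_v))
k = rng.standard_normal(d_k)
k /= np.linalg.norm(k)               # unit-norm key (an assumption of this sketch)
v = rng.standard_normal(d_v)
g = -rng.random(d_k)                 # non-positive log-decay, so exp(g) lies in (0, 1]
beta = 0.7

general = gdn2_step(S, k, v, g, beta * np.ones(d_k), beta * np.ones(d_v))
special = scalar_beta_step(S, k, v, g, beta)
print(np.allclose(general, special))  # prints True
```

Both functions touch only the fixed-size state `S`, so the cost of one step does not depend on how many tokens came before it.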

Why this matters for long-context: the recurrent state $S_t$ is $O(d_k \cdot d_v)$ per head — constant in sequence length. Training and inference scale linearly with tokens, not quadratically like softmax attention.

2. Model configuration

name              = "gdn2_370M"
block_size        = 4096          # training context length
vocab_size        = 32000         # TinyLlama tokenizer
n_layer           = 16
n_head            = 8
n_embd            = 1024
head_dim          = 128
intermediate_size = 2048          # LLaMAMLP expansion
gdn2_per_layer    = 1             # 1 = pure recurrent, no SWA fallback
local_window      = 2048          # unused when gdn2_per_layer=1
rotary_percentage = 1.0
norm              = FusedRMSNorm (eps=1e-5)
mlp               = LLaMAMLP
parallel_residual = False
mamba_init        = True

The recurrent state per head is $d_k \times d_v = 128 \times 128 = 16{,}384$ floats. Across 8 heads and 16 layers this is 2.1 M recurrent state floats, designed to match Mamba-370M's recurrent-state budget.
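The state-size arithmetic can be reproduced from the configuration values. The sketch below also compares it with the key/value cache of a hypothetical softmax-attention model of the same width and depth. That baseline is an assumption made purely for illustration and does not describe any model in this repository.

```python
n_layer, n_head, head_dim = 16, 8, 128
n_embd = 1024

state_per_head = head_dim * head_dim               # d_k * d_v
state_total = state_per_head * n_head * n_layer    # fixed, independent of sequence length

def kv_cache_floats(seq_len):
    # Hypothetical softmax baseline: one key and one value vector
    # of width n_embd per token per layer.
    return 2 * n_layer * n_embd * seq_len

print(state_per_head)   # 16384
print(state_total)      # 2097152
for seq_len in (4096, 16384):
    # Ratio of the growing KV cache to the constant recurrent state.
    print(seq_len, kv_cache_floats(seq_len) / state_total)
```

Under these assumptions the cache of the softmax baseline is 64 times the recurrent state at 4 K tokens and 256 times at 16 K, while the GDN-2 state stays at about 2.1 M floats.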

3. Training recipe

Hyperparameter	Value
Corpus	FineWeb-Edu sample/100BT
Target tokens	100 000 000 000 (100 B)
Optimizer	AdamW, β = (0.9, 0.95), weight_decay = 0.1
Gradient clip	1.0
Learning rate	4 × 10⁻⁴ (peak), cosine schedule
Warmup	1 × 10⁹ tokens
Micro-batch × GPU	8 sequences × 4 096 tokens
Gradient accumulation	16
Data-parallel workers	8
Global batch	1 024 sequences = 4 194 304 tokens / step
Save interval	every 1 200 steps ≈ 5 B tokens
Eval interval	every 960 steps ≈ 4 B tokens
Eval iterations	15 batches × 4 seq lengths (4 K / 8 K / 12 K / 16 K)
Eval token budget	≈ 1.97 M tokens per validation pass
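The batch and interval figures in the table follow from the micro-batch, accumulation, and worker counts. A short sketch of that arithmetic, using only values from the table (the helper name `tokens_at` is illustrative):

```python
micro_batch, block_size = 8, 4096
grad_accum, workers = 16, 8

seqs_per_step = micro_batch * grad_accum * workers
tokens_per_step = seqs_per_step * block_size

def tokens_at(step):
    """Tokens consumed after `step` optimizer steps."""
    return step * tokens_per_step

target_tokens = 100_000_000_000
steps_to_target = -(-target_tokens // tokens_per_step)  # ceiling division

print(seqs_per_step, tokens_per_step)   # 1024 4194304
print(tokens_at(1200) / 1e9)            # about 5.03 (save interval, in B tokens)
print(tokens_at(960) / 1e9)             # about 4.03 (eval interval, in B tokens)
print(steps_to_target)                  # 23842
```

The exact step count for 100 B tokens is 23 842; the milestone table in section 4 rounds it to 24 000.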

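Measured throughput on 8 × H200: 72.7 K tokens / sec / GPU (≈ 580 K tokens / sec aggregate). At that rate, 100 B tokens take ≈ 172 000 s, so the end-to-end wall-clock estimate is ≈ 48 hours of pure training time, before evaluation and checkpoint overhead.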

The exact launch script is checked in at off/GatedDeltaNet-2/scripts/pretrain_gdn2_370m_fineweb_edu_100bt.sh.

4. Live training status

This model is mid-training. New checkpoints appear here every ~5 B tokens. The latest live status is in docs/OVERNIGHT_LIVE_STATUS_KO.md.

Milestone	Step	Tokens	Status
First val pass (sanity, after infinite-loop fix)	960	4.0 B	✅ val_loss 2.85 / 2.83 / 2.83 / 2.84 (4 K/8 K/12 K/16 K), 96.7 s
First checkpoint + HF upload	1 200	5.0 B	✅ 2026-07-04 03:17 KST
Second checkpoint	2 400	10 B	⏳ pending
Mid-training	6 000	25 B	⏳ pending
Late-training	12 000	50 B	⏳ pending
Final	24 000	100 B	⏳ target 2026-07-05 ~05:00 KST

Checkpoint naming gotcha (will be cleaned up post-run): the milestone file checkpoint-1B-model-ckpt.pth actually contains the 5 B-token state. The "1B" in the file name is the milestone index (the first 5 B-token milestone), not the token count. Subsequent milestones will be named checkpoint-2B-…, checkpoint-3B-…, etc. The README will be updated with clearer names after the run completes.
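Until the files are renamed, the milestone index can be converted to a step and token count mechanically. The helper below is a hypothetical sketch, not part of the repository. It assumes every milestone file keeps the `checkpoint-<index>B-` pattern and that milestone N is written at N × 1 200 steps, as the save interval in section 3 implies.

```python
import re

TOKENS_PER_STEP = 1024 * 4096     # global batch from the training recipe
SAVE_INTERVAL_STEPS = 1200        # one milestone per save interval

def milestone_info(filename):
    """Map a milestone checkpoint name to (milestone index, step, trained tokens)."""
    match = re.search(r"checkpoint-(\d+)B-", filename)
    if match is None:
        raise ValueError(f"not a milestone checkpoint name: {filename!r}")
    index = int(match.group(1))   # the "B" number is an index, not billions of tokens
    step = index * SAVE_INTERVAL_STEPS
    return index, step, step * TOKENS_PER_STEP

index, step, tokens = milestone_info("checkpoint-1B-model-ckpt.pth")
print(index, step, round(tokens / 1e9, 2))   # 1 1200 5.03
```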

5. How to load

The checkpoint is a raw PyTorch state dict in the layout used by lit_gpt.model.GPT configured with gdn2_370M. The repo also mirrors the training code (the lit_gpt/ package from off/GatedDeltaNet-2/).

import torch
from lit_gpt.config import config_from_name
from lit_gpt.model import GPT

ckpt = torch.load("checkpoint-1B-model-ckpt.pth", map_location="cpu")
# The weights sit under a top-level "model" key; fall back to a bare state dict otherwise.
state = ckpt["model"] if "model" in ckpt else ckpt

cfg = config_from_name("gdn2_370M")
model = GPT(cfg)
model.load_state_dict(state, strict=True)
model.eval()

To run a quick continuation / generation, see the off/GatedDeltaNet-2/ subproject — the same lit_gpt package is used for both training and inference.
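If `load_state_dict(..., strict=True)` rejects a checkpoint, comparing key sets usually shows why. The helper below is a hypothetical sketch in plain Python. The `_orig_mod.` prefix it strips is the one `torch.compile` adds to the keys of a wrapped module; whether these checkpoints carry it is an assumption, and the key names in the demo are toy values.

```python
def diff_state_keys(expected_keys, checkpoint_keys, strip_prefix="_orig_mod."):
    """Return (missing, unexpected) key lists after removing an optional wrapper prefix."""
    def strip(key):
        if strip_prefix and key.startswith(strip_prefix):
            return key[len(strip_prefix):]
        return key

    found = {strip(key) for key in checkpoint_keys}
    expected = set(expected_keys)
    return sorted(expected - found), sorted(found - expected)

# Toy key names for illustration only; real names come from model.state_dict().
expected = ["wte.weight", "lm_head.weight"]
checkpoint = ["_orig_mod.wte.weight", "_orig_mod.lm_head.weight", "step"]
missing, unexpected = diff_state_keys(expected, checkpoint)
print(missing, unexpected)   # [] ['step']
```

With the objects from the loading snippet, the call would be `diff_state_keys(model.state_dict().keys(), state.keys())`.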

6. Intended use

This model is released for research purposes only.

Appropriate uses:

Studying the GDN-2 recurrence and comparing against other linear / recurrent architectures (Mamba, RWKV, Gated DeltaNet, RetNet, Lightning Attention, …).
Long-context retrieval and associative-recall experiments where the $O(N)$ training cost matters.
Component-level ablations (gate design, head count, recurrent-state size).

Inappropriate uses:

Production deployment. The model is small (370 M), still mid-training, and has not been instruction-tuned.
Downstream safety-critical tasks.
Anything requiring benchmark numbers we have not yet published. Wait for post-training evaluation.

7. Limitations (as of latest checkpoint)

Mid-training. Loss is still decreasing; downstream metrics will move.
Scale. 370 M parameters and a 4 K training context — small by modern standards. We chose this scale deliberately to match Mamba-370M and to fit a 36-hour campaign budget.
No instruction tuning. Outputs are raw next-token completions.
English-only training data (FineWeb-Edu is English web text filtered for educational quality).
No benchmark numbers yet. HellaSwag / ARC / MMLU / RULER will be run on the final 100 B checkpoint and added here.

8. Evaluation plan (post-training)

Once the 100 B-token checkpoint lands we will run:

Suite	Length	Source
HellaSwag, ARC-e, ARC-c, PIQA	standard	`lm-evaluation-harness`
MMLU (5-shot)	standard	`lm-evaluation-harness`
RULER (niah, mqar, ct, cwe)	4 K / 8 K / 16 K	custom loader
LongBench (retrieval subset)	up to 32 K	custom loader
BABILong (qa1–qa5)	up to 32 K	custom loader

Results will be appended to this card and to docs/LMR_PUBLIC_BENCHMARK_SUMMARY_KO.md.

9. Citation

The GDN-2 architecture was introduced by NVIDIA in 2026. Please cite the upstream GDN-2 paper for the architecture itself.

For this specific checkpoint:

@misc{gdn2_370m_fineweb_edu_100b,
  title  = {GDN-2 370M trained on FineWeb-Edu 100B tokens},
  author = {LLM-OS-Models},
  year   = {2026},
  url    = {https://huggingface.co/LLM-OS-Models/gdn2-370m-fineweb-edu-100b},
  note   = {Work in progress; checkpoints published every 5B tokens}
}

10. Acknowledgements

The GDN-2 architecture and Triton kernels are from the Gated DeltaNet 2 authors (NVIDIA). This repo only trains their architecture.
Training data: HuggingFaceFW/fineweb-edu (sample/100BT slice).
Compute: 8 × NVIDIA H200 143 GB.
Tracking + live status infrastructure: the long-gdn campaign harness.
