GDN-2 370M (FineWeb-Edu 100B)

A pure-recurrent linear-attention language model trained from scratch on FineWeb-Edu. Architecture: Gated DeltaNet 2 (GDN-2), the recurrence of Gated DeltaNet with two channel-wise gates.

| | |
|---|---|
| Status (latest) | 🟡 In-progress pretraining; see section 4, "Live training status" |
| Architecture | GDN-2 (pure recurrent, no sliding-window attention) |
| Parameters | 370 M |
| Training data | FineWeb-Edu sample/100BT (≈ 100 B English tokens, education-focused web text) |
| Tokenizer | TinyLlama v1.1 (vocab = 32 000) |
| Context length | 4 096 (training) |
| Hardware | 8 × NVIDIA H200 143 GB (DDP, fully sharded data parallel) |
| License | Apache-2.0 |
| Trained by | LLM-OS-Models · code at gyunggyung/long-gdn |

This repository publishes the checkpoints produced by the campaign described in docs/LMR_FULL_GUIDE_KO.md. A new checkpoint is uploaded roughly every 5 B trained tokens.


1. What is GDN-2?

GDN-2 (Gated DeltaNet 2) is a pure-recurrent token mixer: there is no softmax attention, no sliding-window attention, and no Transformer block in the critical path. Every layer is a learned linear-recurrent state update.

Compared to its predecessor Gated DeltaNet (KDA), GDN-2 replaces the single scalar write/erase gate with two channel-wise gates:

$$S_t \;=\; \bigl(I - k_t (b_t \odot k_t)^{\top}\bigr)\,\mathrm{Diag}(\exp(g_t))\,S_{t-1} \;+\; k_t (w_t \odot v_t)^{\top}$$

  • $b_t \in \mathbb{R}^{d_k}$: channel-wise erase gate (replaces KDA's scalar $\beta_t$)
  • $w_t \in \mathbb{R}^{d_v}$: channel-wise write gate (new in GDN-2)
  • $g_t \in \mathbb{R}^{d_k}$: channel-wise log-decay gate; $\mathrm{Diag}(\exp(g_t))$ scales the previous state $S_{t-1}$ before the erase and write are applied

Setting $b_t = \beta_t\mathbf{1}$ and $w_t = \beta_t\mathbf{1}$ recovers KDA exactly, so GDN-2 is a strict generalisation.
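To see the reduction, substitute the two constant gates into the update rule. The scalar $\beta_t$ factors out of both outer products, leaving the scalar-gated delta rule:

```latex
\begin{aligned}
k_t (b_t \odot k_t)^{\top} &= k_t (\beta_t k_t)^{\top} = \beta_t\, k_t k_t^{\top}, \\
k_t (w_t \odot v_t)^{\top} &= k_t (\beta_t v_t)^{\top} = \beta_t\, k_t v_t^{\top}, \\
S_t &= \bigl(I - \beta_t\, k_t k_t^{\top}\bigr)\,\mathrm{Diag}(\exp(g_t))\,S_{t-1} + \beta_t\, k_t v_t^{\top}.
\end{aligned}
```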

Why this matters for long-context: the recurrent state $S_t$ is $O(d_k \cdot d_v)$ per head, constant in sequence length. Training and inference scale linearly with tokens, not quadratically like softmax attention.
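As a rough illustration of the update rule and of the fixed-size state, here is a dependency-free sketch of one recurrence step for a single head. It is not the repository's implementation (which relies on fused Triton kernels); the function name `gdn2_step` and the toy inputs are made up for this example.

```python
import math


def gdn2_step(S, k, v, b, w, g):
    """One GDN-2 state update for a single head, on plain nested lists.

    S is d_k x d_v; k, b, g have length d_k; v, w have length d_v.
    """
    d_k, d_v = len(S), len(S[0])
    # 1) Decay: Diag(exp(g)) scales row i of the previous state by exp(g[i]).
    D = [[math.exp(g[i]) * S[i][j] for j in range(d_v)] for i in range(d_k)]
    # 2) Erase: (I - k (b*k)^T) D = D - k * ((b*k)^T D); the product is a row vector.
    bk = [b[i] * k[i] for i in range(d_k)]
    proj = [sum(bk[i] * D[i][j] for i in range(d_k)) for j in range(d_v)]
    # 3) Write: add the outer product k (w*v)^T.
    wv = [w[j] * v[j] for j in range(d_v)]
    return [[D[i][j] - k[i] * proj[j] + k[i] * wv[j] for j in range(d_v)]
            for i in range(d_k)]


# Toy run: 5 tokens, d_k = 2, d_v = 3. The state never grows with t.
d_k, d_v = 2, 3
S = [[0.0] * d_v for _ in range(d_k)]
for t in range(5):
    k = [1.0, 0.0] if t % 2 == 0 else [0.0, 1.0]  # alternate two unit keys
    v = [float(t), 1.0, -1.0]
    # Gates fully open, no decay: each write overwrites the slot addressed by k.
    S = gdn2_step(S, k, v, b=[1.0, 1.0], w=[1.0, 1.0, 1.0], g=[0.0, 0.0])

print(S)  # [[4.0, 1.0, -1.0], [3.0, 1.0, -1.0]]
```

With unit keys and fully open gates, each step replaces the row addressed by `k` with the new value, which is the associative-memory behaviour the delta rule is built around.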


2. Model configuration

name              = "gdn2_370M"
block_size        = 4096          # training context length
vocab_size        = 32000         # TinyLlama tokenizer
n_layer           = 16
n_head            = 8
n_embd            = 1024
head_dim          = 128
intermediate_size = 2048          # LLaMAMLP expansion
gdn2_per_layer    = 1             # 1 = pure recurrent, no SWA fallback
local_window      = 2048          # unused when gdn2_per_layer=1
rotary_percentage = 1.0
norm              = FusedRMSNorm (eps=1e-5)
mlp               = LLaMAMLP
parallel_residual = False
mamba_init        = True

The recurrent state per head is $d_k \times d_v = 128 \times 128 = 16{,}384$ floats. Across 8 heads and 16 layers this is 2.1 M recurrent state floats, designed to match Mamba-370M's recurrent-state budget.
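The budget above can be reproduced directly from the configuration values. The byte figure at the end assumes the state is held in fp32, which this card does not specify.

```python
# Values copied from the configuration listing.
n_layer, n_head, head_dim, n_embd = 16, 8, 128, 1024
d_k = d_v = head_dim  # the card gives d_k x d_v = 128 x 128

# The heads tile the embedding width exactly.
assert n_head * head_dim == n_embd

state_per_head = d_k * d_v                   # 16,384 values
state_per_layer = n_head * state_per_head    # 131,072 values
state_total = n_layer * state_per_layer      # 2,097,152 values

# ASSUMPTION: 4 bytes per value (fp32 state); the card does not state the dtype.
state_mib = state_total * 4 / 2**20

print(f"{state_total:,} state values, {state_mib:.0f} MiB per sequence at fp32")
```

Unlike a softmax-attention KV cache, this total is independent of how many tokens have been processed.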


3. Training recipe

| Hyperparameter | Value |
|---|---|
| Corpus | FineWeb-Edu sample/100BT |
| Target tokens | 100 000 000 000 (100 B) |
| Optimizer | AdamW, β = (0.9, 0.95), weight_decay = 0.1 |
| Gradient clip | 1.0 |
| Learning rate | 4 × 10⁻⁴ (peak), cosine schedule |
| Warmup | 1 × 10⁹ tokens |
| Micro-batch per GPU | 8 sequences × 4 096 tokens |
| Gradient accumulation | 16 |
| Data-parallel workers | 8 |
| Global batch | 1 024 sequences = 4 194 304 tokens / step |
| Save interval | every 1 200 steps ≈ 5 B tokens |
| Eval interval | every 960 steps ≈ 4 B tokens |
| Eval iterations | 15 batches × 4 sequence lengths (4 K / 8 K / 12 K / 16 K) |
| Eval token budget | ≈ 1.97 M tokens per validation pass |
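The batch and interval figures in the recipe follow from four numbers, as this short check shows. The last line is an inference rather than something the card states: it assumes every eval batch carries the same 32 768 tokens as a training micro-batch, which happens to reproduce the ≈ 1.97 M figure.

```python
# Figures copied from the recipe.
MICRO_BATCH = 8      # sequences per GPU per forward/backward pass
BLOCK_SIZE = 4096    # tokens per sequence
GRAD_ACCUM = 16
DP_WORKERS = 8

seqs_per_step = MICRO_BATCH * GRAD_ACCUM * DP_WORKERS   # 1,024 sequences
tokens_per_step = seqs_per_step * BLOCK_SIZE            # 4,194,304 tokens


def tokens_at(step: int) -> int:
    """Tokens consumed after `step` optimizer steps."""
    return step * tokens_per_step


print(f"save  @  1,200 steps: {tokens_at(1_200) / 1e9:.2f} B tokens")   # 5.03
print(f"eval  @    960 steps: {tokens_at(960) / 1e9:.2f} B tokens")     # 4.03
print(f"final @ 24,000 steps: {tokens_at(24_000) / 1e9:.2f} B tokens")  # 100.66
print(f"warmup: about {round(1e9 / tokens_per_step)} steps")            # 238

# ASSUMPTION: each eval batch holds MICRO_BATCH * BLOCK_SIZE tokens,
# regardless of the evaluated sequence length.
eval_tokens = 15 * 4 * MICRO_BATCH * BLOCK_SIZE
print(f"eval pass: {eval_tokens:,} tokens")                             # 1,966,080
```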

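Measured throughput on 8 × H200: 72.7 K tokens / sec / GPU (≈ 580 K tokens / sec aggregate). At that rate, 100 B tokens take ≈ 48 hours of training time (100 × 10⁹ tokens ÷ 580 × 10³ tokens / sec ≈ 172 000 s), excluding evaluation and checkpoint overhead.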

The exact launch script is checked in at off/GatedDeltaNet-2/scripts/pretrain_gdn2_370m_fineweb_edu_100bt.sh.
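The launch script is the authoritative source for the schedule. As a rough illustration of the learning-rate row in the recipe (peak 4 × 10⁻⁴, cosine schedule, 1 × 10⁹ warmup tokens), the sketch below makes several assumptions not confirmed by this card: warmup is linear, the cosine decays over the full 24 000 steps, and the floor `min_lr` defaults to 0. The function name `lr_at` is hypothetical.

```python
import math

PEAK_LR = 4e-4
TOKENS_PER_STEP = 4_194_304
WARMUP_STEPS = round(1e9 / TOKENS_PER_STEP)  # about 238 steps
TOTAL_STEPS = 24_000                         # about 100 B tokens


def lr_at(step: int, min_lr: float = 0.0) -> float:
    """Learning rate at an optimizer step (ASSUMED linear warmup + cosine decay)."""
    if step < WARMUP_STEPS:
        # Linear ramp that reaches PEAK_LR on the last warmup step.
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (PEAK_LR - min_lr) * (1.0 + math.cos(math.pi * progress))


for step in (0, WARMUP_STEPS, 6_000, 12_000, TOTAL_STEPS):
    print(f"step {step:>6}: lr = {lr_at(step):.2e}")
```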


4. Live training status

This model is mid-training. New checkpoints appear here every ~5 B tokens. The latest live status is in docs/OVERNIGHT_LIVE_STATUS_KO.md.

| Milestone | Step | Tokens | Status |
|---|---|---|---|
| First val pass (sanity, after infinite-loop fix) | 960 | 4.0 B | ✅ val_loss 2.85 / 2.83 / 2.83 / 2.84 (4 K / 8 K / 12 K / 16 K), 96.7 s |
| First checkpoint + HF upload | 1 200 | 5.0 B | ✅ 2026-07-04 03:17 KST |
| Second checkpoint | 2 400 | 10 B | ⏳ pending |
| Mid-training | 6 000 | 25 B | ⏳ pending |
| Late-training | 12 000 | 50 B | ⏳ pending |
| Final | 24 000 | 100 B | ⏳ target 2026-07-05 ~05:00 KST |

Checkpoint naming gotcha (will be cleaned up post-run): the milestone file checkpoint-1B-model-ckpt.pth actually contains the 5 B-token state. The "1B" suffix is the milestone index (first 5 B milestone), not the token count. Subsequent milestones will be named checkpoint-2B-…, checkpoint-3B-…, etc. The README will be updated to clarify after the run completes.
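Until the files are renamed, a small helper can translate a checkpoint name into the token count it actually represents. This is a sketch: `milestone_tokens` is a hypothetical name, and it assumes milestone n corresponds to the n-th save, that is n × 1 200 steps at 4 194 304 tokens per step. Only the `checkpoint-<n>B-` prefix is parsed, so nothing is assumed about the rest of the file name.

```python
import re

STEPS_PER_MILESTONE = 1_200     # save interval from the training recipe
TOKENS_PER_STEP = 4_194_304     # global batch in tokens


def milestone_tokens(filename: str) -> int:
    """Tokens seen by the checkpoint whose name starts with 'checkpoint-<n>B-'."""
    match = re.match(r"checkpoint-(\d+)B-", filename)
    if match is None:
        raise ValueError(f"unrecognised checkpoint name: {filename!r}")
    index = int(match.group(1))  # milestone index, NOT billions of tokens
    return index * STEPS_PER_MILESTONE * TOKENS_PER_STEP


tokens = milestone_tokens("checkpoint-1B-model-ckpt.pth")
print(f"{tokens:,} tokens (about {tokens / 1e9:.0f} B)")  # 5,033,164,800 tokens (about 5 B)
```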


5. How to load

The checkpoint is a raw PyTorch state dict in the layout used by lit_gpt.model.GPT configured with gdn2_370M. The repo also mirrors the training code (the lit_gpt/ package from off/GatedDeltaNet-2/).

import torch
from lit_gpt.config import config_from_name
from lit_gpt.model import GPT

ckpt = torch.load("checkpoint-1B-model-ckpt.pth", map_location="cpu")
# unwrap the inner state dict when it is stored under a top-level "model" key
state = ckpt["model"] if "model" in ckpt else ckpt

cfg = config_from_name("gdn2_370M")
model = GPT(cfg)
model.load_state_dict(state, strict=True)
model.eval()

To run a quick continuation / generation, see the off/GatedDeltaNet-2/ subproject; the same lit_gpt package is used for both training and inference.


6. Intended use

This model is released for research purposes only.

Appropriate uses:

  • Studying the GDN-2 recurrence and comparing against other linear / recurrent architectures (Mamba, RWKV, Gated DeltaNet, RetNet, Lightning Attention, …).
  • Long-context retrieval and associative-recall experiments where the $O(N)$ training cost matters.
  • Component-level ablations (gate design, head count, recurrent-state size).

Inappropriate uses:

  • Production deployment. The model is small (370 M), still mid-training, and has not been instruction-tuned.
  • Downstream safety-critical tasks.
  • Anything requiring benchmark numbers we have not yet published. Wait for post-training evaluation.

7. Limitations (as of latest checkpoint)

  • Mid-training. Loss is still decreasing; downstream metrics will move.
  • Scale. 370 M parameters and a 4 K training context, small by modern standards. We chose this scale deliberately to match Mamba-370M and to fit a 36-hour campaign budget.
  • No instruction tuning. Outputs are raw next-token completions.
  • English-only training data (FineWeb-Edu is English educational web text).
  • No benchmark numbers yet. HellaSwag / ARC / MMLU / RULER will be run on the final 100 B checkpoint and added here.

8. Evaluation plan (post-training)

Once the 100 B-token checkpoint lands we will run:

| Suite | Length | Source |
|---|---|---|
| HellaSwag, ARC-e, ARC-c, PIQA | standard | lm-evaluation-harness |
| MMLU (5-shot) | standard | lm-evaluation-harness |
| RULER (niah, mqar, ct, cwe) | 4 K / 8 K / 16 K | custom loader |
| LongBench (retrieval subset) | up to 32 K | custom loader |
| BABILong (qa1–qa5) | up to 32 K | custom loader |

Results will be appended to this card and to docs/LMR_PUBLIC_BENCHMARK_SUMMARY_KO.md.


9. Citation

The GDN-2 architecture was introduced by NVIDIA in 2026. Please cite the upstream GDN-2 paper for the architecture itself.

For this specific checkpoint:

@misc{gdn2_370m_fineweb_edu_100b,
  title  = {GDN-2 370M trained on FineWeb-Edu 100B tokens},
  author = {LLM-OS-Models},
  year   = {2026},
  url    = {https://huggingface.co/LLM-OS-Models/gdn2-370m-fineweb-edu-100b},
  note   = {Work in progress; checkpoints published every 5B tokens}
}

10. Acknowledgements

  • The GDN-2 architecture and Triton kernels are from the Gated DeltaNet 2 authors (NVIDIA). This repo only trains their architecture.
  • Training data: HuggingFaceFW/fineweb-edu (sample/100BT slice).
  • Compute: 8 × NVIDIA H200 143 GB.
  • Tracking + live status infrastructure: the long-gdn campaign harness.