Upgrade to modded-nanoGPT + Muon checkpoint (val 2.65 -> 2.45)

Files changed (4)

README.md +33 -37
config.json +5 -2
model.py +64 -18
tinystories-25m.pt +2 -2

README.md CHANGED

@@ -12,7 +12,8 @@ tags:
   - pytorch
   - rope
   - gqa
-  - swiglu
   - multi-token-prediction
 pipeline_tag: text-generation
 ---
@@ -21,34 +22,40 @@ pipeline_tag: text-generation
 A small (~19.2M parameter) decoder-only GPT trained **from scratch** on
 [TinyStories](https://huggingface.co/datasets/roneneldan/TinyStories). It writes
-simple, coherent children's stories and is meant as a compact, hackable reference
-for modern LLM architecture techniques — small enough to train end-to-end in a few
-minutes on a consumer GPU (RTX 2060 Super, 8 GB).
 ## Sample output
-> **Once upon a time,** there was a little girl named Lily. She loved to play with
-> her dolls and sing songs. One day, she went to the park to play with her friends.
-> She saw a boy playing with a toy car and asked why he played too much...
-> **Lily and Tom went to the park and** played on the swings. They had a lot of fun.
-> They played with their toys and had a lot of fun. They also learned to be good and
-> not judge others. They were happy.
 ## Architecture
-A LLaMA-style decoder-only transformer with several modern techniques wired in:
 | Component | Choice |
 |---|---|
 | Layers / heads / dim | 8 layers, 6 heads, `n_embd` 384 |
 | Context length | 256 tokens |
 | Vocabulary | 16,384 (ByteLevel BPE) |
-| Position encoding | **RoPE** (rotary embeddings) |
-| Attention | **Grouped-Query Attention** (2 KV heads) |
-| MLP | **SwiGLU** |
 | Normalization | **RMSNorm** |
-| Extra heads | **Multi-Token Prediction** (2 auxiliary heads) for sample efficiency |
 | Weight tying | token embedding ↔ output head (and MTP heads) |
 ## Training
@@ -57,20 +64,17 @@ A LLaMA-style decoder-only transformer with several modern techniques wired in:
 |---|---|
 | Dataset | TinyStories (~2.1M stories) |
 | Steps | 3,000 |
-| Batch | 32 × 256 tokens |
-| Optimizer | AdamW, cosine schedule, 200-step warmup, peak LR 6e-4 |
-| Precision | fp16 mixed precision |
-| Hardware | 1× RTX 2060 Super (8 GB), ~7 minutes |
-| Throughput | ~57K tokens/sec |
-| Final loss | 2.62 (combined next-token + MTP auxiliary) |
-| Validation loss | 2.65 |
-This is a lightly trained demo checkpoint; longer training lowers loss further.
 ## Usage
-This is a **custom architecture**, so you need `model.py` from this repo (it's small
-and dependency-light). Download it next to your script, then:
 ```python
 import torch
@@ -105,22 +109,14 @@ print(tok.decode(out[0].tolist()))
 ## Limitations
-- Trained only on TinyStories — vocabulary and style are limited to simple
-  children's-story English. It is not a general-purpose assistant.
-- Small and lightly trained: it repeats phrases and occasionally drifts or
-  contradicts itself (e.g. swapping character names).
 - 256-token context.
-## Source
-Trained with the "train a language model from scratch" project — a from-scratch GPT
-with independently configurable modern techniques (RoPE, GQA, SwiGLU, RMSNorm, MTP,
-mHC, BitNet, TurboQuant) plus Muon/AdamW optimizers and speculative decoding.
 ## References
 - [TinyStories](https://arxiv.org/abs/2305.07759)
 - [RoFormer / RoPE](https://arxiv.org/abs/2104.09864)
 - [GQA](https://arxiv.org/abs/2305.13245)
-- [GLU Variants / SwiGLU](https://arxiv.org/abs/2002.05202)
 - [DeepSeek-V3 (MTP)](https://arxiv.org/abs/2412.19437)

   - pytorch
   - rope
   - gqa
+  - qk-norm
+  - muon
   - multi-token-prediction
 pipeline_tag: text-generation
 ---
 A small (~19.2M parameter) decoder-only GPT trained **from scratch** on
 [TinyStories](https://huggingface.co/datasets/roneneldan/TinyStories). It writes
+simple, coherent children's stories and is a compact, hackable reference for modern
+LLM architecture and optimization techniques, trained end-to-end in about 11 minutes on a
+single consumer GPU (RTX 2060 Super, 8 GB).
+This checkpoint follows a **modded-nanoGPT-style recipe**: it is trained with the **Muon**
+optimizer and adds **QK-Norm**, a **squared-ReLU MLP**, and **logit soft-capping**. Together
+these lower validation loss from 2.65 to **2.45** relative to a plain AdamW/SwiGLU baseline
+at the same 3,000 steps.
 ## Sample output
+> **Once upon a time,** there was a little girl named Lily. She loved to play outside
+> and explore the world around her. One day, she found a long piece of cardboard on the
+> floor. It was a big, white box with a bow on it. She picked it up and opened it. Inside
+> the box, she found a toy car...
+> **Lily and Tom went to the park and** saw a man with a big hat and a big smile. He was
+> very nice... "Sure, you can play with us," Lily said. They played tag and hide and seek.
 ## Architecture
+A LLaMA-/modded-nanoGPT-style decoder-only transformer:
 | Component | Choice |
 |---|---|
 | Layers / heads / dim | 8 layers, 6 heads, `n_embd` 384 |
 | Context length | 256 tokens |
 | Vocabulary | 16,384 (ByteLevel BPE) |
+| Position encoding | **RoPE** |
+| Attention | **Grouped-Query Attention** (2 KV heads) + **QK-Norm** |
+| MLP | **squared-ReLU** (ungated) |
 | Normalization | **RMSNorm** |
+| Logits | **soft-capped** at 15 (`cap·tanh(logits/cap)`) |
+| Extra heads | **Multi-Token Prediction** (2 auxiliary heads) |
 | Weight tying | token embedding ↔ output head (and MTP heads) |
 ## Training
 |---|---|
 | Dataset | TinyStories (~2.1M stories) |
 | Steps | 3,000 |
+| Batch | 40 × 256 tokens |
+| Optimizer | **Muon** (2D weights) + AdamW (embeddings/norms), peak LR 3e-3, cosine schedule |
+| Precision | fp16 mixed precision, `torch.compile` |
+| Hardware | 1× RTX 2060 Super (8 GB), ~11 minutes (~47K tokens/sec) |
+| Train loss | 2.47 (combined next-token + MTP auxiliary) |
+| **Validation loss** | **2.45** (perplexity 11.5) |
 ## Usage
+This is a **custom architecture**, so you need `model.py` from this repo (small,
+dependency-light). Download it next to your script, then:
 ```python
 import torch
 ## Limitations
+- Trained only on TinyStories — simple children's-story English, not a general assistant.
+- Small and lightly trained: occasional repetition, name swaps, or drift.
 - 256-token context.
 ## References
 - [TinyStories](https://arxiv.org/abs/2305.07759)
 - [RoFormer / RoPE](https://arxiv.org/abs/2104.09864)
 - [GQA](https://arxiv.org/abs/2305.13245)
 - [DeepSeek-V3 (MTP)](https://arxiv.org/abs/2412.19437)
+- [Muon optimizer](https://kellerjordan.github.io/posts/muon/) · [modded-nanoGPT](https://github.com/KellerJordan/modded-nanogpt)
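
The README reports validation loss together with perplexity. As a minimal sketch of how the two relate, assuming the reported loss is the mean per-token cross-entropy in nats (which is what `F.cross_entropy` returns), perplexity is the exponential of the loss. The helper name `perplexity` is hypothetical and not part of this repo.

```python
import math

def perplexity(mean_nll_nats: float) -> float:
    """Perplexity for a mean per-token cross-entropy given in nats."""
    return math.exp(mean_nll_nats)

# Validation losses quoted in this diff (rounded to two decimals).
old_ppl = perplexity(2.65)
new_ppl = perplexity(2.45)
print(f"perplexity: {old_ppl:.2f} -> {new_ppl:.2f}")
```

With the rounded loss of 2.45 this gives roughly 11.6, so the README's 11.5 was presumably computed from the unrounded loss.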

config.json CHANGED

@@ -6,10 +6,13 @@
   "n_layer": 8,
   "use_rope": true,
   "n_kv_head": 2,
-  "use_swiglu": true,
   "use_rmsnorm": true,
   "use_mtp": true,
   "mtp_heads": 2,
   "mtp_weight": 0.1,
-  "tie_mtp_lm_head": true
 }

   "n_layer": 8,
   "use_rope": true,
   "n_kv_head": 2,
+  "use_swiglu": false,
   "use_rmsnorm": true,
   "use_mtp": true,
   "mtp_heads": 2,
   "mtp_weight": 0.1,
+  "tie_mtp_lm_head": true,
+  "use_relu2": true,
+  "use_qk_norm": true,
+  "logit_cap": 15.0
 }
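
The new config sets `use_swiglu` to `false` and `use_relu2` to `true`. The sketch below mirrors the branch order that `Block.__init__` uses in the `model.py` hunk of this diff, where `use_relu2` is checked first and therefore wins even if `use_swiglu` were left on. The function `select_mlp` is a hypothetical stand-in that returns class names as strings, so it runs without PyTorch.

```python
def select_mlp(config: dict) -> str:
    """Return the name of the MLP class that Block would build for this config."""
    if config.get("use_relu2", False):
        return "ReLU2MLP"   # squared-ReLU, ungated
    if config.get("use_swiglu", False):
        return "SwiGLU"     # gated SiLU
    return "MLP"            # default fallback

# The flags from the updated config.json select the squared-ReLU MLP.
new_config = {"use_swiglu": False, "use_relu2": True}
print(select_mlp(new_config))
```

Setting `use_swiglu` to `false` explicitly is therefore not strictly required for the selection, but it keeps the config self-describing.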

model.py CHANGED

@@ -5,6 +5,13 @@ import math
 from torch.utils.checkpoint import checkpoint
 # --- mHC: Manifold-Constrained Hyper-Connections ---
 def sinkhorn(log_alpha, n_iters=5):
@@ -264,28 +271,26 @@ class MTPHead(nn.Module):
         self.future_idx = future_idx
         n_embd = config["n_embd"]
         vocab_size = config["vocab_size"]
         self.proj = nn.Linear(n_embd, n_embd)
         self.ln = nn.LayerNorm(n_embd)
         self.lm_head = nn.Linear(n_embd, vocab_size, bias=False)
     def forward(self, hidden, targets=None):
         if targets is not None:
             shift = self.future_idx
-            if targets.size(1) <= shift:
-                return None, None
-            # Only the first T-shift positions have a future target, so project
-            # just those instead of the full sequence (saves a vocab matmul slice).
-            h = self.ln(self.proj(hidden[:, :-shift]))
-            logits = self.lm_head(h)
-            targets_shifted = targets[:, shift:]
-            loss = F.cross_entropy(
-                logits.reshape(-1, logits.size(-1)),
-                targets_shifted.reshape(-1),
-                ignore_index=-1,
-            )
-            return logits, loss
-        h = self.ln(self.proj(hidden))
-        return self.lm_head(h), None
 # --- RoPE: Rotary Position Embeddings ---
@@ -339,6 +344,23 @@ class SwiGLU(nn.Module):
         return self.down(F.silu(self.gate(x)) * self.up(x))
 # --- Core model ---
 def make_norm(n_embd, use_rmsnorm=False):
@@ -359,6 +381,7 @@ class CausalSelfAttention(nn.Module):
             raise ValueError(f"n_head ({self.n_head}) must be divisible by n_kv_head ({self.n_kv_head})")
         self.head_dim = self.n_embd // self.n_head
         self.use_rope = config.get("use_rope", False)
         use_bitnet = config.get("use_bitnet", False)
         use_fast_bitnet = config.get("use_fast_bitnet", False)
@@ -367,6 +390,11 @@ class CausalSelfAttention(nn.Module):
         self.v_proj = make_linear(self.n_embd, self.n_kv_head * self.head_dim, use_bitnet=use_bitnet, use_fast_bitnet=use_fast_bitnet)
         self.proj = make_linear(self.n_embd, self.n_embd, use_bitnet=use_bitnet, use_fast_bitnet=use_fast_bitnet)
         if self.use_rope:
             self.rope = RotaryEmbedding(self.head_dim, max_seq_len=config.get("block_size", 512))
@@ -376,6 +404,10 @@ class CausalSelfAttention(nn.Module):
         k = self.k_proj(x).view(B, T, self.n_kv_head, self.head_dim).transpose(1, 2)
         v = self.v_proj(x).view(B, T, self.n_kv_head, self.head_dim).transpose(1, 2)
         if self.use_rope:
             cos, sin = self.rope(pos_offset + T)
             cos, sin = cos[pos_offset:pos_offset + T], sin[pos_offset:pos_offset + T]
@@ -416,7 +448,9 @@ class Block(nn.Module):
         self.ln1 = make_norm(config["n_embd"], use_rmsnorm)
         self.attn = CausalSelfAttention(config)
         self.ln2 = make_norm(config["n_embd"], use_rmsnorm)
-        if config.get("use_swiglu", False):
             self.mlp = SwiGLU(config)
         else:
             self.mlp = MLP(config)
@@ -452,6 +486,7 @@ class GPT(nn.Module):
         self.use_turboquant = config.get("use_turboquant", False)
         self.turboquant_bits = config.get("turboquant_bits", 4)
         self.use_activation_checkpointing = config.get("use_activation_checkpointing", False)
         use_rmsnorm = config.get("use_rmsnorm", False)
         self.tok_emb = nn.Embedding(config["vocab_size"], config["n_embd"])
@@ -513,7 +548,7 @@ class GPT(nn.Module):
     def forward(self, idx, targets=None, return_hidden=False):
         hidden = self._compute_hidden(idx)
-        logits = self.lm_head(hidden)
         loss = None
         if targets is not None:
             loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1), ignore_index=-1)
@@ -536,7 +571,7 @@ class GPT(nn.Module):
             for block, cache in zip(self.blocks, kv_caches or [None] * len(self.blocks)):
                 x = block(x, kv_cache=cache, pos_offset=pos_offset)
         hidden = self.ln_f(x)
-        logits = self.lm_head(hidden)
         if return_hidden:
             return logits, hidden
         return logits
@@ -916,6 +951,16 @@ FAST_2060_MTP_FBITNET_CONFIG = {
     "use_fast_bitnet": True,
 }
 FAST_2060_MTP_TURBO_CONFIG = {
     **FAST_2060_MTP_CONFIG,
     "use_turboquant": True,
@@ -957,6 +1002,7 @@ CONFIGS = {
     "fast_2060": FAST_2060_CONFIG,
     "fast_2060_mtp": FAST_2060_MTP_CONFIG,
     "fast_2060_mtp_fbitnet": FAST_2060_MTP_FBITNET_CONFIG,
     "fast_2060_mtp_turbo": FAST_2060_MTP_TURBO_CONFIG,
     "tiny_fast": TINY_FAST_CONFIG,
     "low_memory_2060": LOW_MEMORY_2060_CONFIG,

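Both versions of `MTPHead.forward` in this diff pair the output at position `t` with the target at position `t + shift`. The old code slices `hidden[:, :-shift]` before the projection, while the new code slices the logits afterwards. The sketch below illustrates only that index alignment, using plain Python lists. `mtp_pairs` is a hypothetical helper, and it assumes `shift >= 1`.

```python
def mtp_pairs(seq_len: int, shift: int) -> list[tuple[int, int]]:
    """Return (position, target_index) pairs used by an MTP head.

    Assumes shift >= 1; a shift of 0 would make the [:-shift] slice empty.
    """
    if seq_len <= shift:
        return []  # no position has a target `shift` steps ahead
    positions = list(range(seq_len))[:-shift]   # mirrors logits[:, :-shift]
    target_idx = list(range(seq_len))[shift:]   # mirrors targets[:, shift:]
    return list(zip(positions, target_idx))

print(mtp_pairs(5, 2))
```

Since the projection, norm, and output head all act per position, slicing before or after them selects the same positions. The new version computes logits for the last `shift` positions as well and then discards them for the loss.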
 from torch.utils.checkpoint import checkpoint
+def soft_cap(logits, cap):
+    """Gemma2/modded-nanoGPT logit soft-capping: cap * tanh(logits / cap). No-op if cap falsy."""
+    if cap:
+        return cap * torch.tanh(logits / cap)
+    return logits
 # --- mHC: Manifold-Constrained Hyper-Connections ---
 def sinkhorn(log_alpha, n_iters=5):
         self.future_idx = future_idx
         n_embd = config["n_embd"]
         vocab_size = config["vocab_size"]
+        self.logit_cap = config.get("logit_cap", 0)
         self.proj = nn.Linear(n_embd, n_embd)
         self.ln = nn.LayerNorm(n_embd)
         self.lm_head = nn.Linear(n_embd, vocab_size, bias=False)
     def forward(self, hidden, targets=None):
+        h = self.ln(self.proj(hidden))
+        logits = soft_cap(self.lm_head(h), self.logit_cap)
+        loss = None
         if targets is not None:
             shift = self.future_idx
+            if targets.size(1) > shift:
+                logits_shifted = logits[:, :-shift].contiguous()
+                targets_shifted = targets[:, shift:].contiguous()
+                loss = F.cross_entropy(
+                    logits_shifted.view(-1, logits_shifted.size(-1)),
+                    targets_shifted.view(-1),
+                    ignore_index=-1,
+                )
+        return logits, loss
 # --- RoPE: Rotary Position Embeddings ---
         return self.down(F.silu(self.gate(x)) * self.up(x))
+class ReLU2MLP(nn.Module):
+    """Ungated MLP with squared-ReLU activation (modded-nanoGPT). Simpler and a bit
+    faster than SwiGLU; competitive quality at small scale."""
+    def __init__(self, config):
+        super().__init__()
+        n_embd = config["n_embd"]
+        hidden = 4 * n_embd
+        use_bitnet = config.get("use_bitnet", False)
+        use_fast_bitnet = config.get("use_fast_bitnet", False)
+        self.fc = make_linear(n_embd, hidden, bias=False, use_bitnet=use_bitnet, use_fast_bitnet=use_fast_bitnet)
+        self.proj = make_linear(hidden, n_embd, bias=False, use_bitnet=use_bitnet, use_fast_bitnet=use_fast_bitnet)
+    def forward(self, x):
+        return self.proj(F.relu(self.fc(x)).square())
 # --- Core model ---
 def make_norm(n_embd, use_rmsnorm=False):
             raise ValueError(f"n_head ({self.n_head}) must be divisible by n_kv_head ({self.n_kv_head})")
         self.head_dim = self.n_embd // self.n_head
         self.use_rope = config.get("use_rope", False)
+        self.use_qk_norm = config.get("use_qk_norm", False)
         use_bitnet = config.get("use_bitnet", False)
         use_fast_bitnet = config.get("use_fast_bitnet", False)
         self.v_proj = make_linear(self.n_embd, self.n_kv_head * self.head_dim, use_bitnet=use_bitnet, use_fast_bitnet=use_fast_bitnet)
         self.proj = make_linear(self.n_embd, self.n_embd, use_bitnet=use_bitnet, use_fast_bitnet=use_fast_bitnet)
+        # QK-Norm (modded-nanoGPT): RMSNorm Q and K over the head dim before attention.
+        if self.use_qk_norm:
+            self.q_norm = nn.RMSNorm(self.head_dim)
+            self.k_norm = nn.RMSNorm(self.head_dim)
         if self.use_rope:
             self.rope = RotaryEmbedding(self.head_dim, max_seq_len=config.get("block_size", 512))
         k = self.k_proj(x).view(B, T, self.n_kv_head, self.head_dim).transpose(1, 2)
         v = self.v_proj(x).view(B, T, self.n_kv_head, self.head_dim).transpose(1, 2)
+        if self.use_qk_norm:
+            q = self.q_norm(q)
+            k = self.k_norm(k)
         if self.use_rope:
             cos, sin = self.rope(pos_offset + T)
             cos, sin = cos[pos_offset:pos_offset + T], sin[pos_offset:pos_offset + T]
         self.ln1 = make_norm(config["n_embd"], use_rmsnorm)
         self.attn = CausalSelfAttention(config)
         self.ln2 = make_norm(config["n_embd"], use_rmsnorm)
+        if config.get("use_relu2", False):
+            self.mlp = ReLU2MLP(config)
+        elif config.get("use_swiglu", False):
             self.mlp = SwiGLU(config)
         else:
             self.mlp = MLP(config)
         self.use_turboquant = config.get("use_turboquant", False)
         self.turboquant_bits = config.get("turboquant_bits", 4)
         self.use_activation_checkpointing = config.get("use_activation_checkpointing", False)
+        self.logit_cap = config.get("logit_cap", 0)
         use_rmsnorm = config.get("use_rmsnorm", False)
         self.tok_emb = nn.Embedding(config["vocab_size"], config["n_embd"])
     def forward(self, idx, targets=None, return_hidden=False):
         hidden = self._compute_hidden(idx)
+        logits = soft_cap(self.lm_head(hidden), self.logit_cap)
         loss = None
         if targets is not None:
             loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1), ignore_index=-1)
             for block, cache in zip(self.blocks, kv_caches or [None] * len(self.blocks)):
                 x = block(x, kv_cache=cache, pos_offset=pos_offset)
         hidden = self.ln_f(x)
+        logits = soft_cap(self.lm_head(hidden), self.logit_cap)
         if return_hidden:
             return logits, hidden
         return logits
     "use_fast_bitnet": True,
 }
+# modded-nanoGPT-style recipe. QK-Norm helps under any optimizer; ReLU2 and
+# logit_cap only pay off paired with Muon's higher LR. Train with --optimizer muon.
+FAST_2060_MODDED_CONFIG = {
+    **FAST_2060_MTP_CONFIG,
+    "use_swiglu": False,   # superseded by ReLU2 below
+    "use_relu2": True,
+    "use_qk_norm": True,
+    "logit_cap": 15.0,
+}
 FAST_2060_MTP_TURBO_CONFIG = {
     **FAST_2060_MTP_CONFIG,
     "use_turboquant": True,
     "fast_2060": FAST_2060_CONFIG,
     "fast_2060_mtp": FAST_2060_MTP_CONFIG,
     "fast_2060_mtp_fbitnet": FAST_2060_MTP_FBITNET_CONFIG,
+    "fast_2060_modded": FAST_2060_MODDED_CONFIG,
     "fast_2060_mtp_turbo": FAST_2060_MTP_TURBO_CONFIG,
     "tiny_fast": TINY_FAST_CONFIG,
     "low_memory_2060": LOW_MEMORY_2060_CONFIG,

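The `soft_cap` helper added in this diff applies `cap * tanh(logits / cap)`. As a rough illustration of its effect, here is a scalar version using only the standard library. `soft_cap_scalar` is a hypothetical name, and the real helper operates on tensors via `torch.tanh`. Small logits pass through almost unchanged, large ones saturate at the cap, and a falsy cap disables the transform.

```python
import math

def soft_cap_scalar(logit: float, cap: float) -> float:
    """Scalar analogue of soft_cap: cap * tanh(logit / cap); no-op if cap is falsy."""
    if cap:
        return cap * math.tanh(logit / cap)
    return logit

# cap=15.0 matches "logit_cap" in the updated config.json.
for x in (0.5, 10.0, 100.0, -100.0):
    print(x, round(soft_cap_scalar(x, 15.0), 4))
```

Because `tanh` is monotonic, capping preserves the ranking of logits, so greedy decoding picks the same token while the softmax distribution becomes less peaked.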
tinystories-25m.pt CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:f08fa57d4360cd654e407322bce66695018c5b9b673df8be5f8c9f5631fe3103
-size 76793291

 version https://git-lfs.github.com/spec/v1
+oid sha256:69375d07a06ef3b325f3189b23b0caf21a7983fc1e87316b0f5651c579331af3
+size 76800459