Initial microGPT upload


Files changed (5)

README.md +473 -0
ckpt.pt +3 -0
inference.py +114 -0
model.py +152 -0
tokenizer.json +0 -0

README.md ADDED

	@@ -0,0 +1,473 @@

+---
+license: mit
+language:
+- en
+tags:
+- text-generation
+- transformer
+- educational
+- tiny-llm
+- from-scratch
+- decoder-only
+- gpt
+datasets:
+- roneneldan/TinyStories
+pipeline_tag: text-generation
+library_name: pytorch
+model-index:
+- name: microgpt
+  results:
+  - task:
+      type: text-generation
+      name: Story completion
+    dataset:
+      name: TinyStories (validation split)
+      type: roneneldan/TinyStories
+    metrics:
+    - type: cross-entropy
+      value: 2.25
+      name: Validation cross-entropy loss
+    - type: perplexity
+      value: 9.49
+      name: Validation perplexity
+---
+# microGPT
+A **1.35M-parameter decoder-only transformer** trained from scratch on the
+[TinyStories](https://huggingface.co/datasets/roneneldan/TinyStories) dataset.
+The entire training run took roughly 1.5 hours on an Apple Silicon laptop.
+At ~130,000× smaller than GPT-3 (175B parameters), it can still produce
+coherent simple children's stories.
+This is an **educational artifact**, not a production model. Its purpose is
+to make every component of a modern LLM legible, debuggable, and rebuildable
+on consumer hardware.
+---
+## Quick facts
+| | |
+|---|---|
+| **Architecture** | Decoder-only transformer (GPT-style) |
+| **Parameters** | 1,345,792 trainable (1.35M) |
+| **File size on disk** | ~5.4 MB (5.1 MiB, float32) |
+| **Training data** | ~470M tokens of TinyStories |
+| **Training compute** | ~1.5 hours on Apple Silicon (MPS) |
+| **Final val loss** | 2.25 (perplexity 9.49) |
+| **Context window** | 256 tokens |
+| **Tokenizer** | Byte-level BPE, vocab=4096 |
+| **License** | MIT |
+---
+## Architecture in detail
+```
+Input tokens (B, T)
+    │
+    ├─► Token Embedding   (4096 → 128)
+    │                          │
+    └─► Position Embedding ────┘ ← element-wise sum
+            │
+            ▼  (B, T, 128)
+   ┌──── Block × 4 ─────────────────────────────────────────────────────┐
+   │                                                                    │
+   │   x = x + CausalSelfAttention(LayerNorm(x))  ← 4 heads             │
+   │   x = x + MLP(LayerNorm(x))                  ← 128→512→128, GELU   │
+   │                                                                    │
+   └────────────────────────────────────────────────────────────────────┘
+            │
+            ▼  (B, T, 128)
+        LayerNorm
+            │
+            ▼
+   Linear (128 → 4096)   ← weight-tied with token embedding
+            │
+            ▼  (B, T, 4096)
+        Logits
+```
+| Hyperparameter | Value | Notes |
+|---|---|---|
+| `n_layers` | 4 | Stacked transformer blocks |
+| `d_model` | 128 | Hidden dimension |
+| `n_heads` | 4 | Each head is 128/4 = 32 dim |
+| `head_dim` | 32 | Per-head dimensionality |
+| `ffn_dim` | 512 | MLP intermediate width (4×d_model) |
+| `ctx_len` | 256 | Maximum input length in tokens |
+| `vocab_size` | 4,096 | BPE-derived vocabulary |
+| Normalization | LayerNorm | Pre-LN (applied before sublayers) |
+| Position encoding | Learned | Absolute, additive |
+| Activation | GELU | In the MLP |
+| Attention | Multi-head, causal | Implemented via `F.scaled_dot_product_attention` |
+| Embedding tying | Yes | Output projection shares weight with `tok_emb` |
+| Bias on linear layers | No | Following common modern practice |
+| Dropout | 0.1 (training) | 0.0 at inference |
+### Parameter breakdown — where the 1.35M live
+| Component | Shape | Params | % |
+|---|---|---|---|
+| Token embeddings (`tok_emb.weight`) | (4096, 128) | 524,288 | 39.0% |
+| Position embeddings (`pos_emb.weight`) | (256, 128) | 32,768 | 2.4% |
+| 4 × transformer block | — | 788,480 | 58.6% |
+|     └─ Per block: `ln1` (γ, β) | (128,) × 2 | 256 | |
+|     └─ Per block: `attn.qkv` | (384, 128) | 49,152 | |
+|     └─ Per block: `attn.proj` | (128, 128) | 16,384 | |
+|     └─ Per block: `ln2` (γ, β) | (128,) × 2 | 256 | |
+|     └─ Per block: `mlp.fc1` | (512, 128) | 65,536 | |
+|     └─ Per block: `mlp.fc2` | (128, 512) | 65,536 | |
+| Final LayerNorm (`ln_f`) | (128,) × 2 | 256 | 0.02% |
+| Output projection (`head.weight`) | (4096, 128) | 0 | tied |
+| **Total** | | **1,345,792** | |
+Two observations worth absorbing:
+- **Embeddings are 41% of total parameters** at this scale. This is typical of small models — the vocab × d_model matrix dominates. As models grow, the transformer blocks become the much larger fraction (frontier models are >90% transformer body, with embeddings a rounding error).
+- **MLPs (`fc1` + `fc2`) account for about two-thirds of every block's params**: 131,072 of 197,120 ≈ 66%. Recent interpretability research suggests MLPs are where most factual knowledge gets stored. At frontier scale this stays roughly true.
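The breakdown above can be reproduced from the hyperparameters alone. This is a minimal sketch, assuming bias-free linear layers, a scale and a shift vector per LayerNorm, and a tied output head, as the tables describe. The function name `count_params` is illustrative and not part of the released code.

```python
def count_params(vocab_size=4096, d_model=128, n_layers=4, ffn_dim=512, ctx_len=256):
    """Count trainable parameters for the architecture in the tables above."""
    tok_emb = vocab_size * d_model            # shared with the output head (tied)
    pos_emb = ctx_len * d_model               # learned absolute positions
    layer_norm = 2 * d_model                  # scale (gamma) + shift (beta)
    attn = d_model * 3 * d_model + d_model * d_model   # fused QKV + output projection
    mlp = d_model * ffn_dim + ffn_dim * d_model        # fc1 + fc2
    per_block = 2 * layer_norm + attn + mlp
    total = tok_emb + pos_emb + n_layers * per_block + layer_norm  # + final LayerNorm
    return {
        "tok_emb": tok_emb,
        "pos_emb": pos_emb,
        "per_block": per_block,
        "mlp_per_block": mlp,
        "total": total,
    }


counts = count_params()
embedding_share = (counts["tok_emb"] + counts["pos_emb"]) / counts["total"]
mlp_share = counts["mlp_per_block"] / counts["per_block"]
print(counts["total"], f"{embedding_share:.1%}", f"{mlp_share:.1%}")
```

Varying `vocab_size` or `d_model` in the call shows how quickly the embedding share moves at this scale.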
+---
+## Training
+### Data
+- **Dataset:** [`roneneldan/TinyStories`](https://huggingface.co/datasets/roneneldan/TinyStories) (Eldan & Li, 2023)
+- **Stories:** ~2.1M (train) + ~22K (validation)
+- **Tokens (after BPE):** ~470M (train) + ~5M (validation)
+- **Why TinyStories specifically:** synthetic dataset designed so vocabulary
+  and grammar stay within what a 3–4 year-old understands, making coherent
+  generation possible at very small model scales. Without this curation, a
+  1.35M-param model on general web text produces gibberish.
+### Tokenizer
+- **Type:** byte-level Byte-Pair Encoding (BPE)
+- **Vocabulary:** 4,096 tokens (including special tokens `<unk>`, `<eos>`)
+- **Trained on:** 50,000 stories from the train split (vocab converges
+  quickly; full corpus produces a near-identical tokenizer)
+- **Avg compression:** ~4 characters per token on TinyStories text
+### Optimization
+| Hyperparameter | Value |
+|---|---|
+| Optimizer | AdamW |
+| β₁, β₂ | 0.9, 0.95 |
+| Weight decay | 0.1 |
+| Peak learning rate | 3e-4 |
+| Min learning rate | 3e-5 |
+| Schedule | Linear warmup (200 steps) → cosine decay |
+| Batch size (sequences) | 64 |
+| Sequence length | 256 |
+| Tokens per step | 16,384 |
+| Total steps | 20,000 |
+| Total tokens seen | ~327M |
+| Gradient clipping | 1.0 (global L2 norm) |
+| Random seed | 1337 |
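The schedule and token budget in this table can be sketched in a few lines. The exact warmup formula is not given in this README, so the linear ramp below (reaching the peak at the last warmup step) is an assumption, and `lr_at` is a hypothetical helper name.

```python
import math

PEAK_LR, MIN_LR = 3e-4, 3e-5
WARMUP_STEPS, TOTAL_STEPS = 200, 20_000
BATCH_SIZE, SEQ_LEN = 64, 256


def lr_at(step):
    """Learning rate at a 0-indexed step: linear warmup, then cosine decay."""
    if step < WARMUP_STEPS:
        # assumed form of the warmup: ramp linearly up to PEAK_LR
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = min((step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS), 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 down to 0
    return MIN_LR + cosine * (PEAK_LR - MIN_LR)


TOKENS_PER_STEP = BATCH_SIZE * SEQ_LEN           # 16,384
TOTAL_TOKENS = TOKENS_PER_STEP * TOTAL_STEPS     # 327,680,000 (~327M)

for step in (0, 199, 10_000, 20_000):
    print(step, f"{lr_at(step):.2e}")
print(TOKENS_PER_STEP, TOTAL_TOKENS)
```

Because ~327M tokens were seen out of a ~470M-token corpus, the run covers less than one full pass over the training split.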
+### Hardware & wall-clock
+| | |
+|---|---|
+| Hardware | Apple M-series laptop (MPS backend) |
+| Precision | float32 |
+| Wall-clock | ~1.5 hours |
+| Peak memory | ~1.5 GB |
+| Disk footprint | ~1 GB tokenized corpus + 5.4 MB (5.1 MiB) checkpoint |
+---
+## Evaluation
+### Held-out validation loss
+| Step | Val loss | Perplexity |
+|---|---|---|
+| 0 (init) | 8.32 | 4096 |
+| ~17,500 | 2.26 | 9.59 |
+| ~20,000 | **2.25** | **9.49** |
+For context: a uniform random predictor over the 4,096-token vocabulary
+would have loss `ln(4096) = 8.32` and perplexity 4,096. A perplexity of
+9.49 means the model has effectively narrowed each prediction down to
+~9.5 plausible next-token choices, on average.
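The relationship between loss and perplexity quoted above is plain arithmetic. A small sketch using only the standard library (the variable names are illustrative):

```python
import math

VOCAB_SIZE = 4096
FINAL_VAL_LOSS = 2.25  # cross-entropy in nats, from the table above

# A uniform predictor assigns probability 1/4096 to every token.
uniform_loss = math.log(VOCAB_SIZE)
uniform_ppl = math.exp(uniform_loss)

# Perplexity is exp(loss); exp(-loss) is the geometric-mean probability
# the model assigned to the correct next token.
final_ppl = math.exp(FINAL_VAL_LOSS)
mean_correct_prob = math.exp(-FINAL_VAL_LOSS)

print(f"uniform: loss={uniform_loss:.2f}, perplexity={uniform_ppl:.0f}")
print(f"final:   loss={FINAL_VAL_LOSS:.2f}, perplexity={final_ppl:.2f}")
print(f"geometric-mean probability of the correct token: {mean_correct_prob:.3f}")
```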
+> **Note on missing checkpoint progression.** This release contains only the
+> final checkpoint. The original plan was to capture step-tagged snapshots
+> (steps 200, 1000, 3000, 8000, 20000) to demonstrate capability *emergence*
+> across training — gibberish → words → grammar → narrative. A monitoring
+> script was started after training had already passed the early steps, so
+> only the final checkpoint was preserved. A v0.2 release with multi-checkpoint
+> snapshots and an emergence comparison is planned.
+### Capability probes
+Three diagnostic prompts designed to probe specific capabilities, evaluated
+on the final checkpoint (step 20,000, temperature=0.8, top-k=40):
+**Short-range coherence** — *"Lily picked up the red ball and"*
+> ...put it on the floor. She threw the red ball with her hands and pulled
+> it onto the floor. She tried to make it shine, but it made her feel good.
+> "Oh no, no!" she shouted. "Sharing is very dangerous!"
+Pronoun continuity holds; ball persists as object. ✓
+**Causal reasoning** — *"Tom was very sad because"*
+> ...he had no one to play. He thought that it would be fun and share his
+> toys with Timmy. His mom took him inside and said, "I love you too, Timmy!"
+"because" produces a real cause of sadness; emotional consistency holds for
+2–3 sentences before drifting. ✓ (with caveats)
+**Narrative composition** — *"The little dragon wanted to fly but"*
+> ...her mommy said no. The bear was very sad that he was gone. He wanted
+> to fly anymore and get lost.
+Initial obstacle is set up correctly, but the model loses track of which
+character is which (dragon → bear → "he"). ✗
+This pattern — local coherence ✓, multi-sentence composition partial — is
+expected at this scale. Narrative arc requires planning across many tokens,
+which is one of the last capabilities to emerge in language models even at
+frontier scale.
+---
+## Intended use
+**In scope:**
+- Educational reference for the GPT-style transformer architecture
+- Demonstration of end-to-end LLM training on consumer hardware
+- Generating short, simple, TinyStories-style English children's narratives
+- Exploring how sampling parameters (temperature, top-k, top-p) affect output
+- Comparison baseline for tiny-model research
+**Out of scope:**
+- General-purpose text generation (vocabulary is restricted to TinyStories)
+- Question answering, instruction following, or chat (no SFT or RLHF stage)
+- Anything requiring factual accuracy (no factual grounding)
+- Non-English text (English-only training data)
+- Long-form generation (256-token context window)
+---
+## Limitations and biases
+- **Distribution lock-in:** Trained exclusively on synthetic children's
+  stories. Generation outside this distribution (e.g., technical text,
+  adult themes, dialogue formats) will be incoherent.
+- **No instruction following:** This is a base model — pre-training only.
+  It completes text; it does not answer questions or follow instructions.
+- **Hallucination:** No factual grounding. The model has no concept of
+  "I don't know" — it produces the most statistically plausible
+  continuation, which is often false outside the training distribution.
+- **Context window:** 256 tokens is too short to model long dependencies.
+- **Synthetic data biases:** TinyStories was generated by GPT-3.5/4 with
+  prompted constraints, so it inherits some of that generator's stylistic
+  patterns and any biases encoded therein.
+- **No safety training:** No RLHF, no Constitutional AI, no content
+  filtering. While the training data is innocuous, prompts that
+  push toward harmful outputs receive no safeguards.
+- **Memorization vs generalization:** Some completions ("She was very
+  happy and they played all day") are likely memorized stylistic
+  patterns rather than novel generation.
+---
+## How to use
+### Inference
+```python
+from inference import NanoSLMInference
+slm = NanoSLMInference("ckpt.pt", "tokenizer.json")
+text = slm.generate(
+    "Once upon a time, there was a little",
+    max_new_tokens=200,
+    temperature=0.8,
+    top_k=40,
+)
+print(text)
+```
+### Sampling parameters
+| Parameter | Effect |
+|---|---|
+| `temperature` | Scales logits before softmax. 0 = greedy (deterministic, often repetitive). 1.0 = no scaling. >1 = more random. Typical: 0.7–1.0. |
+| `top_k` | Keep only the *k* highest-probability tokens. Filters tail-of-distribution garbage. Typical: 40–100. |
+| `top_p` (nucleus) | Keep the smallest set of tokens with cumulative probability ≥ p. Adapts the cutoff to distribution shape. Typical: 0.9–0.95. |
+| `seed` | Sets PyTorch RNG for reproducibility. |
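To make the table concrete, here is a dependency-free sketch of how the three filters reshape a next-token distribution. It follows the same order as the table (temperature, then top-k, then top-p) but works on plain Python lists rather than tensors. `next_token_probs` is a hypothetical helper, not the function shipped in `inference.py`.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


def next_token_probs(logits, temperature=1.0, top_k=None, top_p=None):
    """Return the sampling distribution after temperature / top-k / top-p."""
    if temperature == 0:
        # greedy: all probability mass on the argmax
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    if top_k is not None:
        kth = sorted(scaled, reverse=True)[min(top_k, len(scaled)) - 1]
        scaled = [x if x >= kth else -math.inf for x in scaled]
    if top_p is not None:
        probs = softmax(scaled)
        order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
        keep, cumulative = set(), 0.0
        for i in order:  # smallest prefix whose cumulative probability >= top_p
            keep.add(i)
            cumulative += probs[i]
            if cumulative >= top_p:
                break
        scaled = [x if i in keep else -math.inf for i, x in enumerate(scaled)]
    return softmax(scaled)


toy_logits = [2.0, 1.0, 0.5, -1.0]
print(next_token_probs(toy_logits, temperature=0.8, top_k=2))
print(next_token_probs(toy_logits, top_p=0.5))
print(next_token_probs(toy_logits, temperature=0))
```

Sampling a token from the returned list (for example with `random.choices`) completes one generation step.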
+---
+## How this model is served
+A live demo is hosted on [Hugging Face Spaces](https://huggingface.co/spaces/brettleehari/microgpt-demo).
+The serving stack is intentionally minimal:
+```
+User browser
+    ↓ HTTPS
+HF Spaces (free CPU instance, 2 vCPU / 16 GB RAM)
+    ↓
+Gradio + FastAPI/uvicorn
+    ↓
+PyTorch eager-mode forward pass on CPU
+    ↓
+Autoregressive token generation, one token per pass
+```
+Approximate latency for 100 generated tokens: **~3 seconds on Spaces' free
+CPU**, **~0.5 seconds on Apple M-series with MPS**.
+What this serving setup deliberately does *not* implement (each is a separate
+upgrade and a useful learning exercise):
+- **KV-caching** — every generation step re-processes all prior tokens.
+  A real implementation caches K/V tensors and pays only for the new token.
+- **Continuous batching** — multiple users would queue serially. Production
+  servers (vLLM, TGI) batch concurrent requests dynamically.
+- **Quantization** — weights are float32. int8 would shrink memory ~4×, int4 ~8×.
+- **Compiled graphs** — eager-mode PyTorch leaves performance on the table
+  vs `torch.compile()`, ONNX Runtime, or a dedicated engine.
+For a model this small the overheads don't matter. At any production scale,
+*every one of the above becomes critical to unit economics*.
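A rough way to see what the missing KV cache costs is to count how many token positions pass through the model during one generation. This sketch assumes the prompt plus the generated tokens fit inside the 256-token window, so no sliding-window truncation occurs. `positions_processed` is an illustrative helper, not part of the serving code.

```python
def positions_processed(prompt_len, new_tokens, ctx_len=256, kv_cache=False):
    """Count token positions fed through the model while generating."""
    assert prompt_len + new_tokens <= ctx_len, "sketch ignores window truncation"
    total = 0
    for step in range(new_tokens):
        if kv_cache and step > 0:
            total += 1                  # only the newest token is processed
        else:
            total += prompt_len + step  # the whole sequence is re-processed
    return total


without_cache = positions_processed(prompt_len=10, new_tokens=100)
with_cache = positions_processed(prompt_len=10, new_tokens=100, kv_cache=True)
print(without_cache, with_cache, round(without_cache / with_cache, 1))
```

The gap grows roughly quadratically with output length, which is why the cache matters far more at long contexts than it does here.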
+---
+## Comparison with frontier models
+The architecture shares its core structure with GPT-2/3, Llama, Mistral, and
+Claude. The differences below are evolutionary refinements, not categorical
+changes: the core "decoder-only transformer trained with next-token
+prediction" recipe is the same.
+| | microGPT (this) | Llama 3 70B |
+|---|---|---|
+| Parameters | 1.35M | 70B (~52,000× larger) |
+| Layers | 4 | 80 |
+| `d_model` | 128 | 8,192 |
+| Heads | 4 (multi-head) | 64 (grouped-query attention) |
+| Context | 256 | 8,192 at release (128K in Llama 3.1) |
+| Vocab | 4,096 | 128,256 |
+| Position | Learned absolute | Rotary (RoPE) |
+| Activation | GELU | SwiGLU |
+| Normalization | LayerNorm | RMSNorm |
+| Training tokens | ~327M | ~15T (~46,000× more) |
+| Training compute | well under 1 kWh on a laptop | many MW-months on H100 clusters |
+---
+## Glossary
+A short reference for the terminology used above. Worth absorbing — these
+terms come up constantly in AI literature and interviews.
+**Parameter / weight.** A single learnable number stored in the model.
+Updated during training, read during inference. A "1.35M parameter model"
+literally has 1.35M of these numbers.
+**Embedding.** A learned vector representation of a discrete object (token,
+position). Implemented as a lookup table.
+**Token.** The atomic unit of text the model operates on. Produced by the
+tokenizer; typically ~4 characters of English per token for byte-level BPE.
+**Tokenizer.** The deterministic, reversible function that converts strings
+to integer ID sequences and back. Decisions made here (vocab size, BPE
+merges) propagate through the entire model.
+**BPE (Byte-Pair Encoding).** A subword tokenization algorithm that
+iteratively merges the most frequent adjacent pairs of symbols into new
+vocabulary entries.
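The merge loop can be illustrated with a toy trainer. This sketch works on characters within whitespace-separated words, whereas the released tokenizer is byte-level and was built with a tokenizer library, so treat this as the idea only. All function names here are hypothetical.

```python
from collections import Counter


def most_frequent_pair(words):
    """words maps a tuple of symbols to its frequency in the corpus."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    # ties are broken by first occurrence (dict insertion order)
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(words, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged


def train_bpe(text, num_merges):
    words = dict(Counter(tuple(w) for w in text.split()))
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges, words


merges, words = train_bpe("the cat sat on the mat the end", num_merges=3)
print(merges)
print(words)
```

Each learned merge becomes one new vocabulary entry, so a 4,096-token vocabulary corresponds to a few thousand such merges on top of the base symbols.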
+**Logits.** The raw, unnormalized scores the model outputs — one per
+vocabulary token at each position. Becomes a probability distribution after
+softmax.
+**Softmax.** Function that converts logits to probabilities by exponentiating
+and normalizing.
+**Cross-entropy loss.** The training objective: how surprised the model is
+by the correct next token. Equals 0 if the model assigned probability 1 to
+the right answer; equals `ln(vocab_size)` if the model is uniformly
+uninformed.
+**Perplexity.** `exp(loss)`. The "effective number of choices" the model is
+deciding between. Useful because it has a more intuitive scale than loss.
+**Decoder-only / autoregressive.** The model only attends to past tokens
+(causal mask), and generates one token at a time conditioned on what it has
+already produced.
+**Self-attention.** The mechanism by which each position computes a
+weighted combination of all (allowed) other positions, where the weights
+depend on the content at each position.
+**Multi-head attention.** Self-attention computed in parallel across `n`
+subspaces ("heads"), each with `d_model / n` dimensions. Different heads
+empirically learn to specialize.
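A single attention head with a causal mask fits in a few lines of plain Python. This is an illustrative sketch on nested lists, not the tensor implementation in `model.py`. A multi-head layer would run several of these on separate slices of the hidden vector and concatenate the results.

```python
import math


def causal_attention(q, k, v):
    """One head. q, k, v are lists of T vectors; position t sees only 0..t."""
    head_dim = len(q[0])
    outputs = []
    for t in range(len(q)):
        # scaled dot-product scores against allowed (past and current) positions
        scores = [
            sum(qi * ki for qi, ki in zip(q[t], k[s])) / math.sqrt(head_dim)
            for s in range(t + 1)
        ]
        m = max(scores)
        weights = [math.exp(x - m) for x in scores]
        norm = sum(weights)
        weights = [w / norm for w in weights]
        # weighted combination of the value vectors
        outputs.append(
            [sum(w * v[s][j] for s, w in enumerate(weights)) for j in range(len(v[0]))]
        )
    return outputs


q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
out = causal_attention(q, k, v)
print(out)
```

Changing the last row of `q`, `k`, or `v` leaves the first two output rows untouched, which is the causal property the decoder relies on.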
+**KV cache.** At inference time, the Key and Value tensors from previous
+tokens can be cached and reused, avoiding redundant computation. Critical
+for production serving; not implemented in this model.
+**Pre-LayerNorm.** Applying LayerNorm *before* the attention/MLP sublayers,
+not after. Stabilizes training of deep transformers.
+**Weight tying.** Sharing parameters between the input embedding matrix and
+the output projection matrix. Saves memory; usually improves quality.
+**Cosine learning-rate schedule.** Learning rate ramps up linearly during
+warmup, then decays following a cosine curve. Standard for transformer
+training.
+**Gradient clipping.** Capping the global L2 norm of gradients during
+backpropagation to prevent destabilizing weight updates.
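Global-norm clipping can be sketched without any framework. The version below treats gradients as nested lists and rescales them all by one shared factor. The small `eps` term and the helper name `clip_grad_norm` are assumptions made for illustration, not code from this repository.

```python
import math


def clip_grad_norm(grads, max_norm=1.0, eps=1e-6):
    """Scale all gradients so their combined L2 norm is at most max_norm.

    grads: list of per-parameter gradient lists. Returns (clipped, norm_before).
    """
    total_norm = math.sqrt(sum(g * g for tensor in grads for g in tensor))
    scale = min(1.0, max_norm / (total_norm + eps))  # never scales gradients up
    clipped = [[g * scale for g in tensor] for tensor in grads]
    return clipped, total_norm


big, norm_before = clip_grad_norm([[3.0, 0.0], [4.0]], max_norm=1.0)
norm_after = math.sqrt(sum(g * g for tensor in big for g in tensor))
small, _ = clip_grad_norm([[0.3], [0.4]], max_norm=1.0)
print(norm_before, round(norm_after, 4), small)
```

Because every gradient is multiplied by the same factor, the update direction is preserved and only its length is capped.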
+**MPS (Metal Performance Shaders).** Apple's GPU acceleration backend for
+PyTorch on M-series chips. The Apple Silicon equivalent of CUDA.
+**Pre-training.** The stage of training described here: minimize next-token
+prediction loss on a large corpus. Produces a *base model*.
+**SFT (Supervised Fine-Tuning).** A subsequent training stage on
+`(instruction, ideal response)` pairs. Teaches the model to follow
+instructions. Not done for this model.
+**RLHF (Reinforcement Learning from Human Feedback).** A further training
+stage using preference data. Aligns model behavior with human preferences.
+Not done for this model.
+---
+## Citation
+If this model or its companion code helped you, please cite or link to:
+```
+@misc{microgpt,
+  author = {Brett Lee Hary},
+  title  = {microGPT: a 1.35M-parameter transformer trained from scratch on TinyStories},
+  year   = {2026},
+  howpublished = {\url{https://huggingface.co/brettleehari/microgpt}},
+}
+```
+### Acknowledgements
+- Andrej Karpathy's [nanoGPT](https://github.com/karpathy/nanoGPT) — the
+  reference implementation that made this approachable.
+- Eldan & Li (2023), [TinyStories: How Small Can Language Models Be and Still Speak Coherent English?](https://arxiv.org/abs/2305.07759) — the dataset and the insight that data quality can substitute for model scale.
+- Vaswani et al. (2017), [Attention Is All You Need](https://arxiv.org/abs/1706.03762) — the original transformer.
+- The Hugging Face `transformers`, `tokenizers`, and `datasets` teams for
+  the infrastructure that makes projects like this trivial to share.

ckpt.pt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:6a503409e144a80c461d97b9462ee76236e663d54499afd6bb39ce1230c68f31
+size 5394041

inference.py ADDED

	@@ -0,0 +1,114 @@

+"""
+Inference helper for Nano-SLM.
+Wraps the model + tokenizer into a clean `generate()` function suitable for
+demos, notebooks, or a Gradio interface.
+Usage:
+    from inference import NanoSLMInference
+    slm = NanoSLMInference("out/ckpt.pt", "data/tokenizer.json")
+    text = slm.generate("Once upon a time", max_new_tokens=200, temperature=0.8)
+    print(text)
+"""
+import torch
+import torch.nn.functional as F
+from tokenizers import Tokenizer
+from model import NanoSLM
+# Must match the architecture used during training.
+DEFAULT_CFG = dict(
+    vocab_size=4096, d_model=128, n_heads=4, n_layers=4,
+    ffn_dim=512, ctx_len=256, dropout=0.0,
+)
+class NanoSLMInference:
+    def __init__(self, ckpt_path, tokenizer_path, device=None, cfg=None):
+        if device is None:
+            if torch.backends.mps.is_available():
+                device = "mps"
+            elif torch.cuda.is_available():
+                device = "cuda"
+            else:
+                device = "cpu"
+        self.device = device
+        self.tokenizer = Tokenizer.from_file(tokenizer_path)
+        cfg = cfg or DEFAULT_CFG
+        self.model = NanoSLM(**cfg)
+        ckpt = torch.load(ckpt_path, map_location=device)
+        # support both raw state_dicts and {"model": ...} checkpoints
+        state = ckpt["model"] if isinstance(ckpt, dict) and "model" in ckpt else ckpt
+        self.model.load_state_dict(state)
+        self.model.to(device).eval()
+        self.ctx_len = cfg["ctx_len"]
+    @torch.no_grad()
+    def generate(
+        self,
+        prompt: str,
+        max_new_tokens: int = 200,
+        temperature: float = 0.8,
+        top_k: int | None = 40,
+        top_p: float | None = None,
+        seed: int | None = None,
+    ) -> str:
+        """Generate continuation for a prompt.
+        Args:
+            prompt: input text
+            max_new_tokens: how many tokens to generate
+            temperature: 0 = greedy, 1.0 = no scaling, >1 = more random
+            top_k: keep only the k highest-prob tokens (None = no filter)
+            top_p: nucleus — keep smallest set with cumulative prob >= p
+            seed: for reproducibility
+        """
+        if seed is not None:
+            torch.manual_seed(seed)
+        ids = self.tokenizer.encode(prompt).ids
+        x = torch.tensor([ids], dtype=torch.long, device=self.device)
+        for _ in range(max_new_tokens):
+            # truncate context if it grows past ctx_len
+            x_cond = x[:, -self.ctx_len:]
+            logits, _ = self.model(x_cond)
+            # we only care about the prediction for the next token
+            logits = logits[:, -1, :]
+            if temperature == 0.0:
+                # greedy: pick the argmax
+                next_tok = logits.argmax(dim=-1, keepdim=True)
+            else:
+                logits = logits / temperature
+                if top_k is not None:
+                    v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
+                    logits[logits < v[:, [-1]]] = -float("inf")
+                if top_p is not None:
+                    sorted_logits, sorted_idx = torch.sort(logits, descending=True)
+                    cum_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
+                    # mask tokens past the nucleus
+                    mask = cum_probs > top_p
+                    # shift right so we always keep at least one token
+                    mask[..., 1:] = mask[..., :-1].clone()
+                    mask[..., 0] = False
+                    sorted_logits[mask] = -float("inf")
+                    # unsort back to original vocab order
+                    logits = torch.zeros_like(logits).scatter_(1, sorted_idx, sorted_logits)
+                probs = F.softmax(logits, dim=-1)
+                next_tok = torch.multinomial(probs, num_samples=1)
+            x = torch.cat([x, next_tok], dim=1)
+        return self.tokenizer.decode(x[0].tolist())
+if __name__ == "__main__":
+    # quick self-test
+    slm = NanoSLMInference("out/ckpt.pt", "data/tokenizer.json")
+    print(slm.generate("Once upon a time", max_new_tokens=100, temperature=0.8, top_k=40))

model.py ADDED

	@@ -0,0 +1,152 @@

+"""
+Nano-SLM: a tiny decoder-only transformer (~1.35M params).
+Architecture is intentionally minimal so every line is readable.
+Mirrors the standard GPT recipe: token + position embeddings, N stacked
+(causal self-attention -> MLP) blocks with pre-LayerNorm and residuals,
+final LayerNorm, and a tied LM head.
+"""
+import math
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+class CausalSelfAttention(nn.Module):
+    """Multi-head causal self-attention. Uses fused QKV and PyTorch's SDPA."""
+    def __init__(self, d_model, n_heads, dropout=0.1):
+        super().__init__()
+        assert d_model % n_heads == 0
+        self.n_heads = n_heads
+        self.head_dim = d_model // n_heads
+        # one big linear that produces Q, K, V at once
+        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
+        self.proj = nn.Linear(d_model, d_model, bias=False)
+        self.attn_dropout_p = dropout
+        self.resid_dropout = nn.Dropout(dropout)
+    def forward(self, x):
+        B, T, C = x.shape
+        q, k, v = self.qkv(x).split(C, dim=-1)
+        # reshape to (B, n_heads, T, head_dim)
+        q = q.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
+        k = k.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
+        v = v.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
+        # Flash/SDPA: causal mask + scaling handled internally
+        y = F.scaled_dot_product_attention(
+            q, k, v,
+            is_causal=True,
+            dropout_p=self.attn_dropout_p if self.training else 0.0,
+        )
+        y = y.transpose(1, 2).contiguous().view(B, T, C)
+        return self.resid_dropout(self.proj(y))
+class MLP(nn.Module):
+    """Position-wise feed-forward (GELU)."""
+    def __init__(self, d_model, ffn_dim, dropout=0.1):
+        super().__init__()
+        self.fc1 = nn.Linear(d_model, ffn_dim, bias=False)
+        self.fc2 = nn.Linear(ffn_dim, d_model, bias=False)
+        self.dropout = nn.Dropout(dropout)
+    def forward(self, x):
+        return self.dropout(self.fc2(F.gelu(self.fc1(x))))
+class Block(nn.Module):
+    """Pre-LN transformer block: x = x + attn(LN(x)); x = x + mlp(LN(x))."""
+    def __init__(self, d_model, n_heads, ffn_dim, dropout=0.1):
+        super().__init__()
+        self.ln1 = nn.LayerNorm(d_model)
+        self.attn = CausalSelfAttention(d_model, n_heads, dropout)
+        self.ln2 = nn.LayerNorm(d_model)
+        self.mlp = MLP(d_model, ffn_dim, dropout)
+    def forward(self, x):
+        x = x + self.attn(self.ln1(x))
+        x = x + self.mlp(self.ln2(x))
+        return x
+class NanoSLM(nn.Module):
+    def __init__(
+        self,
+        vocab_size=4096,
+        d_model=128,
+        n_heads=4,
+        n_layers=4,
+        ffn_dim=512,
+        ctx_len=256,
+        dropout=0.1,
+    ):
+        super().__init__()
+        self.ctx_len = ctx_len
+        self.tok_emb = nn.Embedding(vocab_size, d_model)
+        self.pos_emb = nn.Embedding(ctx_len, d_model)
+        self.drop = nn.Dropout(dropout)
+        self.blocks = nn.ModuleList(
+            [Block(d_model, n_heads, ffn_dim, dropout) for _ in range(n_layers)]
+        )
+        self.ln_f = nn.LayerNorm(d_model)
+        self.head = nn.Linear(d_model, vocab_size, bias=False)
+        # weight tying: input embedding and output projection share weights.
+        # saves a lot of params at small vocab sizes and usually helps quality.
+        self.head.weight = self.tok_emb.weight
+        self.apply(self._init_weights)
+        # scaled init for residual projections (GPT-2 trick)
+        for name, p in self.named_parameters():
+            if name.endswith("proj.weight") or name.endswith("fc2.weight"):
+                nn.init.normal_(p, mean=0.0, std=0.02 / math.sqrt(2 * n_layers))
+    def _init_weights(self, m):
+        if isinstance(m, nn.Linear):
+            nn.init.normal_(m.weight, mean=0.0, std=0.02)
+            if m.bias is not None:
+                nn.init.zeros_(m.bias)
+        elif isinstance(m, nn.Embedding):
+            nn.init.normal_(m.weight, mean=0.0, std=0.02)
+    def num_params(self, non_embedding=False):
+        n = sum(p.numel() for p in self.parameters())
+        if non_embedding:
+            n -= self.tok_emb.weight.numel()
+            n -= self.pos_emb.weight.numel()
+        return n
+    def forward(self, idx, targets=None):
+        B, T = idx.shape
+        assert T <= self.ctx_len, f"sequence length {T} > ctx_len {self.ctx_len}"
+        pos = torch.arange(T, device=idx.device)
+        x = self.drop(self.tok_emb(idx) + self.pos_emb(pos))
+        for block in self.blocks:
+            x = block(x)
+        x = self.ln_f(x)
+        logits = self.head(x)
+        loss = None
+        if targets is not None:
+            loss = F.cross_entropy(
+                logits.view(-1, logits.size(-1)),
+                targets.view(-1),
+                ignore_index=-100,
+            )
+        return logits, loss
+    @torch.no_grad()
+    def generate(self, idx, max_new_tokens, temperature=1.0, top_k=None):
+        """Autoregressive sampling. Slow on purpose: no KV cache (a great upgrade later)."""
+        for _ in range(max_new_tokens):
+            idx_cond = idx[:, -self.ctx_len:]
+            logits, _ = self(idx_cond)
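+            logits = logits[:, -1, :]
+            if temperature == 0.0:
+                # greedy decoding; also avoids dividing the logits by zero
+                next_tok = logits.argmax(dim=-1, keepdim=True)
+            else:
+                logits = logits / temperature
+                if top_k is not None:
+                    v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
+                    logits[logits < v[:, [-1]]] = -float("inf")
+                probs = F.softmax(logits, dim=-1)
+                next_tok = torch.multinomial(probs, num_samples=1)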
+            idx = torch.cat([idx, next_tok], dim=1)
+        return idx

tokenizer.json ADDED

The diff for this file is too large to render. See raw diff