---
license: apache-2.0
language:
- en
library_name: pytorch
tags:
- causal-lm
- pretrained-from-scratch
- small-lm
- gpt
datasets:
- roneneldan/TinyStories
- roneneldan/TinyStoriesInstruct
- wikimedia/wikipedia
- nampdn-ai/tiny-textbooks
pipeline_tag: text-generation
---
# tiny-38m

A 37.8M-parameter decoder-only transformer pretrained from scratch on a mix of small, simple-vocabulary corpora. Pure PyTorch, single GPU, no HF Trainer, no PEFT, no distillation.

Educational artifact: it demonstrates that the modern transformer recipe (RMSNorm + RoPE + SwiGLU + SDPA) reaches coherent output at small scale on a single GPU.
## Quick start

```python
import json, sys, torch
from pathlib import Path
from huggingface_hub import snapshot_download
from tokenizers import Tokenizer
from safetensors.torch import load_file

# Download the repo (weights, tokenizer, and the model/config source files).
local = snapshot_download("darthcrawl/tiny-38m")
sys.path.insert(0, local)
from config import ModelConfig
from model import GPT

# Build the config from config.json, keeping only fields ModelConfig defines.
cfg_dict = json.loads((Path(local) / "config.json").read_text())
valid = set(ModelConfig.__dataclass_fields__)
cfg = ModelConfig(**{k: v for k, v in cfg_dict.items() if k in valid})

model = GPT(cfg).eval()
model.load_state_dict(load_file(f"{local}/model.safetensors"), strict=False)

tok = Tokenizer.from_file(f"{local}/tokenizer.json")
eot = tok.token_to_id("<|endoftext|>")

ids = torch.tensor([tok.encode("Once upon a time, there was a small dragon").ids], dtype=torch.long)
out = model.generate(ids, max_new_tokens=200, temperature=0.8, top_k=200, eos_id=eot)
print(tok.decode(out[0].tolist()))
```

`strict=False` is required because the embeddings are tied (`lm_head.weight = tok_emb.weight`), so the shared tensor is stored only once in the checkpoint.
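
The tying itself is just parameter sharing. A minimal sketch of why the saved file holds one tensor fewer than the module defines (a toy module, not the repo's `model.py`; only the `tok_emb`/`lm_head` names come from the note above):

```python
import torch.nn as nn
from safetensors.torch import save_model

class TinyTiedLM(nn.Module):
    """Toy model with tied input embedding and output head."""
    def __init__(self, vocab_size: int = 8192, dim: int = 512):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, dim)
        self.lm_head = nn.Linear(dim, vocab_size, bias=False)
        self.lm_head.weight = self.tok_emb.weight  # share a single Parameter

m = TinyTiedLM()
print(m.lm_head.weight is m.tok_emb.weight)  # True

# safetensors stores shared tensors once, so the file ends up with a single
# weight for both attributes, hence strict=False when loading it back.
save_model(m, "tiny_tied.safetensors")
```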
## Architecture

| | |
|---|---|
| Type | Decoder-only transformer |
| Parameters | 37.8M |
| Layers | 8 |
| Hidden dim | 512 |
| Attention heads | 8 |
| Context length | 1024 |
| Vocab size | 8192 |
| Position encoding | RoPE |
| Norm | RMSNorm (pre-norm) |
| MLP | SwiGLU |
| Attention | PyTorch SDPA, causal |
| Embedding tying | Yes |
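
The real modules live in `model.py` in this repo. Purely as an illustration of the recipe the table describes, here is a minimal pre-norm block with RMSNorm, a SwiGLU MLP, and causal SDPA attention; RoPE is omitted, and all names and sizing choices are illustrative rather than the repo's code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        # Scale by the reciprocal root-mean-square of the features (no mean subtraction).
        return self.weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

class SwiGLU(nn.Module):
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden, bias=False)
        self.w_up = nn.Linear(dim, hidden, bias=False)
        self.w_down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        # SwiGLU MLP: silu(gate) * up, projected back to the model dim.
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

class Block(nn.Module):
    def __init__(self, dim: int = 512, n_heads: int = 8):
        super().__init__()
        self.n_heads = n_heads
        self.norm1 = RMSNorm(dim)
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.proj = nn.Linear(dim, dim, bias=False)
        self.norm2 = RMSNorm(dim)
        self.mlp = SwiGLU(dim, hidden=4 * dim * 2 // 3)  # illustrative SwiGLU sizing

    def forward(self, x):
        b, t, d = x.shape
        q, k, v = self.qkv(self.norm1(x)).chunk(3, dim=-1)
        q, k, v = (z.view(b, t, self.n_heads, d // self.n_heads).transpose(1, 2) for z in (q, k, v))
        # Causal attention via PyTorch SDPA; RoPE would rotate q and k right here.
        att = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        x = x + self.proj(att.transpose(1, 2).reshape(b, t, d))
        return x + self.mlp(self.norm2(x))

x = torch.randn(1, 16, 512)
print(Block()(x).shape)  # torch.Size([1, 16, 512])
```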
## Training

| | |
|---|---|
| Source mix | `tinystories:60,tinystories_instruct:15,simple_wiki:15,tiny_textbooks:10` |
| Total train tokens | 477,521,740 (~478M) |
| Best checkpoint step | 19500 |
| Best val loss | 1.8847 |
| Optimizer | AdamW (β = (0.9, 0.95), weight decay 0.1) |
| Peak LR | 0.0006 |
| LR schedule | Cosine, 200-step warmup |
| Batch size | 32 × grad accum 4 (effective 128) |
| Precision | bfloat16 (AMP) |
| Hardware | Single GPU |
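
The training loop itself lives in the upstream project, not this repo. A minimal sketch of the optimizer and schedule described in the table; the peak LR, warmup, betas, and weight decay come from the table, while the total-step count and minimum LR are assumptions for illustration:

```python
import math
import torch

def lr_at(step: int, peak_lr: float = 6e-4, warmup: int = 200,
          total_steps: int = 19500, min_lr: float = 6e-5) -> float:
    # Linear warmup to peak_lr, then cosine decay toward min_lr.
    if step < warmup:
        return peak_lr * (step + 1) / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * min(progress, 1.0)))

model = torch.nn.Linear(8, 8)  # stand-in for the GPT module
opt = torch.optim.AdamW(model.parameters(), lr=6e-4, betas=(0.9, 0.95), weight_decay=0.1)

for step in range(5):
    for group in opt.param_groups:
        group["lr"] = lr_at(step)  # set the scheduled LR before each optimizer step
```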
Mix format is `name:weight,...`. `meta.txt` in this repo is the canonical record.
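
A small helper for reading that format back into per-source weights (an illustration only; the upstream project has its own parser):

```python
def parse_mix(spec: str) -> dict[str, float]:
    """Parse 'name:weight,...' into {name: weight}."""
    out = {}
    for part in spec.split(","):
        name, weight = part.split(":")
        out[name.strip()] = float(weight)
    return out

mix = parse_mix("tinystories:60,tinystories_instruct:15,simple_wiki:15,tiny_textbooks:10")
total = sum(mix.values())
print({k: v / total for k, v in mix.items()})  # normalized sampling fractions
```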
## Tokenizer
Byte-level BPE trained on the same source mix. Single `tokenizer.json` (HuggingFace `tokenizers` format) with an 8,192-token vocabulary. Special tokens: `<|endoftext|>` (eot/eos) and `<|pad|>`.
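
The tokenizer-training script is part of the upstream project; a rough equivalent with the `tokenizers` library would look like this (the file names are placeholders, and settings other than the vocab size and special tokens are assumptions):

```python
from tokenizers import ByteLevelBPETokenizer

# Placeholder corpus files drawn from the same source mix.
files = ["tinystories.txt", "tinystories_instruct.txt", "simple_wiki.txt", "tiny_textbooks.txt"]

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=files,
    vocab_size=8192,
    special_tokens=["<|endoftext|>", "<|pad|>"],
)
tokenizer.save("tokenizer.json")  # single-file HuggingFace `tokenizers` format
```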
## What it can do

- Continue toddler-level English narratives in the TinyStories register.
- Produce short, factual-sounding text in the Simple English Wikipedia register.
- Follow basic prompt → story patterns from TinyStoriesInstruct.
## What it can't do

- General-knowledge QA, code, math, multi-turn chat, reasoning, or instruction following beyond what was in the training mix.
- Out-of-distribution vocabulary. The vocab is small and the corpus is intentionally narrow, so rare words fragment into many byte-level pieces (see the snippet below).
- Reliable factuality. Even on simple-wiki-style prompts it will confabulate.
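
A quick way to see the narrow vocabulary in action, reusing `tok` from the Quick start (the example sentence is arbitrary):

```python
enc = tok.encode("The electroencephalography seminar discussed quantum chromodynamics.")
print(len(enc.ids), enc.tokens)
# Out-of-mix terms split into many short byte-level pieces, which the model
# has rarely or never seen during training.
```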
## Intended use

Education, replication, ablations, and a baseline for from-scratch pretraining experiments. Not intended for downstream production use.
## Limitations and bias

Inherits whatever biases live in the synthetic TinyStories corpora and Simple English Wikipedia. Outputs are not safe for any user-facing application. No safety alignment, no instruction tuning, no RLHF.
## Reproducibility

Inference code (`model.py`, `config.py`, `sample.py`) ships in this repo. The full training pipeline (tokenizer training, data prep, training loop, source mixing) lives in the upstream project.
## License

Apache 2.0 for code and weights. Training data licenses follow their respective sources (see Datasets in metadata).