---
license: apache-2.0
language:
- en
library_name: pytorch
tags:
- causal-lm
- pretrained-from-scratch
- small-lm
- gpt
datasets:
- roneneldan/TinyStories
- roneneldan/TinyStoriesInstruct
- wikimedia/wikipedia
- nampdn-ai/tiny-textbooks
pipeline_tag: text-generation
---
# tiny-38m
A 37.8M-parameter decoder-only transformer pretrained from scratch on a mix of small, simple-vocabulary corpora. Pure PyTorch, single GPU, no HF Trainer, no PEFT, no distillation.
Educational artifact: it demonstrates that the modern transformer recipe (RMSNorm + RoPE + SwiGLU + SDPA) reaches coherent output at small scale on a single GPU.
## Quick start
```python
import json, sys, torch
from pathlib import Path
from huggingface_hub import snapshot_download
from tokenizers import Tokenizer
from safetensors.torch import load_file

# Download the full repo: weights, tokenizer, and the model/config source files.
local = snapshot_download("darthcrawl/tiny-38m")
sys.path.insert(0, local)
from config import ModelConfig
from model import GPT

# Rebuild the config, keeping only keys that ModelConfig actually defines.
cfg_dict = json.loads((Path(local) / "config.json").read_text())
valid = set(ModelConfig.__dataclass_fields__)
cfg = ModelConfig(**{k: v for k, v in cfg_dict.items() if k in valid})

# strict=False: the tied embedding is stored once (see note below).
model = GPT(cfg).eval()
model.load_state_dict(load_file(f"{local}/model.safetensors"), strict=False)

tok = Tokenizer.from_file(f"{local}/tokenizer.json")
eot = tok.token_to_id("<|endoftext|>")
ids = torch.tensor([tok.encode("Once upon a time, there was a small dragon").ids], dtype=torch.long)
out = model.generate(ids, max_new_tokens=200, temperature=0.8, top_k=200, eos_id=eot)
print(tok.decode(out[0].tolist()))
```
`strict=False` is needed because the embeddings are tied (`lm_head.weight = tok_emb.weight`) and the shared tensor is stored only once in the checkpoint; the loader reports the other name as missing, and the tying resolves it.
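For context, a minimal sketch of the tying pattern (class and field names are illustrative, not the repo's `model.py`):
```python
import torch.nn as nn

class TinyLM(nn.Module):
    def __init__(self, vocab_size: int = 8192, dim: int = 512):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, dim)
        self.lm_head = nn.Linear(dim, vocab_size, bias=False)
        # Tie the output projection to the input embedding: one tensor, two names.
        self.lm_head.weight = self.tok_emb.weight

m = TinyLM()
# Both names point at the same storage; safetensors serializes shared tensors
# under a single key, so the other name shows up as "missing" at load time.
assert m.lm_head.weight.data_ptr() == m.tok_emb.weight.data_ptr()
```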
## Architecture
| Field | Value |
|---|---|
| Type | Decoder-only transformer |
| Parameters | 37.8M |
| Layers | 8 |
| Hidden dim | 512 |
| Attention heads | 8 |
| Context length | 1024 |
| Vocab size | 8192 |
| Position encoding | RoPE |
| Norm | RMSNorm (pre-norm) |
| MLP | SwiGLU |
| Attention | PyTorch SDPA, causal |
| Embedding tying | Yes |
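For illustration, here is a minimal sketch of two of those pieces (RMSNorm and SwiGLU) in PyTorch. Names and the SwiGLU hidden width are assumptions, not read from the repo's `model.py` or `config.json`:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Rescale by the root-mean-square of the activations; no mean-centering, no bias."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        rms = x.pow(2).mean(-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

class SwiGLU(nn.Module):
    """Gated MLP: silu(w1(x)) * w3(x), projected back down by w2."""
    def __init__(self, dim: int = 512, hidden: int = 1376):
        # hidden ~ (8/3) * dim rounded up; an assumption, not the repo's value.
        super().__init__()
        self.w1 = nn.Linear(dim, hidden, bias=False)  # gate branch
        self.w3 = nn.Linear(dim, hidden, bias=False)  # value branch
        self.w2 = nn.Linear(hidden, dim, bias=False)  # down-projection

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))
```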
## Training
| Setting | Value |
|---|---|
| Source mix | `tinystories:60,tinystories_instruct:15,simple_wiki:15,tiny_textbooks:10` |
| Total train tokens | 477,521,740 |
| Best ckpt step | 19500 |
| Best val loss | 1.8847 |
| Optimizer | AdamW (β = (0.9, 0.95), weight decay 0.1) |
| Peak LR | 6e-4 |
| LR schedule | Cosine decay, 200-step warmup |
| Batch size | 32 × grad-accum 4 (128 effective) |
| Precision | bfloat16 (AMP) |
| Hardware | Single GPU |
Mix format is `name:weight,...`. `meta.txt` in this repo is the canonical record.
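For reference, the warmup-then-cosine schedule implied by the table, as a minimal sketch (the total step count and LR floor here are placeholders, not from this card; the actual loop lives in the upstream project):
```python
import math

def lr_at(step: int, peak_lr: float = 6e-4, warmup: int = 200,
          total_steps: int = 20_000, min_lr: float = 6e-5) -> float:
    # Linear warmup to the peak, then cosine decay to a floor.
    # total_steps and min_lr are assumptions for illustration only.
    if step < warmup:
        return peak_lr * (step + 1) / warmup
    t = min((step - warmup) / max(1, total_steps - warmup), 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * t))
```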
## Tokenizer
Byte-level BPE trained on the same source mix. Single `tokenizer.json` (HuggingFace `tokenizers` format) with an 8192-token vocabulary. Special tokens: `<|endoftext|>` (eot/eos), `<|pad|>`.
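A minimal sketch of training such a tokenizer with the `tokenizers` library; the corpus path and exact trainer settings are assumptions, the real pipeline is in the upstream project:
```python
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

tok = Tokenizer(models.BPE())
tok.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tok.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=8192,
    special_tokens=["<|endoftext|>", "<|pad|>"],
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
)
tok.train(files=["corpus.txt"], trainer=trainer)  # corpus.txt is a placeholder
tok.save("tokenizer.json")
```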
## What it can do
- Continue toddler-level English narratives in TinyStories register.
- Produce short factual-sounding text in the simple-Wikipedia register.
- Follow basic prompt → story patterns from TinyStoriesInstruct (see the sketch below).
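As a usage illustration, continuing the quick-start session with an instruct-style prompt. The field layout below is an assumption modeled loosely on TinyStoriesInstruct; check the dataset card for the actual template:
```python
# Reuses `model`, `tok`, `eot`, and the torch import from the quick start above.
prompt = "Words: dragon, cave, friend\nSummary: a small dragon finds a friend.\nStory:"
ids = torch.tensor([tok.encode(prompt).ids], dtype=torch.long)
out = model.generate(ids, max_new_tokens=200, temperature=0.8, top_k=200, eos_id=eot)
print(tok.decode(out[0].tolist()))
```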
## What it can't do
- General-knowledge QA, code, math, multi-turn chat, reasoning, or instruction-following beyond the patterns in the training mix.
- Out-of-distribution vocabulary: the vocab is small and the corpus is intentionally narrow.
- Reliable factuality: even on simple-wiki-style prompts it will confabulate.
## Intended use
Education, replication, ablations, and a baseline for from-scratch pretraining experiments. Not for production use.
## Limitations and bias
Inherits whatever biases live in the synthetic TinyStories corpora and Simple English Wikipedia. Outputs are not safe for any user-facing application. No safety alignment, no instruction tuning, no RLHF.
## Reproducibility
Inference code (`model.py`, `config.py`, `sample.py`) ships in this repo. Full training pipeline (tokenizer, data prep, training loop, source mixing) is in the upstream project.
## License
Apache 2.0 for code and weights. Training data licenses follow their respective sources (see Datasets in metadata).