---
license: apache-2.0
language:
- en
library_name: pytorch
tags:
- causal-lm
- pretrained-from-scratch
- small-lm
- gpt
datasets:
- roneneldan/TinyStories
- roneneldan/TinyStoriesInstruct
- wikimedia/wikipedia
- nampdn-ai/tiny-textbooks
pipeline_tag: text-generation
---
# tiny-38m
A 37.8M-parameter decoder-only transformer pretrained from scratch on a mix of small, simple-vocabulary corpora. Pure PyTorch, single GPU, no HF Trainer, no PEFT, no distillation.

Educational artifact: it demonstrates that the modern transformer recipe (RMSNorm + RoPE + SwiGLU + SDPA) reaches coherent output at small scale on a single GPU.
## Quick start
```python
import json, sys, torch
from pathlib import Path
from huggingface_hub import snapshot_download
from tokenizers import Tokenizer
from safetensors.torch import load_file

# Download the repo and make its bundled model/config modules importable.
local = snapshot_download("darthcrawl/tiny-38m")
sys.path.insert(0, local)
from config import ModelConfig
from model import GPT

# Build the config, keeping only keys that ModelConfig actually defines.
cfg_dict = json.loads((Path(local) / "config.json").read_text())
valid = set(ModelConfig.__dataclass_fields__)
cfg = ModelConfig(**{k: v for k, v in cfg_dict.items() if k in valid})

model = GPT(cfg).eval()
model.load_state_dict(load_file(f"{local}/model.safetensors"), strict=False)

tok = Tokenizer.from_file(f"{local}/tokenizer.json")
eot = tok.token_to_id("<|endoftext|>")

ids = torch.tensor([tok.encode("Once upon a time, there was a small dragon").ids], dtype=torch.long)
out = model.generate(ids, max_new_tokens=200, temperature=0.8, top_k=200, eos_id=eot)
print(tok.decode(out[0].tolist()))
```
`strict=False` is required because the embeddings are tied (`lm_head.weight = tok_emb.weight`), so the shared tensor is stored only once in the checkpoint.
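For context, the tying presumably follows the standard pattern sketched below (field names are assumptions, not the repo's exact code):

```python
import torch.nn as nn

class GPT(nn.Module):  # illustrative skeleton only
    def __init__(self, cfg):
        super().__init__()
        self.tok_emb = nn.Embedding(cfg.vocab_size, cfg.dim)
        self.lm_head = nn.Linear(cfg.dim, cfg.vocab_size, bias=False)
        # One tensor, two views: safetensors serializes shared tensors once,
        # so the checkpoint has no separate lm_head.weight entry -- hence
        # strict=False at load time.
        self.lm_head.weight = self.tok_emb.weight
```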
## Architecture

| Component | Value |
|---|---|
| Type | Decoder-only transformer |
| Parameters | 37.8M |
| Layers | 8 |
| Hidden dim | 512 |
| Attention heads | 8 |
| Context length | 1024 |
| Vocab size | 8192 |
| Position encoding | RoPE |
| Norm | RMSNorm (pre-norm) |
| MLP | SwiGLU |
| Attention | PyTorch SDPA, causal |
| Embedding tying | Yes |
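For intuition about the RMSNorm and SwiGLU rows, here is a minimal sketch of both modules as commonly implemented (names and shapes are assumptions; the shipped `model.py` is the reference):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        # Scale by the inverse root-mean-square of the features; unlike
        # LayerNorm there is no mean subtraction and no bias.
        return self.weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

class SwiGLU(nn.Module):
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden, bias=False)
        self.w_up = nn.Linear(dim, hidden, bias=False)
        self.w_down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        # Gated MLP: silu(W_gate x) gates W_up x elementwise, then projects back down.
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```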
## Training

| Setting | Value |
|---|---|
| Source mix | `tinystories:60,tinystories_instruct:15,simple_wiki:15,tiny_textbooks:10` |
| Total train tokens | 477,521,740 |
| Best checkpoint step | 19,500 |
| Best val loss | 1.8847 |
| Optimizer | AdamW (β = (0.9, 0.95), weight decay 0.1) |
| Peak LR | 6e-4 |
| LR schedule | Cosine, 200-step warmup |
| Batch size | 32 (× grad accum 4 → effective 128) |
| Precision | bfloat16 (AMP) |
| Hardware | Single GPU |
The mix format is `name:weight,...`. `meta.txt` in this repo is the canonical record.
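Parsing that format is straightforward; a hypothetical helper (the real data-prep code lives upstream):

```python
def parse_mix(spec: str) -> dict[str, float]:
    # "name:weight,..." -> normalized sampling probabilities per source.
    weights = {name: float(w) for name, w in (item.split(":") for item in spec.split(","))}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

parse_mix("tinystories:60,tinystories_instruct:15,simple_wiki:15,tiny_textbooks:10")
# {'tinystories': 0.6, 'tinystories_instruct': 0.15, 'simple_wiki': 0.15, 'tiny_textbooks': 0.1}
```

The LR schedule in the table is the usual warmup-then-cosine shape. A sketch, assuming decay runs to the last recorded step with a floor of zero (both assumptions; `meta.txt` is canonical):

```python
import math

def lr_at(step: int, peak_lr: float = 6e-4, warmup: int = 200,
          total_steps: int = 19_500, min_lr: float = 0.0) -> float:
    # Linear warmup to peak_lr, then a single cosine decay toward min_lr.
    if step < warmup:
        return peak_lr * (step + 1) / warmup
    progress = min(1.0, (step - warmup) / max(1, total_steps - warmup))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```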
## Tokenizer

Byte-level BPE trained on the same source mix. Ships as a single `tokenizer.json` (Hugging Face `tokenizers` format) with a vocab size of 8192. Special tokens: `<|endoftext|>` (eot/eos) and `<|pad|>`.
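A plausible reconstruction of how such a tokenizer is trained with the `tokenizers` library (the actual script is in the upstream project; the corpus path is a placeholder):

```python
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

tok = Tokenizer(models.BPE())
tok.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tok.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=8192,
    special_tokens=["<|endoftext|>", "<|pad|>"],
)
tok.train(files=["mix_sample.txt"], trainer=trainer)  # placeholder corpus file
tok.save("tokenizer.json")
```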
## What it can do
- Continue toddler-level English narratives in TinyStories register.
- Produce short factual-sounding text in the simple-Wikipedia register.
- Follow basic prompt → story patterns from TinyStoriesInstruct.
## What it can't do
- General-knowledge QA, code, math, multi-turn chat, reasoning, instructions beyond what was in the training mix.
- Out-of-distribution vocabulary. Vocab is small and the corpus is intentionally narrow.
- Reliable factuality. Even on simple-wiki-style prompts it will confabulate.
## Intended use
Education, replication, ablations, and a baseline for from-scratch pretraining experiments. Not intended for production use.
## Limitations and bias
Inherits whatever biases live in the synthetic TinyStories corpora and Simple English Wikipedia. Outputs are not safe for any user-facing application. No safety alignment, no instruction tuning, no RLHF.
## Reproducibility
Inference code (`model.py`, `config.py`, `sample.py`) ships in this repo. The full training pipeline (tokenizer training, data prep, training loop, source mixing) lives in the upstream project.
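`sample.py` is canonical, but `generate` is presumably the standard autoregressive loop with temperature scaling and top-k filtering. A minimal sketch, assuming the forward pass returns logits of shape (batch, seq, vocab) and batch size 1:

```python
import torch

@torch.no_grad()
def generate(model, ids, max_new_tokens, temperature=0.8, top_k=200, eos_id=None):
    for _ in range(max_new_tokens):
        logits = model(ids[:, -1024:])[:, -1, :] / temperature  # crop to the 1024-token context
        if top_k is not None:
            kth = torch.topk(logits, top_k).values[:, -1, None]
            logits[logits < kth] = float("-inf")  # keep only the top-k logits
        next_id = torch.multinomial(torch.softmax(logits, dim=-1), num_samples=1)
        ids = torch.cat([ids, next_id], dim=1)
        if eos_id is not None and next_id.item() == eos_id:
            break
    return ids
```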
## License
Apache 2.0 for code and weights. Training data licenses follow their respective sources (see Datasets in metadata).