---
license: apache-2.0
language:
- en
library_name: pytorch
tags:
- causal-lm
- pretrained-from-scratch
- small-lm
- gpt
datasets:
- roneneldan/TinyStories
- roneneldan/TinyStoriesInstruct
- wikimedia/wikipedia
- nampdn-ai/tiny-textbooks
pipeline_tag: text-generation
---

# tiny-38m

A 37.8M-parameter decoder-only transformer pretrained from scratch on a mix of small, simple-vocabulary corpora. Pure PyTorch, single GPU, no HF Trainer, no PEFT, no distillation.

An educational artifact: it demonstrates that the modern transformer recipe (RMSNorm + RoPE + SwiGLU + SDPA) reaches coherent output at small scale on a single GPU.

## Quick start

```python
import json, sys, torch
from pathlib import Path
from huggingface_hub import snapshot_download
from tokenizers import Tokenizer
from safetensors.torch import load_file

# Download the repo and make its model/config modules importable.
local = snapshot_download("darthcrawl/tiny-38m")
sys.path.insert(0, local)
from config import ModelConfig
from model import GPT

# Build the config from config.json, keeping only fields ModelConfig knows about.
cfg_dict = json.loads((Path(local) / "config.json").read_text())
valid = {f for f in ModelConfig.__dataclass_fields__}
cfg = ModelConfig(**{k: v for k, v in cfg_dict.items() if k in valid})

model = GPT(cfg).eval()
model.load_state_dict(load_file(f"{local}/model.safetensors"), strict=False)

tok = Tokenizer.from_file(f"{local}/tokenizer.json")
eot = tok.token_to_id("<|endoftext|>")

ids = torch.tensor(
    [tok.encode("Once upon a time, there was a small dragon").ids],
    dtype=torch.long,
)
out = model.generate(ids, max_new_tokens=200, temperature=0.8, top_k=200, eos_id=eot)
print(tok.decode(out[0].tolist()))
```

`strict=False` is required because the tied embeddings (`lm_head.weight = tok_emb.weight`) are stored only once in the checkpoint.

## Architecture

| | |
|---|---|
| Type | Decoder-only transformer |
| Parameters | 37.8M |
| Layers | 8 |
| Hidden dim | 512 |
| Attention heads | 8 |
| Context length | 1024 |
| Vocab size | 8192 |
| Position encoding | RoPE |
| Norm | RMSNorm (pre-norm) |
| MLP | SwiGLU |
| Attention | PyTorch SDPA, causal |
| Embedding tying | Yes |

## Training

| | |
|---|---|
| Source mix | `tinystories:60,tinystories_instruct:15,simple_wiki:15,tiny_textbooks:10` |
| Total train tokens | 477,521,740 |
| Best ckpt step | 19500 |
| Best val loss | 1.8847 |
| Optimizer | AdamW (β=(0.9, 0.95), wd=0.1) |
| Peak LR | 0.0006 |
| LR schedule | Cosine, 200-step warmup |
| Batch size | 32 × grad_accum 4 |
| Precision | bfloat16 (AMP) |
| Hardware | Single GPU |

Mix format is `name:weight,...`. `meta.txt` in this repo is the canonical record.

## Tokenizer

Byte-level BPE trained on the same source mix. Single `tokenizer.json` (HuggingFace `tokenizers` format), 8192-entry vocabulary. Special tokens: `<|endoftext|>` (eot/eos), `<|pad|>`.

## What it can do

- Continue toddler-level English narratives in the TinyStories register.
- Produce short factual-sounding text in the simple-Wikipedia register.
- Follow basic prompt → story patterns from TinyStoriesInstruct.

## What it can't do

- General-knowledge QA, code, math, multi-turn chat, reasoning, or instructions beyond what was in the training mix.
- Out-of-distribution vocabulary: the vocab is small and the corpus is intentionally narrow.
- Reliable factuality: even on simple-wiki-style prompts it will confabulate.

## Intended use

Education, replication, ablations, and a baseline for from-scratch pretraining experiments. Not for downstream production use.

## Limitations and bias

Inherits whatever biases live in the synthetic TinyStories corpora and Simple English Wikipedia. Outputs are not safe for any user-facing application. No safety alignment, no instruction tuning, no RLHF.
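## Parameter count (sanity check)

A rough check that the Architecture table adds up to ≈37.8M parameters. The SwiGLU hidden size (assumed here to be 4 × the hidden dim) and the absence of bias terms are illustrative assumptions, not values read from `config.json`; the shipped config is authoritative.

```python
# Back-of-the-envelope parameter count from the Architecture table.
# Assumptions (not stated in this card): SwiGLU hidden size = 4 * d_model, no biases.
d_model, n_layers, vocab = 512, 8, 8192
mlp_hidden = 4 * d_model                       # assumption; the real value is in config.json

emb = vocab * d_model                          # token embedding, tied with lm_head (stored once)
attn = 4 * d_model * d_model                   # Wq, Wk, Wv, Wo
mlp = 3 * d_model * mlp_hidden                 # gate, up, down projections of SwiGLU
norms = 2 * d_model                            # two pre-norm RMSNorm weight vectors per block
per_layer = attn + mlp + norms

total = emb + n_layers * per_layer + d_model   # + final RMSNorm
print(f"{total / 1e6:.1f}M parameters")        # ≈ 37.8M under these assumptions
```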
## Reproducibility

Inference code (`model.py`, `config.py`, `sample.py`) ships in this repo. The full training pipeline (tokenizer, data prep, training loop, source mixing) is in the upstream project.

## License

Apache 2.0 for code and weights. Training data licenses follow their respective sources (see Datasets in metadata).
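## Appendix: minimal decode loop (sketch)

For readers poking at the inference code, this is roughly what the `generate` call in the Quick start does: temperature-scaled, top-k-filtered sampling that stops at `<|endoftext|>`. It is an illustrative re-implementation, not the repo's `model.generate` or `sample.py`; it assumes the forward pass returns logits of shape `[batch, seq, vocab]` and crops the context to the 1024-token window.

```python
import torch

@torch.no_grad()
def sample_sketch(model, ids, max_new_tokens=200, temperature=0.8, top_k=200, eos_id=None):
    # Illustrative decode loop; the shipped model.generate / sample.py are canonical.
    for _ in range(max_new_tokens):
        logits = model(ids[:, -1024:])           # assumes forward() returns [B, T, vocab] logits
        logits = logits[:, -1, :] / temperature  # next-token distribution from the last position
        if top_k is not None:
            kth = torch.topk(logits, top_k).values[:, -1, None]
            logits = logits.masked_fill(logits < kth, float("-inf"))  # keep only the top-k logits
        probs = torch.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)
        ids = torch.cat([ids, next_id], dim=1)
        if eos_id is not None and (next_id == eos_id).all():
            break
    return ids
```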