---
license: apache-2.0
language:
- en
library_name: pytorch
tags:
- causal-lm
- pretrained-from-scratch
- small-lm
- gpt
datasets:
- roneneldan/TinyStories
- roneneldan/TinyStoriesInstruct
- wikimedia/wikipedia
- nampdn-ai/tiny-textbooks
pipeline_tag: text-generation
---
# tiny-38m

A 37.8M-parameter decoder-only transformer pretrained from scratch on a mix of small, simple-vocabulary corpora. Pure PyTorch, single GPU, no HF Trainer, no PEFT, no distillation.

Educational artifact: it demonstrates that the modern transformer recipe (RMSNorm + RoPE + SwiGLU + SDPA) reaches coherent output at small scale on a single GPU.
## Quick start

```python
import json, sys, torch
from pathlib import Path
from huggingface_hub import snapshot_download
from tokenizers import Tokenizer
from safetensors.torch import load_file

# Download the repo (weights, tokenizer, and the model/config source files).
local = snapshot_download("darthcrawl/tiny-38m")
sys.path.insert(0, local)
from config import ModelConfig
from model import GPT

# Build the config from config.json, keeping only fields ModelConfig defines.
cfg_dict = json.loads((Path(local) / "config.json").read_text())
valid = set(ModelConfig.__dataclass_fields__)
cfg = ModelConfig(**{k: v for k, v in cfg_dict.items() if k in valid})

model = GPT(cfg).eval()
model.load_state_dict(load_file(f"{local}/model.safetensors"), strict=False)

tok = Tokenizer.from_file(f"{local}/tokenizer.json")
eot = tok.token_to_id("<|endoftext|>")

ids = torch.tensor([tok.encode("Once upon a time, there was a small dragon").ids], dtype=torch.long)
out = model.generate(ids, max_new_tokens=200, temperature=0.8, top_k=200, eos_id=eot)
print(tok.decode(out[0].tolist()))
```

`strict=False` is required because the embeddings are tied (`lm_head.weight = tok_emb.weight`), so the shared tensor is stored only once in the checkpoint.
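
The tying itself is just parameter sharing. A minimal sketch of why the saved file holds one tensor fewer than the module defines (a toy module, not the repo's `model.py`; only the `tok_emb`/`lm_head` names come from the note above):

```python
import torch.nn as nn
from safetensors.torch import save_model

class TinyTiedLM(nn.Module):
    """Toy model with tied input embedding and output head."""
    def __init__(self, vocab_size: int = 8192, dim: int = 512):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, dim)
        self.lm_head = nn.Linear(dim, vocab_size, bias=False)
        self.lm_head.weight = self.tok_emb.weight  # share a single Parameter

m = TinyTiedLM()
print(m.lm_head.weight is m.tok_emb.weight)  # True

# safetensors stores shared tensors once, so the file ends up with a single
# weight for both attributes, hence strict=False when loading it back.
save_model(m, "tiny_tied.safetensors")
```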
## Architecture

| | |
|---|---|
| Type | Decoder-only transformer |
| Parameters | 37.8M |
| Layers | 8 |
| Hidden dim | 512 |
| Attention heads | 8 |
| Context length | 1024 |
| Vocab size | 8192 |
| Position encoding | RoPE |
| Norm | RMSNorm (pre-norm) |
| MLP | SwiGLU |
| Attention | PyTorch SDPA, causal |
| Embedding tying | Yes |
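
The real modules live in `model.py` in this repo. Purely as an illustration of the recipe the table describes, here is a minimal pre-norm block with RMSNorm, a SwiGLU MLP, and causal SDPA attention; RoPE is omitted, and all names and sizing choices are illustrative rather than the repo's code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        # Scale by the reciprocal root-mean-square of the features (no mean subtraction).
        return self.weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

class SwiGLU(nn.Module):
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden, bias=False)
        self.w_up = nn.Linear(dim, hidden, bias=False)
        self.w_down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        # SwiGLU MLP: silu(gate) * up, projected back to the model dim.
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

class Block(nn.Module):
    def __init__(self, dim: int = 512, n_heads: int = 8):
        super().__init__()
        self.n_heads = n_heads
        self.norm1 = RMSNorm(dim)
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.proj = nn.Linear(dim, dim, bias=False)
        self.norm2 = RMSNorm(dim)
        self.mlp = SwiGLU(dim, hidden=4 * dim * 2 // 3)  # illustrative SwiGLU sizing

    def forward(self, x):
        b, t, d = x.shape
        q, k, v = self.qkv(self.norm1(x)).chunk(3, dim=-1)
        q, k, v = (z.view(b, t, self.n_heads, d // self.n_heads).transpose(1, 2) for z in (q, k, v))
        # Causal attention via PyTorch SDPA; RoPE would rotate q and k right here.
        att = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        x = x + self.proj(att.transpose(1, 2).reshape(b, t, d))
        return x + self.mlp(self.norm2(x))

x = torch.randn(1, 16, 512)
print(Block()(x).shape)  # torch.Size([1, 16, 512])
```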
## Training

| | |
|---|---|
| Source mix | `tinystories:60,tinystories_instruct:15,simple_wiki:15,tiny_textbooks:10` |
| Total train tokens | 477,521,740 (~478M) |
| Best checkpoint step | 19500 |
| Best val loss | 1.8847 |
| Optimizer | AdamW (β = (0.9, 0.95), weight decay 0.1) |
| Peak LR | 0.0006 |
| LR schedule | Cosine, 200-step warmup |
| Batch size | 32 × grad accum 4 (effective 128) |
| Precision | bfloat16 (AMP) |
| Hardware | Single GPU |
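
The training loop itself lives in the upstream project, not this repo. A minimal sketch of the optimizer and schedule described in the table; the peak LR, warmup, betas, and weight decay come from the table, while the total-step count and minimum LR are assumptions for illustration:

```python
import math
import torch

def lr_at(step: int, peak_lr: float = 6e-4, warmup: int = 200,
          total_steps: int = 19500, min_lr: float = 6e-5) -> float:
    # Linear warmup to peak_lr, then cosine decay toward min_lr.
    if step < warmup:
        return peak_lr * (step + 1) / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * min(progress, 1.0)))

model = torch.nn.Linear(8, 8)  # stand-in for the GPT module
opt = torch.optim.AdamW(model.parameters(), lr=6e-4, betas=(0.9, 0.95), weight_decay=0.1)

for step in range(5):
    for group in opt.param_groups:
        group["lr"] = lr_at(step)  # set the scheduled LR before each optimizer step
```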
Mix format is `name:weight,...`. `meta.txt` in this repo is the canonical record.
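
A small helper for reading that format back into per-source weights (an illustration only; the upstream project has its own parser):

```python
def parse_mix(spec: str) -> dict[str, float]:
    """Parse 'name:weight,...' into {name: weight}."""
    out = {}
    for part in spec.split(","):
        name, weight = part.split(":")
        out[name.strip()] = float(weight)
    return out

mix = parse_mix("tinystories:60,tinystories_instruct:15,simple_wiki:15,tiny_textbooks:10")
total = sum(mix.values())
print({k: v / total for k, v in mix.items()})  # normalized sampling fractions
```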
## Tokenizer
Byte-level BPE trained on the same source mix. Single `tokenizer.json` (HuggingFace `tokenizers` format) with an 8,192-token vocabulary. Special tokens: `<|endoftext|>` (eot/eos) and `<|pad|>`.
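
The tokenizer-training script is part of the upstream project; a rough equivalent with the `tokenizers` library would look like this (the file names are placeholders, and settings other than the vocab size and special tokens are assumptions):

```python
from tokenizers import ByteLevelBPETokenizer

# Placeholder corpus files drawn from the same source mix.
files = ["tinystories.txt", "tinystories_instruct.txt", "simple_wiki.txt", "tiny_textbooks.txt"]

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=files,
    vocab_size=8192,
    special_tokens=["<|endoftext|>", "<|pad|>"],
)
tokenizer.save("tokenizer.json")  # single-file HuggingFace `tokenizers` format
```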
## What it can do

- Continue toddler-level English narratives in the TinyStories register.
- Produce short, factual-sounding text in the Simple English Wikipedia register.
- Follow basic prompt → story patterns from TinyStoriesInstruct.
## What it can't do

- General-knowledge QA, code, math, multi-turn chat, reasoning, or instruction following beyond what was in the training mix.
- Out-of-distribution vocabulary. The vocab is small and the corpus is intentionally narrow, so rare words fragment into many byte-level pieces (see the snippet below).
- Reliable factuality. Even on simple-wiki-style prompts it will confabulate.
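
A quick way to see the narrow vocabulary in action, reusing `tok` from the Quick start (the example sentence is arbitrary):

```python
enc = tok.encode("The electroencephalography seminar discussed quantum chromodynamics.")
print(len(enc.ids), enc.tokens)
# Out-of-mix terms split into many short byte-level pieces, which the model
# has rarely or never seen during training.
```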
## Intended use

Education, replication, ablations, and a baseline for from-scratch pretraining experiments. Not intended for downstream production use.
## Limitations and bias

Inherits whatever biases live in the synthetic TinyStories corpora and Simple English Wikipedia. Outputs are not safe for any user-facing application. No safety alignment, no instruction tuning, no RLHF.
## Reproducibility

Inference code (`model.py`, `config.py`, `sample.py`) ships in this repo. The full training pipeline (tokenizer training, data prep, training loop, source mixing) lives in the upstream project.
## License

Apache 2.0 for code and weights. Training data licenses follow their respective sources (see Datasets in metadata).