---
license: apache-2.0
language:
- en
library_name: pytorch
tags:
- causal-lm
- pretrained-from-scratch
- small-lm
- gpt
datasets:
- roneneldan/TinyStories
- roneneldan/TinyStoriesInstruct
- wikimedia/wikipedia
- nampdn-ai/tiny-textbooks
pipeline_tag: text-generation
---
# tiny-38m
A 37.8M-parameter decoder-only transformer pretrained from scratch on a mix of small, simple-vocabulary corpora. Pure PyTorch, single GPU, no HF Trainer, no PEFT, no distillation.

Educational artifact: it demonstrates that the modern transformer recipe (RMSNorm + RoPE + SwiGLU + SDPA) reaches coherent output at small scale on a single GPU.
## Quick start
```python
import json, sys, torch
from pathlib import Path
from huggingface_hub import snapshot_download
from tokenizers import Tokenizer
from safetensors.torch import load_file

# Download the repo and make its bundled model/config modules importable.
local = snapshot_download("darthcrawl/tiny-38m")
sys.path.insert(0, local)
from config import ModelConfig
from model import GPT

# Build the config, keeping only keys that ModelConfig actually defines.
cfg_dict = json.loads((Path(local) / "config.json").read_text())
valid = set(ModelConfig.__dataclass_fields__)
cfg = ModelConfig(**{k: v for k, v in cfg_dict.items() if k in valid})

model = GPT(cfg).eval()
model.load_state_dict(load_file(f"{local}/model.safetensors"), strict=False)

tok = Tokenizer.from_file(f"{local}/tokenizer.json")
eot = tok.token_to_id("<|endoftext|>")

ids = torch.tensor([tok.encode("Once upon a time, there was a small dragon").ids], dtype=torch.long)
out = model.generate(ids, max_new_tokens=200, temperature=0.8, top_k=200, eos_id=eot)
print(tok.decode(out[0].tolist()))
```
`strict=False` is required because the embeddings are tied (`lm_head.weight = tok_emb.weight`), so the shared tensor is stored only once in the checkpoint.
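For context, the tying presumably follows the standard pattern sketched below (field names are assumptions, not the repo's exact code):

```python
import torch.nn as nn

class GPT(nn.Module):  # illustrative skeleton only
    def __init__(self, cfg):
        super().__init__()
        self.tok_emb = nn.Embedding(cfg.vocab_size, cfg.dim)
        self.lm_head = nn.Linear(cfg.dim, cfg.vocab_size, bias=False)
        # One tensor, two views: safetensors serializes shared tensors once,
        # so the checkpoint has no separate lm_head.weight entry -- hence
        # strict=False at load time.
        self.lm_head.weight = self.tok_emb.weight
```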
## Architecture

| Component | Value |
|---|---|
| Type | Decoder-only transformer |
| Parameters | 37.8M |
| Layers | 8 |
| Hidden dim | 512 |
| Attention heads | 8 |
| Context length | 1024 |
| Vocab size | 8192 |
| Position encoding | RoPE |
| Norm | RMSNorm (pre-norm) |
| MLP | SwiGLU |
| Attention | PyTorch SDPA, causal |
| Embedding tying | Yes |
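For intuition about the RMSNorm and SwiGLU rows, here is a minimal sketch of both modules as commonly implemented (names and shapes are assumptions; the shipped `model.py` is the reference):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        # Scale by the inverse root-mean-square of the features; unlike
        # LayerNorm there is no mean subtraction and no bias.
        return self.weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

class SwiGLU(nn.Module):
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden, bias=False)
        self.w_up = nn.Linear(dim, hidden, bias=False)
        self.w_down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        # Gated MLP: silu(W_gate x) gates W_up x elementwise, then projects back down.
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```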
## Training

| Setting | Value |
|---|---|
| Source mix | `tinystories:60,tinystories_instruct:15,simple_wiki:15,tiny_textbooks:10` |
| Total train tokens | 477,521,740 |
| Best checkpoint step | 19,500 |
| Best val loss | 1.8847 |
| Optimizer | AdamW (β = (0.9, 0.95), weight decay 0.1) |
| Peak LR | 6e-4 |
| LR schedule | Cosine, 200-step warmup |
| Batch size | 32 (× grad accum 4 → effective 128) |
| Precision | bfloat16 (AMP) |
| Hardware | Single GPU |
The mix format is `name:weight,...`. `meta.txt` in this repo is the canonical record.
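Parsing that format is straightforward; a hypothetical helper (the real data-prep code lives upstream):

```python
def parse_mix(spec: str) -> dict[str, float]:
    # "name:weight,..." -> normalized sampling probabilities per source.
    weights = {name: float(w) for name, w in (item.split(":") for item in spec.split(","))}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

parse_mix("tinystories:60,tinystories_instruct:15,simple_wiki:15,tiny_textbooks:10")
# {'tinystories': 0.6, 'tinystories_instruct': 0.15, 'simple_wiki': 0.15, 'tiny_textbooks': 0.1}
```

The LR schedule in the table is the usual warmup-then-cosine shape. A sketch, assuming decay runs to the last recorded step with a floor of zero (both assumptions; `meta.txt` is canonical):

```python
import math

def lr_at(step: int, peak_lr: float = 6e-4, warmup: int = 200,
          total_steps: int = 19_500, min_lr: float = 0.0) -> float:
    # Linear warmup to peak_lr, then a single cosine decay toward min_lr.
    if step < warmup:
        return peak_lr * (step + 1) / warmup
    progress = min(1.0, (step - warmup) / max(1, total_steps - warmup))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```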
## Tokenizer

Byte-level BPE trained on the same source mix. Ships as a single `tokenizer.json` (Hugging Face `tokenizers` format) with a vocab size of 8192. Special tokens: `<|endoftext|>` (eot/eos) and `<|pad|>`.
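A plausible reconstruction of how such a tokenizer is trained with the `tokenizers` library (the actual script is in the upstream project; the corpus path is a placeholder):

```python
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

tok = Tokenizer(models.BPE())
tok.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tok.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=8192,
    special_tokens=["<|endoftext|>", "<|pad|>"],
)
tok.train(files=["mix_sample.txt"], trainer=trainer)  # placeholder corpus file
tok.save("tokenizer.json")
```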
## What it can do
- Continue toddler-level English narratives in TinyStories register.
- Produce short factual-sounding text in the simple-Wikipedia register.
- Follow basic prompt → story patterns from TinyStoriesInstruct.
## What it can't do
- General-knowledge QA, code, math, multi-turn chat, reasoning, instructions beyond what was in the training mix.
- Out-of-distribution vocabulary. Vocab is small and the corpus is intentionally narrow.
- Reliable factuality. Even on simple-wiki-style prompts it will confabulate.
## Intended use
Education, replication, ablations, and a baseline for from-scratch pretraining experiments. Not intended for production use.
## Limitations and bias
Inherits whatever biases live in the synthetic TinyStories corpora and Simple English Wikipedia. Outputs are not safe for any user-facing application. No safety alignment, no instruction tuning, no RLHF.
## Reproducibility
Inference code (`model.py`, `config.py`, `sample.py`) ships in this repo. The full training pipeline (tokenizer training, data prep, training loop, source mixing) lives in the upstream project.
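`sample.py` is canonical, but `generate` is presumably the standard autoregressive loop with temperature scaling and top-k filtering. A minimal sketch, assuming the forward pass returns logits of shape (batch, seq, vocab) and batch size 1:

```python
import torch

@torch.no_grad()
def generate(model, ids, max_new_tokens, temperature=0.8, top_k=200, eos_id=None):
    for _ in range(max_new_tokens):
        logits = model(ids[:, -1024:])[:, -1, :] / temperature  # crop to the 1024-token context
        if top_k is not None:
            kth = torch.topk(logits, top_k).values[:, -1, None]
            logits[logits < kth] = float("-inf")  # keep only the top-k logits
        next_id = torch.multinomial(torch.softmax(logits, dim=-1), num_samples=1)
        ids = torch.cat([ids, next_id], dim=1)
        if eos_id is not None and next_id.item() == eos_id:
            break
    return ids
```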
## License
Apache 2.0 for code and weights. Training data licenses follow their respective sources (see Datasets in metadata).