---
license: apache-2.0
language:
- en
library_name: pytorch
tags:
- causal-lm
- pretrained-from-scratch
- small-lm
- gpt
datasets:
- roneneldan/TinyStories
- roneneldan/TinyStoriesInstruct
- wikimedia/wikipedia
- nampdn-ai/tiny-textbooks
pipeline_tag: text-generation
---
# tiny-38m
A 37.8M-parameter decoder-only transformer pretrained from scratch on a mix of small, simple-vocabulary corpora. Pure PyTorch, single GPU, no HF Trainer, no PEFT, no distillation.
Educational artifact: it demonstrates that the modern transformer recipe (RMSNorm + RoPE + SwiGLU + SDPA) reaches coherent output at small scale on a single GPU.
## Quick start
```python
import json, sys, torch
from pathlib import Path
from huggingface_hub import snapshot_download
from tokenizers import Tokenizer
from safetensors.torch import load_file

# Download the full repo: weights, tokenizer, and the model/config source files.
local = snapshot_download("darthcrawl/tiny-38m")
sys.path.insert(0, local)
from config import ModelConfig
from model import GPT

# Rebuild the config, keeping only keys that ModelConfig actually defines.
cfg_dict = json.loads((Path(local) / "config.json").read_text())
valid = set(ModelConfig.__dataclass_fields__)
cfg = ModelConfig(**{k: v for k, v in cfg_dict.items() if k in valid})

# strict=False: the tied embedding is stored once (see note below).
model = GPT(cfg).eval()
model.load_state_dict(load_file(f"{local}/model.safetensors"), strict=False)

tok = Tokenizer.from_file(f"{local}/tokenizer.json")
eot = tok.token_to_id("<|endoftext|>")
ids = torch.tensor([tok.encode("Once upon a time, there was a small dragon").ids], dtype=torch.long)
out = model.generate(ids, max_new_tokens=200, temperature=0.8, top_k=200, eos_id=eot)
print(tok.decode(out[0].tolist()))
```
`strict=False` is needed because the embeddings are tied (`lm_head.weight = tok_emb.weight`) and the shared tensor is stored only once in the checkpoint; the loader reports the other name as missing, and the tying resolves it.
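For context, a minimal sketch of the tying pattern (class and field names are illustrative, not the repo's `model.py`):
```python
import torch.nn as nn

class TinyLM(nn.Module):
    def __init__(self, vocab_size: int = 8192, dim: int = 512):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, dim)
        self.lm_head = nn.Linear(dim, vocab_size, bias=False)
        # Tie the output projection to the input embedding: one tensor, two names.
        self.lm_head.weight = self.tok_emb.weight

m = TinyLM()
# Both names point at the same storage; safetensors serializes shared tensors
# under a single key, so the other name shows up as "missing" at load time.
assert m.lm_head.weight.data_ptr() == m.tok_emb.weight.data_ptr()
```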
## Architecture
| Field | Value |
|---|---|
| Type | Decoder-only transformer |
| Parameters | 37.8M |
| Layers | 8 |
| Hidden dim | 512 |
| Attention heads | 8 |
| Context length | 1024 |
| Vocab size | 8192 |
| Position encoding | RoPE |
| Norm | RMSNorm (pre-norm) |
| MLP | SwiGLU |
| Attention | PyTorch SDPA, causal |
| Embedding tying | Yes |
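For illustration, here is a minimal sketch of two of those pieces (RMSNorm and SwiGLU) in PyTorch. Names and the SwiGLU hidden width are assumptions, not read from the repo's `model.py` or `config.json`:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Rescale by the root-mean-square of the activations; no mean-centering, no bias."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        rms = x.pow(2).mean(-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

class SwiGLU(nn.Module):
    """Gated MLP: silu(w1(x)) * w3(x), projected back down by w2."""
    def __init__(self, dim: int = 512, hidden: int = 1376):
        # hidden ~ (8/3) * dim rounded up; an assumption, not the repo's value.
        super().__init__()
        self.w1 = nn.Linear(dim, hidden, bias=False)  # gate branch
        self.w3 = nn.Linear(dim, hidden, bias=False)  # value branch
        self.w2 = nn.Linear(hidden, dim, bias=False)  # down-projection

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))
```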
## Training
| Setting | Value |
|---|---|
| Source mix | `tinystories:60,tinystories_instruct:15,simple_wiki:15,tiny_textbooks:10` |
| Total train tokens | 477,521,740 |
| Best ckpt step | 19500 |
| Best val loss | 1.8847 |
| Optimizer | AdamW (β = (0.9, 0.95), weight decay 0.1) |
| Peak LR | 6e-4 |
| LR schedule | Cosine decay, 200-step warmup |
| Batch size | 32 × grad-accum 4 (128 effective) |
| Precision | bfloat16 (AMP) |
| Hardware | Single GPU |
Mix format is `name:weight,...`. `meta.txt` in this repo is the canonical record.
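For reference, the warmup-then-cosine schedule implied by the table, as a minimal sketch (the total step count and LR floor here are placeholders, not from this card; the actual loop lives in the upstream project):
```python
import math

def lr_at(step: int, peak_lr: float = 6e-4, warmup: int = 200,
          total_steps: int = 20_000, min_lr: float = 6e-5) -> float:
    # Linear warmup to the peak, then cosine decay to a floor.
    # total_steps and min_lr are assumptions for illustration only.
    if step < warmup:
        return peak_lr * (step + 1) / warmup
    t = min((step - warmup) / max(1, total_steps - warmup), 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * t))
```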
## Tokenizer
Byte-level BPE trained on the same source mix. Single `tokenizer.json` (HuggingFace `tokenizers` format) with an 8192-token vocabulary. Special tokens: `<|endoftext|>` (eot/eos), `<|pad|>`.
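A minimal sketch of training such a tokenizer with the `tokenizers` library; the corpus path and exact trainer settings are assumptions, the real pipeline is in the upstream project:
```python
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

tok = Tokenizer(models.BPE())
tok.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tok.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=8192,
    special_tokens=["<|endoftext|>", "<|pad|>"],
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
)
tok.train(files=["corpus.txt"], trainer=trainer)  # corpus.txt is a placeholder
tok.save("tokenizer.json")
```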
## What it can do
- Continue toddler-level English narratives in TinyStories register.
- Produce short factual-sounding text in the simple-Wikipedia register.
- Follow basic prompt → story patterns from TinyStoriesInstruct (see the sketch below).
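As a usage illustration, continuing the quick-start session with an instruct-style prompt. The field layout below is an assumption modeled loosely on TinyStoriesInstruct; check the dataset card for the actual template:
```python
# Reuses `model`, `tok`, `eot`, and the torch import from the quick start above.
prompt = "Words: dragon, cave, friend\nSummary: a small dragon finds a friend.\nStory:"
ids = torch.tensor([tok.encode(prompt).ids], dtype=torch.long)
out = model.generate(ids, max_new_tokens=200, temperature=0.8, top_k=200, eos_id=eot)
print(tok.decode(out[0].tolist()))
```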
## What it can't do
- General-knowledge QA, code, math, multi-turn chat, reasoning, or instruction-following beyond the patterns in the training mix.
- Out-of-distribution vocabulary: the vocab is small and the corpus is intentionally narrow.
- Reliable factuality: even on simple-wiki-style prompts it will confabulate.
## Intended use
Education, replication, ablations, and a baseline for from-scratch pretraining experiments. Not for production use.
## Limitations and bias
Inherits whatever biases live in the synthetic TinyStories corpora and Simple English Wikipedia. Outputs are not safe for any user-facing application. No safety alignment, no instruction tuning, no RLHF.
## Reproducibility
Inference code (`model.py`, `config.py`, `sample.py`) ships in this repo. Full training pipeline (tokenizer, data prep, training loop, source mixing) is in the upstream project.
## License
Apache 2.0 for code and weights. Training data licenses follow their respective sources (see Datasets in metadata).