# child-12m
A 12.3M-parameter decoder-only transformer trained entirely from scratch on TinyStories: no pretrained checkpoint, no distillation, no transfer learning. Every weight is learned from random initialization.
## Results
| Metric | Value |
|---|---|
| Parameters (total) | 12,256,256 |
| Parameters (non-embedding) | ~10.1M (82% of total) |
| Validation loss (nats/token) | 1.2268 |
| Validation perplexity | 3.41 |
| Bits-per-byte | 0.4249 |
| Bytes per token | 4.17 |
| BLiMP accuracy (full, 67 paradigms) | 63.4% |
| Training tokens seen | ~1.6 billion |
| Total training steps | 24,919 |
Bits-per-byte is reported alongside perplexity so results can be compared fairly against models that use different tokenizers.
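As a sanity check, the reported numbers are mutually consistent: perplexity is `exp(loss)` and BPB is the loss converted from nats to bits, divided by the tokenizer's bytes-per-token ratio. A minimal recomputation (the small residual against 0.4249 comes from the rounded 4.17 bytes/token figure):

```python
import math

# recompute perplexity and bits-per-byte from the card's reported values
val_loss_nats = 1.2268        # validation loss, nats per token
bytes_per_token = 4.17        # tokenizer compression on validation text (rounded)

ppl = math.exp(val_loss_nats)                 # nats/token -> perplexity
bits_per_token = val_loss_nats / math.log(2)  # nats -> bits
bpb = bits_per_token / bytes_per_token        # normalize by byte length
print(round(ppl, 2), round(bpb, 4))
```

The same conversion lets this model be compared against any model reporting loss under a different tokenizer, as long as its bytes-per-token ratio on the same text is known.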
## BLiMP Evaluation (Full, 67 Paradigms)
63.4% overall on the full BLiMP benchmark (67 paradigms, 200 pairs each, 13,400 total evaluations). This is within the range of the BabyLM 2024 Strict-Small baselines (60.6–69.8%) at roughly half their parameter count, despite training exclusively on TinyStories' restricted register.
Reference points:
| Model | Params | BLiMP |
|---|---|---|
| Random baseline | – | 50.0% |
| BabyLM 2024 LTG-BERT 10M | ~25M | 60.6% |
| child-12m (this model) | 12.3M | 63.4% |
| BabyLM 2024 BabyLlama 10M | ~24M | 69.8% |
| GPT-2 Small (web-scale) | 124M | ~81% |
Notable finding: The score distribution is strongly bimodal. Paradigms that test constructions present in TinyStories (agreement, reflexives, basic negation) score 85–99%. Paradigms testing constructions absent from TinyStories (complex syntactic islands, NPI scope, distractor agreement) score at or below chance. This is a diagnostic of training distribution, not model capacity: the architecture learns what it is shown.
Top paradigms (≥83%): Principle A case (99.5%), sentential negation NPI licensing (99.0%), Principle A domain (98.5%), wh-vs-that no gap (95.0%), existential there quantifiers (91.0%), superlative quantifiers (87.5–90.0%), anaphor number agreement (88.0%), irregular past participles (83.0–88.0%), wh-subject gap long distance (87.0%)
Bottom paradigms (<25%): matrix question NPI (8.0%), only NPI scope (9.5%), wh-vs-that with gap long distance (9.5%), only NPI licensor (19.5%), wh-vs-that with gap (21.0%), coordinate structure left branch (24.5%)
Methodology note: The gap between a hand-crafted grammar probe (100% on 50 pairs testing TinyStories-native constructions) and full BLiMP (63.4%) illustrates why custom eval suites for TinyStories-class models can dramatically overestimate grammatical competence. Future evaluations of TinyStories-trained models should use BLiMP or equivalent standardized benchmarks.
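For context, BLiMP's scoring protocol is purely comparative: a pair is credited when the model assigns the grammatical sentence a higher total log-probability than its ungrammatical twin, so 50% is chance. A minimal sketch of that protocol, with hypothetical per-token log-probs standing in for real model outputs:

```python
def sentence_score(token_logprobs):
    # BLiMP uses the summed per-token log-probability as the sentence score
    return sum(token_logprobs)

def paradigm_accuracy(pairs):
    # pairs: list of (grammatical_logprobs, ungrammatical_logprobs);
    # a pair is correct when the grammatical sentence scores higher
    correct = sum(sentence_score(g) > sentence_score(b) for g, b in pairs)
    return correct / len(pairs)

# hypothetical scores for two minimal pairs
pairs = [
    ([-1.1, -0.9, -2.0], [-1.1, -0.9, -4.5]),  # model prefers grammatical
    ([-1.3, -2.2, -3.0], [-1.3, -1.8, -2.1]),  # model prefers ungrammatical
]
print(paradigm_accuracy(pairs))  # 0.5, i.e. chance
```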
## Architecture
A 14-layer Llama-style decoder-only transformer.
| Component | Specification |
|---|---|
| Layers | 14 |
| d_model | 256 |
| Attention | GQA (4 query heads, 2 KV heads) |
| Head dimension | 64 |
| FFN | SwiGLU, hidden dim 688 |
| Normalization | RMSNorm (pre-norm) |
| Position encoding | RoPE (theta=10000) |
| Features | QK-norm, z-loss, weight-tied embeddings |
| Vocabulary | 8,192 ByteLevelBPE |
| Max context | 2,048 |
| Dropout | 0.0 |
The architecture prioritizes depth (14 layers) over width (d_model=256) and uses a compact 8K vocabulary, allocating ~82% of the parameter budget to transformer computation rather than embeddings.
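The headline counts can be reproduced from the table above. The sketch below assumes QK-norm contributes one head_dim-sized RMSNorm gain each for queries and keys, and that the tied embedding matrix is counted once; under those assumptions it lands exactly on the reported total:

```python
# recompute the parameter budget from the architecture table
d, layers, vocab = 256, 14, 8192
n_q, n_kv, head_dim = 4, 2, 64
ffn = 688

emb = vocab * d                                # weight-tied: counted once
# GQA attention: Wq, Wk, Wv, Wo (no biases assumed)
attn = d * (n_q * head_dim) + 2 * d * (n_kv * head_dim) + (n_q * head_dim) * d
qk_norm = 2 * head_dim                         # RMSNorm gains on q and k (assumed shape)
swiglu = 3 * d * ffn                           # gate, up, down projections
norms = 2 * d                                  # pre-attention and pre-FFN RMSNorm
per_layer = attn + qk_norm + swiglu + norms

total = emb + layers * per_layer + d           # + final RMSNorm
non_emb = total - emb
print(total, non_emb, round(non_emb / total, 3))
```

Note how the small 8K vocabulary keeps the (tied) embedding at ~2.1M parameters, leaving ~10.2M for the transformer stack.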
## Generation Samples (step 24,919)
Prompt: "Once upon a time, there was a little"
Once upon a time, there was a little girl named Lily. She had a toy bear that she loved very much. One day, Lily went to the park to play with her bear. She saw a boy playing with a ball and asked him...
Prompt: "One day, a girl named Lily found"
One day, a girl named Lily found a mysterious box in her room. She tried to open it, but it was locked. She tried to find the key, but it was locked. Lily was sad because the box was too hard to open.
Prompt: "Mom said to the little boy,"
Mom said to the little boy, "What's wrong? Why are you fighting?" The little boy said he wanted to play. He said, "No, silly. I don't want to share with you."
Prompt: "Papa said"
Papa said, "Let's go, let's go!" They went into the kitchen. It was dark and quiet. Papa said, "Let's make a fire and stay warm." Mama said, "Okay, let's make a fire and stay warm."
Prompt: "She looked at the sky and"
She looked at the sky and saw a big rainbow. It was so big that she could not see the other side of it. She thought it was the most beautiful thing she had ever seen. Suddenly, a voice said, "What do..."
## Training
| Field | Value |
|---|---|
| Dataset | roneneldan/TinyStories (2,119,719 stories) |
| Tokens in dataset | 464,662,320 |
| Tokens seen | ~1.6 billion (3+ epochs) |
| Optimizer | AdamW (beta1=0.9, beta2=0.95, wd=0.1) |
| LR schedule | WSD: warmup + stable at 2e-4 + cosine decay to 1e-5; then constant 5e-5 sessions |
| Total steps | 24,919 |
| Batch size | 128 effective (micro-batch 32 × gradient accumulation 4) |
| Sequence length | 512 |
| Precision | float16 (autocast + GradScaler) |
| Gradient clip | 1.0 |
| Hardware | Kaggle T4 GPU (16 GB), single GPU, free tier |
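The token accounting in this table is straightforward to verify: 128 sequences of 512 tokens per optimizer step, over 24,919 steps, gives ~1.63B tokens, or about 3.5 passes over the 464.7M-token dataset:

```python
# sanity-check the token accounting in the training table
batch, seq_len, steps = 128, 512, 24_919
tokens_per_step = batch * seq_len              # 65,536 tokens per optimizer step
tokens_seen = tokens_per_step * steps          # ~1.63B, matching "~1.6 billion"
dataset_tokens = 464_662_320
epochs = tokens_seen / dataset_tokens          # matches "3+ epochs"
print(tokens_seen, round(epochs, 2))
```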
### Training History
| Checkpoint | Step | Train Loss | Val PPL | BPB |
|---|---|---|---|---|
| early | 4,532 | 1.597 | 4.21 | – |
| mid | 15,387 | 1.437 | 3.45 | 0.4285 |
| final | 24,919 | 1.396 | 3.41 | 0.4249 |
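The WSD (warmup–stable–decay) schedule used over these checkpoints can be sketched as below. The warmup length and the step at which decay begins are illustrative placeholders (the card does not report them), and the later constant-5e-5 sessions are omitted:

```python
import math

# illustrative WSD schedule: linear warmup -> constant peak -> cosine decay.
# warmup and stable_end are assumed values, not the run's actual settings.
def wsd_lr(step, warmup=1_000, stable_end=20_000, total=24_919,
           peak=2e-4, floor=1e-5):
    if step < warmup:                      # linear warmup to peak
        return peak * step / warmup
    if step < stable_end:                  # constant ("stable") phase at peak
        return peak
    # cosine decay from peak down to floor over the remaining steps
    t = (step - stable_end) / (total - stable_end)
    return floor + 0.5 * (peak - floor) * (1 + math.cos(math.pi * t))
```

The appeal of WSD here is that the stable phase can be extended (or resumed) without committing to a total step count in advance, which suits incremental free-tier training sessions.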
## Limitations
- Trained exclusively on TinyStories (3–4-year-old vocabulary level)
- BLiMP performance is strongly distribution-dependent: high on TinyStories-native constructions, at-chance on absent ones
- Single training run; no seed variance reported
- Not evaluated on general-purpose benchmarks
- Not suitable for any production or safety-critical use
## Weights
Weights are not publicly released. This model card documents architecture and results for research reference.
## Future Work
- Net2Net function-preserving expansion to 50M+ with diverse data (Simple Wikipedia)
- Full BLiMP re-evaluation after data diversification to test whether bottom paradigms improve
- Distortion-Reactive Plasticity (DRP) experiments
- Dialogue fine-tuning and identity training at larger scale
## Citation
```bibtex
@misc{child12m2026,
  title  = {child-12m: A 12.3M Parameter Language Model Trained From Scratch on TinyStories},
  author = {Radji, Kraim and Claude (Anthropic Opus 4.6)},
  year   = {2026},
  note   = {Perplexity 3.41, BPB 0.4249, BLiMP 63.4\% (full 67 paradigms). Weights not released.}
}
```