# child-12m
A 12.3M-parameter decoder-only transformer trained entirely from scratch on TinyStories: no pretrained checkpoint, no distillation, no transfer learning. Every weight is learned from random initialization.
## Results
| Metric | Value |
|---|---|
| Parameters (total) | 12,256,256 |
| Parameters (non-embedding) | ~10.1M (82% of total) |
| Validation loss (nats/token) | 1.2268 |
| Validation perplexity | 3.41 |
| Bits-per-byte | 0.4249 |
| Bytes per token | 4.17 |
| BLiMP accuracy (full, 67 paradigms) | 63.4% |
| Training tokens seen | ~1.6 billion |
| Total training steps | 24,919 |
Bits-per-byte is reported alongside perplexity so results can be compared fairly against models that use different tokenizers.
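As a sanity check, the reported numbers are mutually consistent: perplexity is `exp(loss)` and BPB is the loss converted from nats to bits, divided by the tokenizer's bytes-per-token ratio. A minimal recomputation (the small residual against 0.4249 comes from the rounded 4.17 bytes/token figure):

```python
import math

# recompute perplexity and bits-per-byte from the card's reported values
val_loss_nats = 1.2268        # validation loss, nats per token
bytes_per_token = 4.17        # tokenizer compression on validation text (rounded)

ppl = math.exp(val_loss_nats)                 # nats/token -> perplexity
bits_per_token = val_loss_nats / math.log(2)  # nats -> bits
bpb = bits_per_token / bytes_per_token        # normalize by byte length
print(round(ppl, 2), round(bpb, 4))
```

The same conversion lets this model be compared against any model reporting loss under a different tokenizer, as long as its bytes-per-token ratio on the same text is known.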
## BLiMP Evaluation (Full, 67 Paradigms)
63.4% overall on the full BLiMP benchmark (67 paradigms, 200 pairs each, 13,400 total evaluations). This is within the range of the BabyLM 2024 Strict-Small baselines (60.6–69.8%) at roughly half their parameter count, despite training exclusively on TinyStories' restricted register.
Reference points:
| Model | Params | BLiMP |
|---|---|---|
| Random baseline | – | 50.0% |
| BabyLM 2024 LTG-BERT 10M | ~25M | 60.6% |
| child-12m (this model) | 12.3M | 63.4% |
| BabyLM 2024 BabyLlama 10M | ~24M | 69.8% |
| GPT-2 Small (web-scale) | 124M | ~81% |
Notable finding: The score distribution is strongly bimodal. Paradigms that test constructions present in TinyStories (agreement, reflexives, basic negation) score 85–99%. Paradigms testing constructions absent from TinyStories (complex syntactic islands, NPI scope, distractor agreement) score at or below chance. This is a diagnostic of training distribution, not model capacity: the architecture learns what it is shown.
Top paradigms (≥83%): Principle A case (99.5%), sentential negation NPI licensing (99.0%), Principle A domain (98.5%), wh-vs-that no gap (95.0%), existential there quantifiers (91.0%), superlative quantifiers (87.5–90.0%), anaphor number agreement (88.0%), irregular past participles (83.0–88.0%), wh-subject gap long distance (87.0%)
Bottom paradigms (<25%): matrix question NPI (8.0%), only NPI scope (9.5%), wh-vs-that with gap long distance (9.5%), only NPI licensor (19.5%), wh-vs-that with gap (21.0%), coordinate structure left branch (24.5%)
Methodology note: The gap between a hand-crafted grammar probe (100% on 50 pairs testing TinyStories-native constructions) and full BLiMP (63.4%) illustrates why custom eval suites for TinyStories-class models can dramatically overestimate grammatical competence. Future evaluations of TinyStories-trained models should use BLiMP or equivalent standardized benchmarks.
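For context, BLiMP's scoring protocol is purely comparative: a pair is credited when the model assigns the grammatical sentence a higher total log-probability than its ungrammatical twin, so 50% is chance. A minimal sketch of that protocol, with hypothetical per-token log-probs standing in for real model outputs:

```python
def sentence_score(token_logprobs):
    # BLiMP uses the summed per-token log-probability as the sentence score
    return sum(token_logprobs)

def paradigm_accuracy(pairs):
    # pairs: list of (grammatical_logprobs, ungrammatical_logprobs);
    # a pair is correct when the grammatical sentence scores higher
    correct = sum(sentence_score(g) > sentence_score(b) for g, b in pairs)
    return correct / len(pairs)

# hypothetical scores for two minimal pairs
pairs = [
    ([-1.1, -0.9, -2.0], [-1.1, -0.9, -4.5]),  # model prefers grammatical
    ([-1.3, -2.2, -3.0], [-1.3, -1.8, -2.1]),  # model prefers ungrammatical
]
print(paradigm_accuracy(pairs))  # 0.5, i.e. chance
```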
## Architecture
A 14-layer Llama-style decoder-only transformer.
| Component | Specification |
|---|---|
| Layers | 14 |
| d_model | 256 |
| Attention | GQA (4 query heads, 2 KV heads) |
| Head dimension | 64 |
| FFN | SwiGLU, hidden dim 688 |
| Normalization | RMSNorm (pre-norm) |
| Position encoding | RoPE (theta=10000) |
| Features | QK-norm, z-loss, weight-tied embeddings |
| Vocabulary | 8,192 ByteLevelBPE |
| Max context | 2,048 |
| Dropout | 0.0 |
The architecture prioritizes depth (14 layers) over width (d_model=256) and uses a compact 8K vocabulary, allocating ~82% of the parameter budget to transformer computation rather than embeddings.
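The headline counts can be reproduced from the table above. The sketch below assumes QK-norm contributes one head_dim-sized RMSNorm gain each for queries and keys, and that the tied embedding matrix is counted once; under those assumptions it lands exactly on the reported total:

```python
# recompute the parameter budget from the architecture table
d, layers, vocab = 256, 14, 8192
n_q, n_kv, head_dim = 4, 2, 64
ffn = 688

emb = vocab * d                                # weight-tied: counted once
# GQA attention: Wq, Wk, Wv, Wo (no biases assumed)
attn = d * (n_q * head_dim) + 2 * d * (n_kv * head_dim) + (n_q * head_dim) * d
qk_norm = 2 * head_dim                         # RMSNorm gains on q and k (assumed shape)
swiglu = 3 * d * ffn                           # gate, up, down projections
norms = 2 * d                                  # pre-attention and pre-FFN RMSNorm
per_layer = attn + qk_norm + swiglu + norms

total = emb + layers * per_layer + d           # + final RMSNorm
non_emb = total - emb
print(total, non_emb, round(non_emb / total, 3))
```

Note how the small 8K vocabulary keeps the (tied) embedding at ~2.1M parameters, leaving ~10.2M for the transformer stack.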
## Generation Samples (step 24,919)
Prompt: "Once upon a time, there was a little"
Once upon a time, there was a little girl named Lily. She had a toy bear that she loved very much. One day, Lily went to the park to play with her bear. She saw a boy playing with a ball and asked him...
Prompt: "One day, a girl named Lily found"
One day, a girl named Lily found a mysterious box in her room. She tried to open it, but it was locked. She tried to find the key, but it was locked. Lily was sad because the box was too hard to open.
Prompt: "Mom said to the little boy,"
Mom said to the little boy, "What's wrong? Why are you fighting?" The little boy said he wanted to play. He said, "No, silly. I don't want to share with you."
Prompt: "Papa said"
Papa said, "Let's go, let's go!" They went into the kitchen. It was dark and quiet. Papa said, "Let's make a fire and stay warm." Mama said, "Okay, let's make a fire and stay warm."
Prompt: "She looked at the sky and"
She looked at the sky and saw a big rainbow. It was so big that she could not see the other side of it. She thought it was the most beautiful thing she had ever seen. Suddenly, a voice said, "What do..."
## Training
| Field | Value |
|---|---|
| Dataset | roneneldan/TinyStories (2,119,719 stories) |
| Tokens in dataset | 464,662,320 |
| Tokens seen | ~1.6 billion (3+ epochs) |
| Optimizer | AdamW (beta1=0.9, beta2=0.95, wd=0.1) |
| LR schedule | WSD: warmup + stable at 2e-4 + cosine decay to 1e-5; then constant 5e-5 sessions |
| Total steps | 24,919 |
| Batch size | 128 effective (micro-batch 32 × gradient accumulation 4) |
| Sequence length | 512 |
| Precision | float16 (autocast + GradScaler) |
| Gradient clip | 1.0 |
| Hardware | Kaggle T4 GPU (16 GB), single GPU, free tier |
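The token accounting in this table is straightforward to verify: 128 sequences of 512 tokens per optimizer step, over 24,919 steps, gives ~1.63B tokens, or about 3.5 passes over the 464.7M-token dataset:

```python
# sanity-check the token accounting in the training table
batch, seq_len, steps = 128, 512, 24_919
tokens_per_step = batch * seq_len              # 65,536 tokens per optimizer step
tokens_seen = tokens_per_step * steps          # ~1.63B, matching "~1.6 billion"
dataset_tokens = 464_662_320
epochs = tokens_seen / dataset_tokens          # matches "3+ epochs"
print(tokens_seen, round(epochs, 2))
```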
### Training History
| Checkpoint | Step | Train Loss | Val PPL | BPB |
|---|---|---|---|---|
| early | 4,532 | 1.597 | 4.21 | – |
| mid | 15,387 | 1.437 | 3.45 | 0.4285 |
| final | 24,919 | 1.396 | 3.41 | 0.4249 |
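The WSD (warmup–stable–decay) schedule used over these checkpoints can be sketched as below. The warmup length and the step at which decay begins are illustrative placeholders (the card does not report them), and the later constant-5e-5 sessions are omitted:

```python
import math

# illustrative WSD schedule: linear warmup -> constant peak -> cosine decay.
# warmup and stable_end are assumed values, not the run's actual settings.
def wsd_lr(step, warmup=1_000, stable_end=20_000, total=24_919,
           peak=2e-4, floor=1e-5):
    if step < warmup:                      # linear warmup to peak
        return peak * step / warmup
    if step < stable_end:                  # constant ("stable") phase at peak
        return peak
    # cosine decay from peak down to floor over the remaining steps
    t = (step - stable_end) / (total - stable_end)
    return floor + 0.5 * (peak - floor) * (1 + math.cos(math.pi * t))
```

The appeal of WSD here is that the stable phase can be extended (or resumed) without committing to a total step count in advance, which suits incremental free-tier training sessions.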
## Limitations
- Trained exclusively on TinyStories (3–4-year-old vocabulary level)
- BLiMP performance is strongly distribution-dependent: high on TinyStories-native constructions, at-chance on absent ones
- Single training run; no seed variance reported
- Not evaluated on general-purpose benchmarks
- Not suitable for any production or safety-critical use
## Weights
Weights are not publicly released. This model card documents architecture and results for research reference.
## Future Work
- Net2Net function-preserving expansion to 50M+ with diverse data (Simple Wikipedia)
- Full BLiMP re-evaluation after data diversification to test whether bottom paradigms improve
- Distortion-Reactive Plasticity (DRP) experiments
- Dialogue fine-tuning and identity training at larger scale
## Citation
```bibtex
@misc{child12m2026,
  title  = {child-12m: A 12.3M Parameter Language Model Trained From Scratch on TinyStories},
  author = {Radji, Kraim and Claude (Anthropic Opus 4.6)},
  year   = {2026},
  note   = {Perplexity 3.41, BPB 0.4249, BLiMP 63.4\% (full 67 paradigms). Weights not released.}
}
```