NSTS: Narrative Structure in Tiny Stories
Model checkpoints accompanying the paper "Fluency Is Not Coherence: What Small Language Models Actually Learn".
Overview
This repository contains a depth-controlled width series of GPT-Neo models trained from scratch on subsets of the TinyStories corpus filtered by Flesch-Kincaid (FK) grade level. The series is designed to study the dissociation between fluency and narrative coherence in small language models.
The central finding: fluency scales monotonically with model size and training duration, whereas narrative coherence plateaus at a low ceiling (~2.4/5.0) that additional capacity, training, and corpus enrichment all fail to raise. This is neither a capacity problem nor a data problem; it is a property of what next-token cross-entropy prediction selects for.
Model Series
All models use the GPT-Neo architecture with 8 layers and 8 attention heads; only the hidden dimension varies (a depth-controlled width series).
| Size | Hidden | Non-emb params | Total params |
|---|---|---|---|
| 1M | 64 | ~1.4M | ~3.6M |
| 5M | 144 | ~5.3M | ~9.2M |
| 10M | 256 | ~10.7M | ~19.3M |
| 28M | 480 | ~28.4M | ~46.5M |
| 33M | 512 | ~33.4M | ~51.0M |
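One way to read the table is to recover the embedding parameters as total minus non-embedding parameters. The sketch below does only that arithmetic on the approximate figures above; the names `SERIES` and `embedding_share` are illustrative, and the results inherit the table's rounding.

```python
# Approximate parameter counts (in millions), copied from the table above.
SERIES = {
    "1M": {"hidden": 64, "non_emb": 1.4, "total": 3.6},
    "5M": {"hidden": 144, "non_emb": 5.3, "total": 9.2},
    "10M": {"hidden": 256, "non_emb": 10.7, "total": 19.3},
    "28M": {"hidden": 480, "non_emb": 28.4, "total": 46.5},
    "33M": {"hidden": 512, "non_emb": 33.4, "total": 51.0},
}


def embedding_share(size: str) -> float:
    """Fraction of total parameters that are embeddings (total minus non-embedding)."""
    row = SERIES[size]
    return (row["total"] - row["non_emb"]) / row["total"]


for size in SERIES:
    print(f"{size}: {embedding_share(size):.0%} of parameters are embeddings")
```

On these rounded figures, embeddings make up a substantial share of every model in the series, which is why the size labels follow the non-embedding count.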
Training Conditions
Two FK-filtered subsets of TinyStories, holding vocabulary distribution constant:
- Condition A: FK grade < 3 (860K stories, mean sentence length 8.05 words)
- Condition B: FK grade 4–5 (574K stories, mean sentence length 12.35 words)
Checkpoints saved every 100 optimiser steps (17 checkpoints for Cond A, 23 for Cond B).
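The checkpoint counts can be reproduced with a small sketch, assuming a checkpoint is written at every multiple of 100 steps plus one at the final step of each run (1680 for Cond A, 2240 for Cond B, as listed under Checkpoint Structure). The helper `checkpoint_steps` is hypothetical and not part of the repository.

```python
def checkpoint_steps(final_step: int, interval: int = 100) -> list[int]:
    """Steps at which a checkpoint is assumed to exist: every `interval` steps, plus the final step."""
    steps = list(range(interval, final_step + 1, interval))
    # Assumption: the last training step is always saved, even off the interval grid.
    if not steps or steps[-1] != final_step:
        steps.append(final_step)
    return steps


COND_A_STEPS = checkpoint_steps(1680)
COND_B_STEPS = checkpoint_steps(2240)

print(len(COND_A_STEPS), COND_A_STEPS[-2:])
print(len(COND_B_STEPS), COND_B_STEPS[-2:])
```

Under that assumption the lists contain 17 and 23 entries, matching the counts stated above.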
Checkpoint Structure
nsts_cond{A|B}_{size}/
    checkpoint-100/
    checkpoint-200/
    ...
    checkpoint-1680/   # Cond A final
    checkpoint-2240/   # Cond B final
Note: the 5M runs use different directory names. nsts_condA_5M is stored as nsts_condA_5M_ep2 (2-epoch run) and nsts_condB_5M as nsts_condB_5M_ep4 (4-epoch run). These extended-epoch runs still contain checkpoints at steps 1680 (Cond A) and 2240 (Cond B) for direct comparison.
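The naming scheme and its 5M exception can be captured in a small helper. This is a sketch under the layout described above; `run_dir` and `checkpoint_pattern` are hypothetical names, not functions shipped with the repository.

```python
SIZES = ("1M", "5M", "10M", "28M", "33M")

# The 5M runs are stored under extended-epoch directory names.
RENAMED = {
    ("A", "5M"): "nsts_condA_5M_ep2",
    ("B", "5M"): "nsts_condB_5M_ep4",
}


def run_dir(condition: str, size: str) -> str:
    """Directory name for a training condition ('A' or 'B') and a model size label."""
    if condition not in ("A", "B"):
        raise ValueError(f"unknown condition: {condition!r}")
    if size not in SIZES:
        raise ValueError(f"unknown size: {size!r}")
    return RENAMED.get((condition, size), f"nsts_cond{condition}_{size}")


def checkpoint_pattern(condition: str, size: str, step: int) -> str:
    """Glob pattern matching all files of one checkpoint, suitable for allow_patterns."""
    return f"{run_dir(condition, size)}/checkpoint-{step}/*"


print(checkpoint_pattern("A", "10M", 1680))
print(checkpoint_pattern("B", "5M", 2240))
```

Routing every lookup through one function keeps the 5M special case out of downstream analysis code.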
Loading a Checkpoint
from huggingface_hub import snapshot_download
from transformers import AutoModelForCausalLM, AutoTokenizer

# Download a specific checkpoint
path = snapshot_download(
    repo_id="Dan44788/NSTS",
    allow_patterns="nsts_condA_10M/checkpoint-1680/*"
)

model = AutoModelForCausalLM.from_pretrained(f"{path}/nsts_condA_10M/checkpoint-1680")
tokenizer = AutoTokenizer.from_pretrained(f"{path}/nsts_condA_10M/checkpoint-1680")
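Once a checkpoint is loaded, sampling a continuation is a quick way to inspect fluency and coherence by eye. The following is a minimal sketch, not the paper's evaluation protocol: the helper name `generate_story`, the prompt, and the sampling settings (`top_p=0.95`, 100 new tokens) are all assumptions chosen for illustration.

```python
def generate_story(model, tokenizer, prompt: str, max_new_tokens: int = 100) -> str:
    """Sample one continuation of `prompt` from a loaded causal language model."""
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,   # sample instead of greedy decoding (assumed setting)
        top_p=0.95,       # nucleus sampling threshold (assumed setting)
        pad_token_id=tokenizer.eos_token_id,  # avoids the missing-pad-token warning
    )
    # generate() returns a batch; decode the first (and only) sequence.
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

With the `model` and `tokenizer` loaded above, a call such as `generate_story(model, tokenizer, "Once upon a time")` returns the prompt followed by the sampled continuation.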
Key Results
- Fluency and coherence dissociate sharply at ~800–1000 training steps, a transition point that is scale-invariant across all model sizes
- Best coherence score achieved: 2.43/5.0 ("a story that resolves with gaps") at 28M parameters
- Seven direct interventions, including corpus enrichment, structured narrative injection, causal markers, extended training, and prefix prompting, all produced null results on coherence
- BLiMP probing reveals the same local/non-local split independently: local grammatical constraints scale with width; non-local constraints are flat at chance
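For readers unfamiliar with BLiMP, the benchmark scores a model on minimal pairs: the model is counted correct when it assigns a higher probability to the grammatical sentence than to its ungrammatical twin, so chance is 0.5. The sketch below shows that scoring rule only. The log-probabilities are made-up illustrative numbers, not results from the paper, and the exact probing setup used for these checkpoints is not specified here.

```python
def minimal_pair_accuracy(pairs) -> float:
    """Fraction of (grammatical, ungrammatical) log-prob pairs where the grammatical one scores higher."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one minimal pair")
    correct = sum(1 for good, bad in pairs if good > bad)
    return correct / len(pairs)


# Hypothetical sentence log-probabilities: (grammatical, ungrammatical).
toy_pairs = [
    (-12.1, -14.3),  # model prefers the grammatical sentence
    (-9.8, -9.2),    # model prefers the ungrammatical sentence
    (-20.5, -22.0),
    (-15.0, -15.9),
]

print(minimal_pair_accuracy(toy_pairs))  # 3 of 4 pairs correct: 0.75
```

An accuracy near 0.5 on a phenomenon, as reported above for non-local constraints, means the model does no better than guessing on those pairs.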
Intended Use
These checkpoints are intended as a test-bed for mechanistic interpretability research at tractable scale. The clean behavioural dissociation between fluency and coherence provides a well-characterised target for circuit-level analysis.
Weight trajectories
Citation
Fluency Is Not Coherence: What Small Language Models Actually Learn (v1)