---
language: en
license: mit
tags:
- mechanistic-interpretability
- tiny-stories
- gpt-neo
- narrative-coherence
- research
datasets:
- roneneldan/TinyStories
base_model: EleutherAI/gpt-neo-125m
---

# NSTS: Narrative Structure in Tiny Stories

Model checkpoints accompanying the paper **"Fluency Is Not Coherence: What Small Language Models Actually Learn"**.

## Overview

This repository contains a width series of GPT-Neo models at fixed depth, trained from scratch on Flesch-Kincaid (FK) grade-filtered subsets of the [TinyStories](https://huggingface.co/datasets/roneneldan/TinyStories) corpus. The series is designed to study the dissociation between fluency and narrative coherence in small language models.

The central finding: fluency scales monotonically with model size and training duration, whereas narrative coherence plateaus at a low ceiling (~2.4/5.0) that additional capacity, training, and corpus enrichment all fail to raise. This is neither a capacity problem nor a data problem; it is a property of what next-token cross-entropy prediction selects for.

## Model Series

All models use the GPT-Neo architecture with **8 layers and 8 attention heads**, varying only the hidden dimension (a depth-controlled width series).

| Size | Hidden dim | Non-embedding params | Total params |
|------|------------|----------------------|--------------|
| 1M | 64 | ~1.4M | ~3.6M |
| 5M | 144 | ~5.3M | ~9.2M |
| 10M | 256 | ~10.7M | ~19.3M |
| 28M | 480 | ~28.4M | ~46.5M |
| 33M | 512 | ~33.4M | ~51.0M |

## Training Conditions

Models are trained on two FK-filtered subsets of TinyStories, holding the vocabulary distribution constant:

- **Condition A** (FK grade < 3): 860K stories, mean sentence length 8.05 words
- **Condition B** (FK grade 4–5): 574K stories, mean sentence length 12.35 words
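
The card does not say which tool computed the FK grade. As a rough illustration only, the sketch below applies the standard Flesch-Kincaid grade-level formula with a crude vowel-group syllable heuristic, then maps a story to a condition. The function names, the syllable heuristic, and the inclusive 4 to 5 boundary for Condition B are all assumptions, not the paper's pipeline.

```python
import re
from typing import Optional


def count_syllables(word: str) -> int:
    """Crude heuristic (assumption): one syllable per run of vowels, minimum one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def fk_grade(text: str) -> float:
    """Flesch-Kincaid grade: 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59


def assign_condition(text: str) -> Optional[str]:
    """Map a story to 'A' (grade < 3) or 'B' (grade 4 to 5, assumed inclusive), else None."""
    grade = fk_grade(text)
    if grade < 3:
        return "A"
    if 4 <= grade <= 5:
        return "B"
    return None
```

For example, `assign_condition("The cat sat. The dog ran.")` returns `"A"`: six one-syllable words in two sentences give a grade well below 3. A production filter would likely use a dictionary-based syllable counter rather than this heuristic.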

Checkpoints are saved every 100 optimiser steps and at the final step (17 checkpoints for Condition A, 23 for Condition B).
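
The checkpoint counts follow from the 100-step interval together with the final steps given under Checkpoint Structure (1680 for Condition A, 2240 for Condition B). A minimal sketch, assuming the final step is saved in addition to the regular 100-step checkpoints; `checkpoint_steps` is a hypothetical helper, not part of this repository:

```python
def checkpoint_steps(final_step: int, interval: int = 100) -> list[int]:
    """List the saved steps: every `interval` steps, plus the final step if it is off-grid."""
    steps = list(range(interval, final_step + 1, interval))
    if not steps or steps[-1] != final_step:
        steps.append(final_step)
    return steps


cond_a_steps = checkpoint_steps(1680)  # 100, 200, ..., 1600, 1680
cond_b_steps = checkpoint_steps(2240)  # 100, 200, ..., 2200, 2240
print(len(cond_a_steps), len(cond_b_steps))  # 17 23
```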

## Checkpoint Structure

```
nsts_cond{A|B}_{size}/
    checkpoint-100/
    checkpoint-200/
    ...
    checkpoint-1680/   # Cond A final
    checkpoint-2240/   # Cond B final
```

Note: the 5M runs are extended-epoch runs and carry an epoch suffix in their directory names: `nsts_condA_5M_ep2` (2-epoch run) instead of `nsts_condA_5M`, and `nsts_condB_5M_ep4` (4-epoch run) instead of `nsts_condB_5M`. Both still contain checkpoints at steps 1680 (Condition A) and 2240 (Condition B) for direct comparison with the other sizes.
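
The naming scheme and the 5M exception can be captured in a small path helper. This is a sketch under stated assumptions: `checkpoint_subfolder` is a hypothetical name, and the size labels (`"1M"`, `"5M"`, `"10M"`, `"28M"`, `"33M"`) are assumed to match the Size column of the model table.

```python
from typing import Optional

# Final training step per condition, as listed in the directory layout.
FINAL_STEP = {"A": 1680, "B": 2240}

# The extended-epoch 5M runs carry an epoch suffix in their directory names.
RUN_SUFFIX = {("A", "5M"): "_ep2", ("B", "5M"): "_ep4"}


def checkpoint_subfolder(condition: str, size: str, step: Optional[int] = None) -> str:
    """Return the repo-relative checkpoint folder; defaults to the condition's final step."""
    if condition not in FINAL_STEP:
        raise ValueError(f"unknown condition: {condition!r}")
    if step is None:
        step = FINAL_STEP[condition]
    suffix = RUN_SUFFIX.get((condition, size), "")
    return f"nsts_cond{condition}_{size}{suffix}/checkpoint-{step}"


print(checkpoint_subfolder("A", "10M"))     # nsts_condA_10M/checkpoint-1680
print(checkpoint_subfolder("B", "5M", 100)) # nsts_condB_5M_ep4/checkpoint-100
```

The returned string can be combined with `/*` to form a download pattern, or appended to a local snapshot path.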

## Loading a Checkpoint

```python
from huggingface_hub import snapshot_download
from transformers import AutoModelForCausalLM, AutoTokenizer

# Download only the files belonging to one specific checkpoint
path = snapshot_download(
    repo_id="Dan44788/NSTS",
    allow_patterns="nsts_condA_10M/checkpoint-1680/*",
)

# snapshot_download returns the local snapshot root; the checkpoint lives in its subfolder
checkpoint_dir = f"{path}/nsts_condA_10M/checkpoint-1680"
model = AutoModelForCausalLM.from_pretrained(checkpoint_dir)
tokenizer = AutoTokenizer.from_pretrained(checkpoint_dir)
```

## Key Results

- Fluency and coherence **dissociate sharply** at ~800–1000 training steps, a scale-invariant transition point observed at every model size
- Best coherence score achieved: **2.43/5.0** ("a story that resolves with gaps") at 28M parameters
- Seven direct interventions (including corpus enrichment, structured narrative injection, causal markers, extended training, and prefix prompting) all produced null results on coherence
- BLiMP probing independently reveals the same local/non-local split: local grammatical constraints scale with width, while non-local constraints stay flat at chance
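
BLiMP is a minimal-pair benchmark: a model counts as correct on a pair when it scores the grammatical sentence above the ungrammatical one, so chance accuracy is 50%. The card does not describe the exact probing setup, so the following is only a sketch of the scoring logic. `minimal_pair_accuracy` and the `score` callable are hypothetical; in practice `score` would be something like a model's summed token log-probability for the sentence.

```python
from typing import Callable, Iterable, Tuple


def minimal_pair_accuracy(
    pairs: Iterable[Tuple[str, str]],
    score: Callable[[str], float],
) -> float:
    """Fraction of (grammatical, ungrammatical) pairs where the grammatical one scores higher."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("at least one minimal pair is required")
    correct = sum(1 for good, bad in pairs if score(good) > score(bad))
    return correct / len(pairs)


# Toy stand-in scorer (assumption): prefers shorter strings. A real scorer would query a model.
toy_pairs = [("a", "bb"), ("ccc", "d")]
print(minimal_pair_accuracy(toy_pairs, lambda s: -len(s)))  # 0.5
```

Running the same function per BLiMP phenomenon and per model width would give the kind of local versus non-local comparison summarised above.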

## Intended Use

These checkpoints are intended as a test-bed for **mechanistic interpretability** research at tractable scale. The clean behavioural dissociation between fluency and coherence provides a well-characterised target for circuit-level analysis.
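
## Weight trajectories

![image](https://cdn-uploads.huggingface.co/production/uploads/6756bb42a7838cf7d7f93e2b/FbcpXYVuN4oi_U8FBdONV.png)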

## Citation

> *Fluency Is Not Coherence: What Small Language Models Actually Learn* (v1)