	---
	language: en
	license: mit
	tags:
	- mechanistic-interpretability
	- tiny-stories
	- gpt-neo
	- narrative-coherence
	- research
	datasets:
	- roneneldan/TinyStories
	base_model: EleutherAI/gpt-neo-125m
	---

	# NSTS: Narrative Structure in Tiny Stories

	Model checkpoints accompanying the paper "Fluency Is Not Coherence: What Small Language Models Actually Learn".

	## Overview

	This repository contains a depth-controlled width series of GPT-Neo models trained from scratch on Flesch-Kincaid (FK) filtered subsets of the [TinyStories](https://huggingface.co/datasets/roneneldan/TinyStories) corpus. The series is designed to study the dissociation between fluency and narrative coherence in small language models.

	The central finding: fluency scales monotonically with model size and training duration, whereas narrative coherence plateaus at a low ceiling (~2.4/5.0) that additional capacity, training, and corpus enrichment all fail to raise. This is neither a capacity problem nor a data problem; it is a property of what next-token cross-entropy prediction selects for.

	## Model Series

	All models use the GPT-Neo architecture with 8 layers and 8 attention heads; only the hidden dimension varies (a depth-controlled width series).

	| Size | Hidden dim | Non-embedding params | Total params |
	|------|------------|----------------------|--------------|
	| 1M   | 64         | ~1.4M                | ~3.6M        |
	| 5M   | 144        | ~5.3M                | ~9.2M        |
	| 10M  | 256        | ~10.7M               | ~19.3M       |
	| 28M  | 480        | ~28.4M               | ~46.5M       |
	| 33M  | 512        | ~33.4M               | ~51.0M       |

	## Training Conditions

	Two FK-filtered subsets of TinyStories, holding vocabulary distribution constant:

	- Condition A: FK grade < 3 — 860K stories, mean sentence length 8.05 words
	- Condition B: FK grade 4–5 — 574K stories, mean sentence length 12.35 words
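
For reference, the standard Flesch-Kincaid grade formula and the two grade bands above can be sketched as follows. This is a minimal illustration, not the filtering code used to build these subsets: `fk_grade` and `assign_condition` are hypothetical names, the word, sentence, and syllable counts are assumed to come from whichever tokeniser and syllable counter you prefer, and treating the 4–5 band as inclusive at both ends is an assumption.

```python
from typing import Optional


def fk_grade(n_words: int, n_sentences: int, n_syllables: int) -> float:
    """Flesch-Kincaid grade level computed from raw counts (standard formula)."""
    if n_words <= 0 or n_sentences <= 0:
        raise ValueError("need at least one word and one sentence")
    words_per_sentence = n_words / n_sentences
    syllables_per_word = n_syllables / n_words
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59


def assign_condition(grade: float) -> Optional[str]:
    """Map an FK grade to a training condition, or None if it fits neither band."""
    if grade < 3:
        return "A"
    if 4 <= grade <= 5:  # assumption: the band is inclusive at both ends
        return "B"
    return None
```

Stories whose grade falls between the two bands (3 to 4) or above 5 belong to neither condition in this sketch.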

	Checkpoints are saved every 100 optimiser steps (17 checkpoints for Cond A, 23 for Cond B).
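
The checkpoint counts follow from the save interval if each run also saves once at its final step. That is an assumption, though it is consistent with the final steps listed under Checkpoint Structure (1680 and 2240). A short sketch, using a hypothetical helper name `checkpoint_steps`:

```python
from typing import List


def checkpoint_steps(final_step: int, interval: int = 100) -> List[int]:
    """Steps at every multiple of `interval`, plus the final step if it is not one."""
    steps = list(range(interval, final_step + 1, interval))
    if not steps or steps[-1] != final_step:
        steps.append(final_step)  # assumption: a last checkpoint is saved at the end of training
    return steps


cond_a_steps = checkpoint_steps(1680)  # 100, 200, ..., 1600, 1680
cond_b_steps = checkpoint_steps(2240)  # 100, 200, ..., 2200, 2240
```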

	## Checkpoint Structure

	```
	nsts_cond{A|B}_{size}/
	    checkpoint-100/
	    checkpoint-200/
	    ...
	    checkpoint-1680/   # Cond A final
	    checkpoint-2240/   # Cond B final
	```

	Note: the 5M directories carry an epoch suffix. The Condition A run is named `nsts_condA_5M_ep2` (2-epoch run) and the Condition B run is named `nsts_condB_5M_ep4` (4-epoch run). These extended-epoch runs still contain checkpoints at steps 1680 (Cond A) and 2240 (Cond B) for direct comparison with the other sizes.
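
The naming rule in the note above can be captured in a small helper. This is only a sketch: `checkpoint_subdir` is a hypothetical name, and it assumes the two 5M runs are the only directories with an epoch suffix.

```python
# Epoch suffixes for the 5M runs, taken from the note above.
_EPOCH_SUFFIX = {("A", "5M"): "_ep2", ("B", "5M"): "_ep4"}


def checkpoint_subdir(condition: str, size: str, step: int) -> str:
    """Repository-relative path of one checkpoint, e.g. nsts_condA_10M/checkpoint-1680."""
    if condition not in ("A", "B"):
        raise ValueError(f"unknown condition: {condition!r}")
    suffix = _EPOCH_SUFFIX.get((condition, size), "")
    return f"nsts_cond{condition}_{size}{suffix}/checkpoint-{step}"
```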

	## Loading a Checkpoint

	```python
	from huggingface_hub import snapshot_download
	from transformers import AutoModelForCausalLM, AutoTokenizer

	# Download only the files for one checkpoint
	subdir = "nsts_condA_10M/checkpoint-1680"
	path = snapshot_download(
	    repo_id="Dan44788/NSTS",
	    allow_patterns=f"{subdir}/*",
	)

	# snapshot_download returns the local snapshot root; the checkpoint sits under it
	model = AutoModelForCausalLM.from_pretrained(f"{path}/{subdir}")
	tokenizer = AutoTokenizer.from_pretrained(f"{path}/{subdir}")
	```
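
To follow one model across training rather than load a single checkpoint, note that `allow_patterns` also accepts a list of glob patterns. The sketch below builds such a list. `trajectory_patterns` and `download_trajectory` are hypothetical helper names, and the run directory and steps in the usage note are examples only.

```python
from typing import Iterable, List


def trajectory_patterns(run_dir: str, steps: Iterable[int]) -> List[str]:
    """One glob pattern per checkpoint directory inside a run."""
    return [f"{run_dir}/checkpoint-{step}/*" for step in steps]


def download_trajectory(
    run_dir: str, steps: Iterable[int], repo_id: str = "Dan44788/NSTS"
) -> str:
    """Download only the requested checkpoints and return the local snapshot path."""
    # Imported lazily so the pattern helper works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=repo_id,
        allow_patterns=trajectory_patterns(run_dir, steps),
    )
```

For instance, `download_trajectory("nsts_condA_10M", [800, 900, 1000])` would fetch the three checkpoints spanning the ~800–1000 step window reported under Key Results.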

	## Key Results

	- Fluency and coherence dissociate sharply at ~800–1000 training steps, a transition point that holds across all model sizes
	- Best coherence score achieved: 2.43/5.0 ("a story that resolves with gaps") at 28M parameters
	- Seven direct interventions (including corpus enrichment, structured narrative injection, causal markers, extended training, and prefix prompting) all produced null results on coherence
	- BLiMP probing reveals the same local/non-local split independently: local grammatical constraints scale with width; non-local constraints are flat at chance

	## Intended Use

	These checkpoints are intended as a test-bed for mechanistic interpretability research at tractable scale. The clean behavioural dissociation between fluency and coherence provides a well-characterised target for circuit-level analysis.

	## Weight Trajectories

	![Figure 1](s81a_heatmap_layer.png)

	## Citation

	> Fluency Is Not Coherence: What Small Language Models Actually Learn (v1)