NSTS: Narrative Structure in Tiny Stories

Model checkpoints accompanying the paper "Fluency Is Not Coherence: What Small Language Models Actually Learn".

Overview

This repository contains a depth-controlled width series of GPT-Neo models trained from scratch on FK-filtered subsets of the TinyStories corpus, designed to study the dissociation between fluency and narrative coherence in small language models.

The central finding: fluency scales monotonically with model size and training duration, whereas narrative coherence plateaus at a low ceiling (~2.4/5.0) that additional capacity, training, and corpus enrichment all fail to raise. This is neither a capacity problem nor a data problem; it is a property of what next-token cross-entropy prediction selects for.

Model Series

All models use the GPT-Neo architecture with 8 layers and 8 attention heads; only the hidden dimension varies (a depth-controlled width series).

Size  Hidden dim  Non-emb params  Total params
1M    64          ~1.4M           ~3.6M
5M    144         ~5.3M           ~9.2M
10M   256         ~10.7M          ~19.3M
28M   480         ~28.4M          ~46.5M
33M   512         ~33.4M          ~51.0M

Training Conditions

Two FK-filtered subsets of TinyStories, holding vocabulary distribution constant:

  • Condition A: FK grade < 3 (860K stories, mean sentence length 8.05 words)
  • Condition B: FK grade 4–5 (574K stories, mean sentence length 12.35 words)
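
The filtering criterion can be illustrated with a short sketch. It assumes "FK" refers to the Flesch-Kincaid grade level, computed with the standard formula 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59. The function names, the treatment of the 4–5 band as inclusive, and the handling of stories that fall in neither band are assumptions made for illustration; this is not the authors' filtering code.

```python
from typing import Optional


def fk_grade(n_words: int, n_sentences: int, n_syllables: int) -> float:
    """Flesch-Kincaid grade level from raw counts (standard formula)."""
    if n_words <= 0 or n_sentences <= 0:
        raise ValueError("need at least one word and one sentence")
    return 0.39 * (n_words / n_sentences) + 11.8 * (n_syllables / n_words) - 15.59


def assign_condition(grade: float) -> Optional[str]:
    """Map a grade to a training condition.

    Assumption: Condition B includes both endpoints of the 4-5 band.
    Stories outside both bands are left unassigned (None).
    """
    if grade < 3:
        return "A"
    if 4 <= grade <= 5:
        return "B"
    return None
```

Counting words, sentences, and syllables requires a separate tokenisation and syllable heuristic, which is deliberately left out of this sketch.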

Checkpoints are saved every 100 optimiser steps (17 checkpoints for Cond A, 23 for Cond B).
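
The stated counts are consistent with one checkpoint per 100 steps plus a final checkpoint at the last training step (1680 for Cond A and 2240 for Cond B, per the directory layout in this card). A minimal sketch under that reading, where `checkpoint_steps` is a hypothetical helper and not part of the repository:

```python
from typing import List


def checkpoint_steps(final_step: int, interval: int = 100) -> List[int]:
    """Steps with a saved checkpoint: every `interval` steps, plus the final step."""
    steps = list(range(interval, final_step + 1, interval))
    # Assumption: the final step is saved even when it is off the regular grid.
    if not steps or steps[-1] != final_step:
        steps.append(final_step)
    return steps


cond_a_steps = checkpoint_steps(1680)  # 100, 200, ..., 1600, 1680
cond_b_steps = checkpoint_steps(2240)  # 100, 200, ..., 2200, 2240
```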

Checkpoint Structure

nsts_cond{A|B}_{size}/
    checkpoint-100/
    checkpoint-200/
    ...
    checkpoint-1680/   # Cond A final
    checkpoint-2240/   # Cond B final

Note: the 5M runs are stored under different directory names. nsts_condA_5M is named nsts_condA_5M_ep2 (2-epoch run) and nsts_condB_5M is named nsts_condB_5M_ep4 (4-epoch run). These extended-epoch runs still contain checkpoints at steps 1680 (Cond A) and 2240 (Cond B) for direct comparison.
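
Because of the naming exception for the 5M runs, it can help to build checkpoint paths through a single helper. The sketch below is illustrative only: `RENAMED_RUNS` and `checkpoint_subfolder` are hypothetical names, and the mapping encodes just the two exceptions stated in the note.

```python
# Directory-name exceptions stated in the note above.
RENAMED_RUNS = {
    ("A", "5M"): "nsts_condA_5M_ep2",
    ("B", "5M"): "nsts_condB_5M_ep4",
}


def checkpoint_subfolder(cond: str, size: str, step: int) -> str:
    """Return the repo-relative folder for one checkpoint."""
    run = RENAMED_RUNS.get((cond, size), f"nsts_cond{cond}_{size}")
    return f"{run}/checkpoint-{step}"
```

Appending `/*` to the returned string gives a pattern suitable for the `allow_patterns` argument of `snapshot_download`.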

Loading a Checkpoint

from huggingface_hub import snapshot_download
from transformers import AutoModelForCausalLM, AutoTokenizer

# Download a specific checkpoint
path = snapshot_download(
    repo_id="Dan44788/NSTS",
    allow_patterns="nsts_condA_10M/checkpoint-1680/*"
)

model = AutoModelForCausalLM.from_pretrained(f"{path}/nsts_condA_10M/checkpoint-1680")
tokenizer = AutoTokenizer.from_pretrained(f"{path}/nsts_condA_10M/checkpoint-1680")

Key Results

  • Fluency and coherence dissociate sharply at ~800–1000 training steps, a transition point that is scale-invariant across all model sizes
  • Best coherence score achieved: 2.43/5.0 ("a story that resolves with gaps") at 28M parameters
  • Seven direct interventions (including corpus enrichment, structured narrative injection, causal markers, extended training, and prefix prompting) all produced null results on coherence
  • BLiMP probing reveals the same local/non-local split independently: local grammatical constraints scale with width; non-local constraints are flat at chance
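
BLiMP is built from minimal pairs, and a model is counted correct on a pair when it assigns the grammatical sentence a higher probability than the ungrammatical one, so chance is 50%. The sketch below shows that scoring rule in isolation. The `score` argument is a placeholder for a function that returns a sentence's total log-probability under a given checkpoint; the exact probing setup used in the paper is not described in this card and may differ.

```python
from typing import Callable, Iterable, Tuple


def minimal_pair_accuracy(
    pairs: Iterable[Tuple[str, str]],
    score: Callable[[str], float],
) -> float:
    """Fraction of (grammatical, ungrammatical) pairs in which the
    grammatical sentence receives the strictly higher score."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("no pairs given")
    correct = sum(score(good) > score(bad) for good, bad in pairs)
    return correct / len(pairs)
```

Running this per BLiMP paradigm and per checkpoint would yield one accuracy curve for each constraint type, which is the kind of comparison the local versus non-local result refers to.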

Intended Use

These checkpoints are intended as a test-bed for mechanistic interpretability research at tractable scale. The clean behavioural dissociation between fluency and coherence provides a well-characterised target for circuit-level analysis.

Weight trajectories

Figure 1

Citation

Fluency Is Not Coherence: What Small Language Models Actually Learn (v1)
