# nanoGPT SLM: 123.8M Parameter Children's Story Generator
A small language model trained entirely from scratch using a custom nanoGPT (GPT-2 small) implementation. Pretrained on the TinyStories dataset to generate short, coherent stories for young children.
## What This Model Does
This model generates short children's stories suitable for 3-5 year olds. Give it the beginning of a story and it will continue writing in simple, age-appropriate language:
**Input:** "Once upon a time there was a little rabbit"

**Output:**

> "Once upon a time there was a little rabbit who lived in a big forest. The rabbit loved to hop and play with his friends. One day, he found a shiny red ball near the river..."
**Capabilities:**
- Generates coherent short stories (100-200 words)
- Uses simple vocabulary appropriate for young children
- Follows common story patterns (characters, conflict, resolution)
- Understands basic narrative structure (beginning, middle, end)
- Can continue from any story opening/prompt
**Limitations:**
- Stories are short (256-token context window)
- Limited to simple vocabulary and narrative structures
- No instruction-following ability (see fine-tuned variants below)
- May occasionally generate repetitive or nonsensical text
## Training Dataset: TinyStories
| Attribute | Value |
|---|---|
| Dataset | TinyStories (Eldan & Li, 2023) |
| Description | Synthetic short stories generated by GPT-3.5/GPT-4, filtered for quality |
| Target audience | Children aged 3-5 years |
| Vocabulary | Words that a typical 3-4 year old would understand |
| Training stories | 2,119,719 |
| Validation stories | 21,990 |
| Total tokens | ~470M |
| Average story length | ~220 tokens |
| Topics | Animals, friendship, family, nature, adventure, sharing, kindness |
The TinyStories dataset was specifically designed to study whether small language models can learn coherent language generation when trained on high-quality, simple text.
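As a quick sanity check (not part of the original card), the table's figures are mutually consistent: multiplying the story count by the average story length recovers roughly the reported token total.

```python
# Consistency check on the dataset statistics quoted in the table above.
stories, avg_len = 2_119_719, 220   # training stories, avg tokens per story
approx_tokens = stories * avg_len
print(f"~{approx_tokens / 1e6:.0f}M tokens")  # ≈ 466M, consistent with the ~470M figure
```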
## Quick Start

**Option 1: Run directly** (downloads the model and generates sample stories from predefined prompts)

```python
# Download nanogpt_slm_pretrained_inference.py into your working directory
!pip install torch tiktoken huggingface_hub
!python nanogpt_slm_pretrained_inference.py
```
**Option 2: Import and use in your own code** to generate your own children's stories

```python
# !pip install torch tiktoken huggingface_hub
# Download nanogpt_slm_pretrained_inference.py into your working directory
from nanogpt_slm_pretrained_inference import tell_story, ask, generate_text

# Method 1: generate a children's story
# story = tell_story("Once upon a time there was a little kitten")
story = tell_story(input("Enter a story prompt (e.g., 'Once upon a time there was a little kitten'): ").strip())
print(story)
print("--------------------")

# Method 2: simple text completion
# print(ask("The friendly dragon lived in"))
print(ask(input("Enter a prompt for text completion (e.g., 'The friendly dragon lived in'): ").strip()))
print("--------------------")

# Method 3: fine-grained control
print(generate_text(
    "A girl named Lily went to the park",  # your desired prompt
    max_tokens=150,    # story length
    temperature=0.8,   # 0.01 = predictable, 0.8 = balanced, 1.5 = creative
    top_k=40,          # sampling diversity
))
print("--------------------")
```
**Load weights manually and inspect the model architecture**

```python
import torch
from huggingface_hub import hf_hub_download

from nanogpt_slm_pretrained_inference import GPT, GPTKV, GPTConfig

model_path = hf_hub_download(
    repo_id="nishantup/nanogpt-slm-tinystories-124m",
    filename="nanogpt_slm_tinystories_best.pth",
)

config = GPTConfig()
model = GPTKV(config)  # KV-cache variant for fast generation
model.load_state_dict(torch.load(model_path, map_location="cpu"))
model.eval()
```
## Model Architecture
| Attribute | Value |
|---|---|
| Architecture | nanoGPT (GPT-2 small, 12 layers, 12 heads, 768 dim) |
| Parameters | 123.8M (unique, with weight tying) |
| Context length | 256 tokens |
| Tokenizer | tiktoken GPT-2 BPE (50,257 tokens) |
| Weight tying | Yes (token embeddings = LM head) |
| Attention | Flash Attention when available, causal mask |
| Normalization | Pre-norm (LayerNorm before attention/MLP) |
| Activation | GELU |
| KV Cache | GPTKV variant included for O(1) per-token decode |
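The 123.8M figure can be re-derived from the table above. A minimal sketch (standard GPT-2-style parameter accounting; with weight tying the LM head adds no parameters of its own):

```python
# Sanity-check the parameter count from the architecture table:
# 12 layers, 768 dim, vocab 50,257, context 256, weight tying.
V, T, D, L = 50_257, 256, 768, 12

wte = V * D                        # token embeddings (tied with lm_head)
wpe = T * D                        # learned positional embeddings
per_layer = (
    D * 3 * D + 3 * D              # c_attn: fused QKV projection + bias
    + D * D + D                    # attention output projection + bias
    + D * 4 * D + 4 * D            # MLP up-projection + bias
    + 4 * D * D + D                # MLP down-projection + bias
    + 2 * (2 * D)                  # two LayerNorms (weight + bias each)
)
final_ln = 2 * D
total = wte + wpe + L * per_layer + final_ln  # lm_head is tied, adds nothing
print(f"{total:,} parameters (~{total / 1e6:.1f}M)")  # 123,849,984 (~123.8M)
```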
## Training Details
| Attribute | Value |
|---|---|
| Hardware | Google Colab Pro (NVIDIA A100 40GB) |
| Iterations | 22,900 |
| Effective batch size | 256 sequences (64 x 4 grad accum) |
| Tokens per step | 65,536 (256 x 256) |
| Total tokens seen | ~1.5B (22,900 × 65,536 ≈ 1,500,774,400) |
| Optimizer | AdamW (lr=6e-4, betas=(0.9, 0.95), wd=0.1) |
| LR schedule | Linear warmup (2000 steps) + cosine decay to 6e-5 |
| Precision | bfloat16 (A100) |
| Gradient clipping | max_norm=1.0 |
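The learning-rate schedule from the table can be sketched in plain Python. The constants come from the table above; the exact warmup and decay formulas in the training script may differ slightly, so treat this as illustrative:

```python
import math

# Linear warmup for 2,000 steps to lr_max = 6e-4,
# then cosine decay to lr_min = 6e-5 over the remaining iterations.
LR_MAX, LR_MIN = 6e-4, 6e-5
WARMUP, MAX_ITERS = 2_000, 22_900

def get_lr(step: int) -> float:
    if step < WARMUP:
        return LR_MAX * (step + 1) / WARMUP            # linear warmup
    progress = (step - WARMUP) / (MAX_ITERS - WARMUP)  # 0 -> 1 over decay phase
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress)) # cosine from 1 -> 0
    return LR_MIN + coeff * (LR_MAX - LR_MIN)

print(get_lr(0), get_lr(1_999), get_lr(22_899))  # ramps to 6e-4, decays to ~6e-5
```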
## Files

| File | Description |
|---|---|
| `nanogpt_slm_tinystories_best.pth` | Pretrained model weights (best validation loss) |
| `nanogpt_slm_pretrained_inference.py` | Standalone inference script with KV cache |
| `config.json` | Model configuration and training details |
## API Reference

### `tell_story(beginning, max_tokens=250, temperature=0.8, top_k=40)`

Generate a children's story from an opening line. Best for creative story generation.

### `ask(prompt, max_tokens=200, temperature=0.8, top_k=40)`

General text completion. Alias for `generate_text()`.

### `generate_text(prompt, max_tokens=200, temperature=0.8, top_k=40)`

Low-level text generation with full parameter control.

| Parameter | Default | Description |
|---|---|---|
| `prompt` / `beginning` | (required) | Text to continue from |
| `max_tokens` | 200 / 250 | Maximum tokens to generate |
| `temperature` | 0.8 | 0.01 = predictable, 0.8 = balanced, 1.5 = wild |
| `top_k` | 40 | Top-k filtering (`None` = no filtering) |
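For intuition, here is a minimal standard-library sketch of what `temperature` and `top_k` do during sampling. The actual script applies the same steps to model logits with PyTorch; this hypothetical `sample` helper is illustrative only:

```python
import math
import random

def sample(logits, temperature=0.8, top_k=40, rng=random):
    """Pick a token index from raw logits via top-k + temperature sampling."""
    # Keep only the top_k largest logits (None disables filtering).
    if top_k is not None:
        cutoff = sorted(logits, reverse=True)[min(top_k, len(logits)) - 1]
        logits = [l if l >= cutoff else float("-inf") for l in logits]
    # Temperature scaling: < 1 sharpens the distribution, > 1 flattens it.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]  # softmax, shifted for stability
    total = sum(exps)
    probs = [e / total for e in exps]
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.5, -3.0]
print(sample(logits, temperature=0.8, top_k=2))  # only indices 0 or 1 survive
```

With `top_k=1` this degenerates to greedy decoding; raising `temperature` spreads probability mass toward the lower-ranked surviving logits.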
## Example Outputs

**Prompt:** "Once upon a time there was a little bear"

> Once upon a time there was a little bear who lived in a big forest. The bear loved to play with his friends. One sunny day, he went for a walk and found a beautiful flower. He picked it up and brought it home to show his mama...

**Prompt:** "The princess looked out her window and saw"

> The princess looked out her window and saw a big rainbow in the sky. She was so happy! She ran outside to get a closer look. A little bird flew down and sat on her hand. "Hello!" said the princess...
## Fine-tuned Variants
| Variant | Type | Repo |
|---|---|---|
| This model | Pretrained (TinyStories) | nishantup/nanogpt-slm-tinystories-124m |
| Instruction-tuned (nanoGPT) | SFT | nishantup/nanogpt-slm-instruct |
| Spam classifier (nanoGPT) | Classification | nishantup/nanogpt-slm-classifier |
| Instruction-tuned (Raschka) | SFT | nishantup/gpt2-slm-instruct |
## Citation

If you use this model, please cite the TinyStories paper:

> Eldan, R., & Li, Y. (2023). TinyStories: How Small Can Language Models Be and Still Speak Coherent English? *arXiv preprint arXiv:2305.07759*.
## Notes
- Trained completely from scratch (no pretrained initialization)
- Uses KV cache (GPTKV) for O(1) per-token decode during inference
- Weight tying between token embeddings (wte) and LM head (lm_head)
- Architecture follows Karpathy's nanoGPT implementation