TinyStories Small Language Model (RoPE + SwiGLU)

A ~30M-parameter language model trained from scratch on the TinyStories dataset.

Architecture

Component	Choice	Why
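Position Encoding	RoPE	Used in LLaMA, Mistral, Gemma; encodes relative position by rotating queries and keys, often credited with better length generalization
Activation	SwiGLU	Used in LLaMA, PaLM; gated feed-forward layer reported to outperform GELU feed-forward layers
Normalization	RMSNorm	Used in LLaMA; skips the mean subtraction and bias of LayerNorm, so it is cheaper to compute
Attention	Flash Attention (PyTorch 2.0+)	Memory-efficient causal attention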
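
To make the RoPE row concrete, here is a minimal pure-Python sketch of the rotation. It is not this model's code. It uses the interleaved-pair convention and base 10000 from the original RoPE formulation, both assumed here, and the helper names `rope_rotate` and `dot` are made up for illustration. The point it shows is that the attention score between a rotated query and key depends only on the distance between their positions.

```python
import math

def rope_rotate(vec, position, base=10000.0):
    """Rotate each consecutive pair of `vec` by an angle proportional to `position`."""
    d = len(vec)
    assert d % 2 == 0, "RoPE needs an even head dimension"
    out = []
    for i in range(0, d, 2):
        # Pair i//2 rotates with frequency base^(-i/d): early pairs rotate fast, late pairs slowly.
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Illustrative query/key vectors for a single head of dimension 4.
q = [0.5, -1.0, 0.25, 2.0]
k = [1.0, 0.3, -0.7, 0.1]

# Same offset (3 tokens apart) at two different absolute positions.
score_near_start = dot(rope_rotate(q, 5), rope_rotate(k, 2))
score_later = dot(rope_rotate(q, 13), rope_rotate(k, 10))
print(score_near_start, score_later)  # the two scores agree up to rounding
```

Because position enters only through these rotations, RoPE adds no learned parameters to the model.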

Parameters

Total: 29.92M
Layers: 6 | Heads: 6 | d_model: 384
Context window: 256 tokens
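
As a sanity check on the total, the arithmetic below reproduces 29.92M from the listed layer count and d_model. It relies on four assumptions that the card does not state: a GPT-2 BPE vocabulary of 50,257 tokens, tied input and output embeddings, no bias terms, and a SwiGLU hidden width of 1024 (about 8/3 × d_model). Treat it as one plausible breakdown, not the confirmed configuration.

```python
# From the model card.
d_model, n_layers = 384, 6

# Assumptions, not stated in the model card.
vocab_size = 50257   # GPT-2 BPE vocabulary (assumed)
ffn_hidden = 1024    # SwiGLU hidden width, about 8/3 * d_model (assumed)
# Also assumed: tied input/output embeddings and no bias terms anywhere.

embedding = vocab_size * d_model        # shared token embedding / output head
attention = 4 * d_model * d_model       # Wq, Wk, Wv, Wo
swiglu = 3 * d_model * ffn_hidden       # gate, up, and down projections
norms = 2 * d_model                     # two RMSNorm gain vectors per block
per_layer = attention + swiglu + norms
final_norm = d_model                    # RMSNorm before the output head

total = embedding + n_layers * per_layer + final_norm
print(f"{total:,} parameters ({total / 1e6:.2f}M)")
# prints: 29,920,512 parameters (29.92M)
```

Under these assumptions roughly two thirds of the parameters sit in the embedding table, which is typical for a model this small with a large vocabulary.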

Training

Dataset: TinyStories (~2.1M short stories)
Optimizer: AdamW (β₁=0.9, β₂=0.95, weight_decay=0.1)
LR Schedule: Linear warmup + Cosine decay
Mixed precision: bfloat16
Best validation loss: 1.6472 (cross-entropy in nats) | Perplexity: 5.2 (exp of the loss)
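
The schedule named above, linear warmup followed by cosine decay, can be sketched as a single function of the step count. The peak and floor learning rates, the warmup length, and the total step count below are placeholder values chosen for illustration. The card does not report them, and `lr_at` is a hypothetical helper name.

```python
import math

def lr_at(step, max_lr=3e-4, min_lr=3e-5, warmup_steps=1000, total_steps=20000):
    """Learning rate at `step`; all default values are illustrative assumptions."""
    if step < warmup_steps:
        # Linear warmup: ramp from max_lr / warmup_steps up to max_lr.
        return max_lr * (step + 1) / warmup_steps
    if step >= total_steps:
        # After the schedule ends, hold the floor value.
        return min_lr
    # Cosine decay from max_lr down to min_lr over the remaining steps.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + coeff * (max_lr - min_lr)

for step in (0, 500, 1000, 10500, 20000):
    print(step, lr_at(step))
```

In a PyTorch training loop, a function like this is usually applied by writing its value into each `param_group["lr"]` of the optimizer before every step.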

Sample Output

Prompt: "Once upon a time there was a little girl named Lily"

[Add sample output after training]

Usage

import torch
import tiktoken
from huggingface_hub import hf_hub_download

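# Download the checkpoint from the Hugging Face Hub (cached locally after the first call)
weights_path = hf_hub_download(repo_id="Manushi0304/tinystories-slm-rope", filename="pytorch_model.bin")
# Read the saved state dict onto the CPU
state_dict = torch.load(weights_path, map_location="cpu")
# Load config and rebuild the model from the training code, then call model.load_state_dict(state_dict)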
