TinyStories Small Language Model (RoPE + SwiGLU)
A ~30M-parameter language model trained from scratch on the TinyStories dataset.
Architecture
| Component | Choice | Why |
|---|---|---|
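| Position Encoding | RoPE | Used in LLaMA, Mistral, Gemma; encodes relative position and tends to generalize better across sequence lengths than learned absolute embeddings |
| Activation | SwiGLU | Used in LLaMA, PaLM; gated feed-forward that typically reaches lower loss than a GELU MLP of similar size |
| Normalization | RMSNorm | Used in LLaMA; cheaper than LayerNorm (no mean subtraction, no bias) |
| Attention | Flash Attention (PyTorch 2.0+) | Memory-efficient causal attention |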
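The components in the table can be sketched in a few lines of PyTorch. This is a minimal illustration, not the code this model was trained with: the class and function names (`RMSNorm`, `SwiGLU`, `apply_rope`), the `eps` and `base` defaults, the feed-forward width of 1024, and the "rotate-half" RoPE layout are all assumptions. Only `d_model=384`, 6 heads (so a head dimension of 64), and causal attention come from this card.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RMSNorm(nn.Module):
    """Rescale by the reciprocal root-mean-square; no mean subtraction, no bias."""

    def __init__(self, dim: int, eps: float = 1e-6):  # eps is an assumed default
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        inv_rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * inv_rms * self.weight


class SwiGLU(nn.Module):
    """Gated feed-forward: down(silu(gate(x)) * up(x))."""

    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden, bias=False)
        self.w_up = nn.Linear(dim, hidden, bias=False)
        self.w_down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))


def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate (batch, heads, seq, head_dim) by position-dependent angles.

    Uses the "rotate-half" layout: dimension i is paired with i + head_dim // 2.
    """
    seq_len, head_dim = x.shape[-2], x.shape[-1]
    half = head_dim // 2
    inv_freq = base ** (-torch.arange(half, dtype=torch.float32) / half)
    positions = torch.arange(seq_len, dtype=torch.float32)
    angles = positions[:, None] * inv_freq[None, :]  # (seq, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)


# Causal attention for 6 heads of size 64 over 8 positions.
torch.manual_seed(0)
q, k, v = (torch.randn(1, 6, 8, 64) for _ in range(3))
out = F.scaled_dot_product_attention(apply_rope(q), apply_rope(k), v, is_causal=True)
print(out.shape)
```

RoPE is applied to queries and keys only, so the attention score between two positions depends on their offset, while values pass through unrotated. `F.scaled_dot_product_attention` may dispatch to a fused Flash Attention kernel on supported GPUs and falls back to other implementations elsewhere.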
Parameters
- Total: 29.92M
- Layers: 6 | Heads: 6 | d_model: 384
- Context window: 256 tokens
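The total can be sanity-checked with a quick count. The card does not state the vocabulary size, the feed-forward width, or whether the output head shares weights with the embedding, so the values below are assumptions: a GPT-2 BPE vocabulary of 50,257 tokens, a SwiGLU hidden width of 1024, tied input/output embeddings, and no bias terms. Under those assumptions the count lands on the stated 29.92M, but other configurations could give a similar figure.

```python
d_model, n_layers = 384, 6   # from this card
vocab_size = 50257           # assumption: GPT-2 BPE vocabulary
ffn_hidden = 1024            # assumption: SwiGLU hidden width

embedding = vocab_size * d_model    # assumed tied with the output head
attention = 4 * d_model * d_model   # Wq, Wk, Wv, Wo without biases
swiglu = 3 * d_model * ffn_hidden   # gate, up, and down projections
norms = 2 * d_model                 # two RMSNorm weight vectors per layer
per_layer = attention + swiglu + norms

total = embedding + n_layers * per_layer + d_model  # plus the final RMSNorm
print(f"{total:,} parameters ({total / 1e6:.2f}M)")
# → 29,920,512 parameters (29.92M)
```

About 19.3M of these parameters sit in the embedding table under this assumption, which leaves roughly 10.6M for the six transformer blocks.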
Training
- Dataset: TinyStories (~2.1M short stories)
- Optimizer: AdamW (β₁=0.9, β₂=0.95, weight_decay=0.1)
- LR Schedule: Linear warmup + Cosine decay
- Mixed precision: bfloat16
- Best validation loss: 1.6472 | Perplexity: 5.2
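The schedule and the loss-to-perplexity relationship listed above can be sketched as follows. The card gives only the shape of the schedule, so the function name `lr_at` and every number in its signature (peak and floor learning rates, warmup length, total steps) are hypothetical placeholders. The perplexity line uses the reported validation loss and assumes it is a mean cross-entropy in nats.

```python
import math


def lr_at(step: int,
          max_lr: float = 3e-4,       # hypothetical peak learning rate
          min_lr: float = 3e-5,       # hypothetical floor
          warmup_steps: int = 1000,   # hypothetical
          total_steps: int = 20000) -> float:  # hypothetical
    """Linear warmup to max_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    if step >= total_steps:
        return min_lr
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 to 0
    return min_lr + cosine * (max_lr - min_lr)


# Perplexity is the exponential of the mean cross-entropy loss.
val_loss = 1.6472
perplexity = math.exp(val_loss)
print(f"perplexity = {perplexity:.1f}")
# → perplexity = 5.2
```

A function like this can be called once per optimizer step to set the learning rate on each parameter group, or wrapped in `torch.optim.lr_scheduler.LambdaLR` as a multiplier on a base rate.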
Sample Output
Prompt: "Once upon a time there was a little girl named Lily"
[Add sample output after training]
Usage
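import torch
import tiktoken
from huggingface_hub import hf_hub_download

# Download the checkpoint from the Hub (cached locally after the first call)
weights_path = hf_hub_download(repo_id="Manushi0304/tinystories-slm-rope", filename="pytorch_model.bin")

# Read the saved weights onto the CPU
state_dict = torch.load(weights_path, map_location="cpu")

# Load config and rebuild the model, then call model.load_state_dict(state_dict)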