title: TinyGPT
emoji: 📖
colorFrom: red
colorTo: pink
sdk: docker
app_port: 7860
pinned: false

TinyGPT

A small GPT-style language model built and trained from scratch to write short children's stories. Implemented in TensorFlow/Keras with a custom transformer (RoPE attention, weight tying, KV-cache), a byte-level BPE tokenizer, and a FastAPI inference server backed by PostgreSQL.



Overview

TinyGPT is a decoder-only transformer trained from scratch, without pretrained weights. Every core component (the tokenizer, attention with rotary embeddings, the training loop, the sampling code, and the KV-cache used for fast generation) is implemented directly.

The model is trained on the TinyStories dataset and learns to generate coherent, grammatically correct short stories from a prompt. The narrow, uniform story domain is what lets a model this small produce genuinely readable output.

Parameters:   ~50M
Architecture: GPT-style decoder-only transformer (GPT-3 inspired)
Positional:   RoPE (rotary position embeddings)
Tokenizer:    Byte-level BPE (HuggingFace tokenizers), 10k vocab
Framework:    TensorFlow 2.x / Keras (mixed bfloat16)
Dataset:      TinyStoriesV2 (noanabeshima/TinyStoriesV2)

Model Architecture

A decoder-only transformer following GPT-2/3 design, with RoPE instead of learned absolute positions.

Hyperparameters

| Parameter | Value |
| --- | --- |
| d_model | 640 |
| num_heads | 10 (head_dim = 64) |
| dff (feed-forward dim) | 2560 |
| num_layers | 10 |
| seq_len | 256 |
| dropout_rate | 0.1 |
| vocab_size | 10000 |

Components

Token Embedding

  • Learned embedding matrix of shape (vocab_size, d_model), initialized N(0, 0.02).
  • Position is handled by RoPE inside attention, so there is no separate positional embedding table.

Transformer Blocks (×10), Pre-LayerNorm (GPT-3 style):

x → LayerNorm → MultiHeadAttention → x + residual
  → LayerNorm → FFN (GELU)          → x + residual

Pre-LN keeps the residual stream clean and stabilizes deep training.

Multi-Head Attention

  • 10 heads, head_dim = 64.
  • Causal masking; RoPE applied to Q and K before the attention scores.
  • Attention-probability dropout (GPT-3 style).
  • KV-cache for fast autoregressive generation.

Feed-Forward Network

  • d_model β†’ dff β†’ d_model, GELU activation.

Initialization (GPT-2/3 scheme)

  • Weights are drawn from N(0, 0.02) (standard deviation 0.02); biases are zero; LayerNorm γ=1, β=0.
  • Residual output projections (attention dense and FFN dense2) use a smaller standard deviation, 0.02 / sqrt(2 · num_layers), for stability.
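
As a quick worked example of the scaled initialization above, the snippet below evaluates the residual-projection standard deviation for this configuration. The variable names are illustrative and not taken from the project code.

```python
import math

num_layers = 10   # from the hyperparameter table
base_std = 0.02   # standard deviation used for ordinary weights

# Each block adds two residual branches (attention and FFN), so the
# residual stream receives 2 * num_layers contributions in total.
residual_std = base_std / math.sqrt(2 * num_layers)

print(f"residual projection std = {residual_std:.5f}")
```

With 10 layers the residual projections start at roughly a quarter of the base standard deviation, which keeps the variance of the residual stream from growing with depth.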

Weight Tying

  • The output projection reuses the token-embedding matrix, saving vocab_size × d_model parameters.

Final LayerNorm before the output projection; logits are cast to float32.

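Parameter Count (~50M)

With d_model=640, dff=2560, num_layers=10, vocab=10000:

| Component | Params |
| --- | --- |
| Token embedding (tied) | 6.4M |
| 10 × transformer block (~4.9M each) | ~49.2M |
| Final LayerNorm | ~0.001M |
| Total | ~55.6M |

Per block ≈ 12 · d_model² (4·d² attention + 8·d² FFN) plus biases and LayerNorm. The rows sum to about 55.6M, so the ~50M figure quoted elsewhere in this README is a round number that is closest to the ten transformer blocks on their own.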
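
The per-component figures can be checked with a few lines of arithmetic. This is a sketch that assumes every dense layer carries a bias and every LayerNorm has a scale and a shift; the function name `count_params` is hypothetical and not part of the project.

```python
def count_params(d_model=640, dff=2560, num_layers=10, vocab_size=10_000):
    """Count trainable parameters for the configuration in this README."""
    # Q, K, V and output projections: four d_model x d_model matrices plus biases.
    attention = 4 * (d_model * d_model + d_model)
    # Two dense layers: d_model -> dff and dff -> d_model, each with a bias.
    ffn = (d_model * dff + dff) + (dff * d_model + d_model)
    # Two LayerNorms per block, each with gamma and beta of size d_model.
    layer_norms = 2 * (2 * d_model)
    block = attention + ffn + layer_norms

    embedding = vocab_size * d_model   # shared with the output projection
    final_ln = 2 * d_model
    total = embedding + num_layers * block + final_ln
    return {"embedding": embedding, "block": block,
            "final_ln": final_ln, "total": total}


counts = count_params()
for name, value in counts.items():
    print(f"{name:>10}: {value / 1e6:.3f}M")
```

Under these assumptions a single block comes to about 4.92M parameters, and the embedding, ten blocks, and final LayerNorm together come to about 55.6M.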


Tokenizer

A byte-level BPE tokenizer (HuggingFace tokenizers, Rust-backed) with a 10k vocabulary, trained on a sample of TinyStories.

  • Byte-level: any Unicode text or whitespace round-trips cleanly; no KeyError on unseen characters.
  • Special tokens: <|endoftext|> (id 0, EOS) and <|unk|> (id 1), reserved at the lowest ids.
  • Fast: encoding the corpus is fast enough that CPU tokenization never starves the GPU during training.
  • Persistence: saved to saved_models/tinystories_tokenizer.json.

A from-scratch pure-Python BPE (BPE_tokenizer) and a CharTokenizer also live in tokenizer.py for reference; the trained model uses HFTokenizer.


Positional Encoding: RoPE

TinyGPT uses Rotary Positional Embeddings (RoPE) instead of learned or sinusoidal absolute encodings.

RoPE rotates Q and K by an angle proportional to the token position m, using a fixed frequency θ for each pair of dimensions:

q_rotated = q · cos(mθ) + rotate_half(q) · sin(mθ)
k_rotated = k · cos(mθ) + rotate_half(k) · sin(mθ)

When q Β· k is computed, the rotations combine to encode the relative distance between tokens. The sin/cos tables are precomputed once for max_len. During cached generation, each new token is rotated at its correct absolute position via a position offset equal to the current cache length.

Advantages: relative position enters the attention dot product directly, sequence length is handled more gracefully than with a fixed position table, and there are no extra parameters. Used in LLaMA, Mistral, Qwen, and most modern LLMs.
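
The relative-position property can be checked numerically. The sketch below is a pure-Python illustration, not the project's layers.py: it assumes the common "split in halves" convention for rotate_half and a frequency base of 10000, neither of which is confirmed by this README.

```python
import math


def rotate_half(x):
    """Map [x1, x2] (two halves) to [-x2, x1]."""
    half = len(x) // 2
    return [-v for v in x[half:]] + x[:half]


def rope(x, pos, base=10000.0):
    """Rotate vector x as if it sat at integer position pos."""
    half = len(x) // 2
    # One frequency per pair (i, i + half); the angle grows linearly with pos.
    angles = [pos * base ** (-i / half) for i in range(half)] * 2
    rotated = rotate_half(x)
    return [a * math.cos(t) + b * math.sin(t)
            for a, b, t in zip(x, rotated, angles)]


def dot(a, b):
    return sum(u * v for u, v in zip(a, b))


q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]

# Both pairs of positions are 3 apart, so the attention scores agree.
near = dot(rope(q, 5), rope(k, 2))
far = dot(rope(q, 13), rope(k, 10))
print(near, far)
```

Because only the offset between positions matters, a cached key rotated at its absolute position stays valid for every later query.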


Training

Optimizer

Adam with GPT-3 hyperparameters:

| Parameter | Value |
| --- | --- |
| beta_1 | 0.9 |
| beta_2 | 0.95 |
| epsilon | 1e-8 |
| clipnorm | 1.0 |

Learning Rate: Warmup + Cosine Decay

Linear warmup over the first 10% of updates to peak_lr = 3e-4, then cosine decay to 0.1 × peak_lr. The schedule is driven by the number of optimizer updates, not micro-batches.
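
The schedule described above can be written as a small function of the update index. This is a minimal sketch: the name `lr_at` and the choice to count updates from zero are assumptions, not taken from model.py.

```python
import math


def lr_at(update, total_updates, peak_lr=3e-4, warmup_frac=0.1, floor_frac=0.1):
    """Learning rate for a given optimizer update (0-based)."""
    warmup = max(1, int(warmup_frac * total_updates))
    if update < warmup:
        # Linear ramp that reaches peak_lr on the last warmup update.
        return peak_lr * (update + 1) / warmup
    # Cosine decay from peak_lr down to floor_frac * peak_lr.
    progress = min(1.0, (update - warmup) / max(1, total_updates - warmup))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    min_lr = floor_frac * peak_lr
    return min_lr + (peak_lr - min_lr) * cosine


for u in (0, 99, 100, 550, 1000):
    print(u, f"{lr_at(u, total_updates=1000):.2e}")
```

With gradient accumulation, `update` advances once per applied optimizer step, so the schedule is independent of the micro-batch size.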

Mixed Precision (bfloat16)

mixed_bfloat16 is used. bf16 has the same exponent range as float32, so it does not require loss scaling (unlike float16). This avoids the LossScaleOptimizer machinery entirely and is stable on Ampere GPUs (RTX 3050+).

Gradient Accumulation

To fit the model on a 6GB GPU while keeping a useful effective batch size:

micro_batch_size = 4
accum_steps      = 4
effective_batch  = 16

Gradients are summed over accum_steps micro-batches, then applied as a single update.
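
The idea can be illustrated on a toy one-parameter model. The sketch below is framework-free and makes one assumption the README does not state: the summed gradient is divided by accum_steps so that its scale matches a single batch of 16. All names are illustrative.

```python
def grad_mse(w, batch):
    """Gradient of mean squared error for the toy model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w, data, micro_batch_size=4, accum_steps=4):
    """Sum gradients over accum_steps micro-batches, then average them."""
    total = 0.0
    for step in range(accum_steps):
        start = step * micro_batch_size
        micro = data[start:start + micro_batch_size]
        total += grad_mse(w, micro)   # accumulate, do not update yet
    # Assumption: average so the result equals the effective-batch gradient.
    return total / accum_steps


data = [(float(i), 3.0 * i + 1.0) for i in range(16)]
w = 0.5
print(accumulated_grad(w, data), grad_mse(w, data))
```

Both numbers agree, which is why four micro-batches of 4 behave like one batch of 16 while only ever holding 4 examples in memory.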

Streaming Data Pipeline

Data is streamed and tokenized on the fly (no giant pre-tokenized array on disk):

HF streaming dataset → text docs → BPE tokens + <|endoftext|> boundaries
  → packed into seq_len chunks → tf.data → batched → prefetched

The streamer retries on transient network errors and can loop the dataset to meet a token budget larger than the dataset itself.
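
The packing step of this pipeline can be sketched as a generator over already-tokenized documents. This is an illustration only: `pack_tokens` is a hypothetical name, the trailing partial chunk is simply dropped, and the real pipeline may use a different chunk length to provide shifted targets.

```python
from typing import Iterable, Iterator, List

EOS_ID = 0  # <|endoftext|>, id 0 per the Tokenizer section


def pack_tokens(docs: Iterable[List[int]], seq_len: int) -> Iterator[List[int]]:
    """Concatenate documents with EOS boundaries and emit fixed-size chunks."""
    buffer: List[int] = []
    for doc in docs:
        buffer.extend(doc)
        buffer.append(EOS_ID)          # mark the document boundary
        while len(buffer) >= seq_len:
            yield buffer[:seq_len]
            buffer = buffer[seq_len:]  # keep the remainder for the next chunk


docs = [[5, 6, 7], [8, 9], [10, 11, 12, 13]]
chunks = list(pack_tokens(docs, seq_len=4))
print(chunks)
```

Because it is a generator, the function works equally well on a streamed dataset: no more than one chunk plus one document needs to be held in memory.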

Checkpointing & Resume

  • Validation loss is checked every 1000 updates on an in-memory held-out set; best weights are saved to saved_models/tinystories_model.weights.h5.
  • A tf.train.Checkpoint (model + optimizer + step counter) is saved to saved_models/ckpt_tinystories/, so a crashed run resumes from the last checkpoint.

Dataset

TinyStoriesV2 is a collection of short children's stories generated by GPT-4, designed for training and evaluating small language models.

| Property | Value |
| --- | --- |
| Source | noanabeshima/TinyStoriesV2 |
| Stories | ~2.75M |
| Approx tokens | ~520–580M |
| Sequence length | 256 |

Why TinyStories?

Small, uniform vocabulary; clear narrative structure; short sentences. A ~50M model can actually master this narrow distribution, which is why the output is coherent. Trained on broad web text instead, a model this size produces fluent but meaningless text; the narrow domain is the point.


Inference & Sampling

Three sampling strategies are implemented:

  • Temperature: divides the logits by the temperature before softmax. Lower = safer/more repetitive; higher = more random/creative.
  • Top-K: keep only the K highest-probability tokens.
  • Top-P (nucleus): keep the smallest set of tokens whose cumulative probability exceeds p.

Generation stops early when the model emits <|endoftext|>.
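
The three strategies compose naturally into one filtering step followed by a random draw. The following is a pure-Python sketch under stated assumptions: the function names are hypothetical, top-k is applied before top-p, and the nucleus is cut as soon as the cumulative probability reaches p. The project's own sampling code may differ in these details.

```python
import math
import random


def sampling_distribution(logits, temperature=1.0, top_k=None, top_p=None):
    """Return {token_id: probability} after temperature, top-k and top-p."""
    scaled = [l / temperature for l in logits]
    peak = max(scaled)                       # subtract max for stability
    exps = [math.exp(s - peak) for s in scaled]
    z = sum(exps)
    probs = [e / z for e in exps]

    # Token ids ordered from most to least likely.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k is not None:
        order = order[:top_k]
    if top_p is not None:
        kept, cumulative = [], 0.0
        for i in order:
            kept.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:          # smallest set reaching p
                break
        order = kept

    total = sum(probs[i] for i in order)     # renormalize the survivors
    return {i: probs[i] / total for i in order}


def sample(logits, rng=random, **kwargs):
    """Draw one token id from the filtered distribution."""
    dist = sampling_distribution(logits, **kwargs)
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids], k=1)[0]


logits = [2.0, 1.0, 0.0, -1.0]
print(sampling_distribution(logits, top_k=2))
print(sampling_distribution(logits, top_p=0.9))
```

In a generation loop, `sample` would be called once per step and the loop would stop when it returns the <|endoftext|> id.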


KV-Cache

Autoregressive generation uses a KV-cache so each new token reuses the Keys/Values computed for all previous tokens instead of recomputing the whole sequence.

Flow:

  1. Prefill: the prompt is processed once; each layer returns its K/V (rotated by RoPE at absolute positions).
  2. Decode: each new token is processed alone; its Q and K are rotated at offset = current_cache_length, its K/V are appended to the cached K/V, and it attends over the full history.
  3. Caches are passed in and returned from each call (not mutated in place), so they persist correctly across steps.

This makes generation O(n) per step instead of O(n²), and cached vs full-recompute outputs match exactly.
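
The flow above can be illustrated with a toy single-head attention in pure Python. This is a deliberately simplified sketch, not the code in layers.py: it uses identity projections (each token vector serves as its own query, key, and value), omits RoPE, and all names are hypothetical.

```python
import math


def attend(q, keys, values):
    """Softmax attention of one query over the given keys and values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    z = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / z
            for j in range(dim)]


def step(x, cache):
    """Process one token; return its output and a NEW cache."""
    keys, values = cache
    # List concatenation builds new lists, so the caller's cache is untouched.
    keys, values = keys + [x], values + [x]
    return attend(x, keys, values), (keys, values)


def full_recompute(tokens):
    """Causal attention output for the last token, computed from scratch."""
    return attend(tokens[-1], tokens, tokens)


tokens = [[1.0, 0.0], [0.5, 0.5], [0.0, 2.0]]
cache = ([], [])
for i, token in enumerate(tokens):
    out, cache = step(token, cache)
    print(out, full_recompute(tokens[: i + 1]))
```

Each call to `step` touches only the new token plus the stored keys and values, which is where the per-step saving comes from.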


API Server

FastAPI server exposing the model as a REST API.

Endpoints

| Method | Route | Description |
| --- | --- | --- |
| GET | / | Serves the frontend UI |
| GET | /health | Model status, parameter count, vocab size |
| GET | /creativity-levels | The named creativity presets with explanations |
| POST | /generate | Generate a story from a prompt |
| GET | /history | Retrieve past generations |
| DELETE | /history/{id} | Delete a stored generation |

Generate Request

{
  "prompt": "Once upon a time there was a little dragon",
  "creativity": "balanced",
  "max_new_tokens": 200
}

creativity is one of predictable | balanced | creative | wild (see below). Advanced callers may instead pass raw temperature and top_p, which override the preset.

Generate Response

{
  "prompt": "Once upon a time there was a little dragon",
  "generated_text": "Once upon a time there was a little dragon ...",
  "creativity": "balanced",
  "creativity_description": "A good mix of sense and surprise ...",
  "temperature": 0.8,
  "top_p": 0.9,
  "max_new_tokens": 200,
  "response_time_ms": 812.4
}
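
The request and response shapes above can be exercised from Python with only the standard library. This is a sketch: the helper names are hypothetical, and the base URL assumes the server is running locally on port 8000 as described in the setup sections.

```python
import json
import urllib.request


def build_generate_request(prompt, creativity="balanced", max_new_tokens=200,
                           base_url="http://localhost:8000"):
    """Build (but do not send) a POST /generate request."""
    payload = {
        "prompt": prompt,
        "creativity": creativity,
        "max_new_tokens": max_new_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate_story(prompt, **kwargs):
    """Send the request and return the generated text (needs a live server)."""
    request = build_generate_request(prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        return json.load(response)["generated_text"]
```

Calling `generate_story("Once upon a time there was a little dragon")` against a running server would return the `generated_text` field of the response.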

Creativity Levels

Rather than asking users to guess what "temperature" means, the API exposes named levels, each with a plain-language description of how it changes the story. Fetch them from /creativity-levels:

| Level | temperature | top_p | What it does to your stories |
| --- | --- | --- | --- |
| Predictable | 0.6 | 0.85 | Safe and focused. Simple, calm, easy-to-follow tales. |
| Balanced (default) | 0.8 | 0.9 | A good mix of sense and surprise. Recommended for most prompts. |
| Creative | 1.0 | 0.95 | More imaginative and varied, with the odd unexpected twist. |
| Wild | 1.3 | 1.0 | Unpredictable and quirky. Fun, but may wander or stop making sense. |

In short: lower = safer and more repetitive, higher = more creative and more random.
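
One way to picture how a preset and the optional raw overrides combine is a small lookup function. This is a hypothetical sketch, not the code in server.py: the preset values come from the table above, while the function name and the case-insensitive lookup are assumptions.

```python
# (temperature, top_p) per named level, copied from the table above.
PRESETS = {
    "predictable": (0.6, 0.85),
    "balanced": (0.8, 0.9),
    "creative": (1.0, 0.95),
    "wild": (1.3, 1.0),
}


def resolve_sampling(creativity="balanced", temperature=None, top_p=None):
    """Return the effective (temperature, top_p) for a request."""
    preset_temperature, preset_top_p = PRESETS[creativity.lower()]
    # Raw values, when supplied, take precedence over the preset.
    return (
        preset_temperature if temperature is None else temperature,
        preset_top_p if top_p is None else top_p,
    )


print(resolve_sampling())                        # preset only
print(resolve_sampling("wild", top_p=0.9))       # override one value
```

The resolved pair corresponds to the effective temperature and top_p that the response echoes back.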


Database

PostgreSQL stores every generation for monitoring and future data collection.

Schema: generations

| Column | Type | Description |
| --- | --- | --- |
| id | Integer (PK) | Auto-increment |
| prompt | Text | Input prompt |
| generated_text | Text | Model output |
| temperature | Float | Effective sampling temperature |
| top_p | Float | Effective nucleus threshold |
| max_new_tokens | Integer | Generation length |
| response_time_ms | Float | Latency in ms |
| created_at | DateTime (indexed) | Set by PostgreSQL |

Stack: FastAPI → SQLAlchemy ORM → psycopg2 → PostgreSQL. Tables are created on startup.


Project Structure

TinyGPT/
├── transformer_model/
│   ├── model.py          # GPT model, training loop, sampling, generation, data pipeline
│   ├── layers.py         # Attention, RoPE, TransformerBlock, KV-cache
│   ├── tokenizer.py      # HFTokenizer (used) + BPE_tokenizer / CharTokenizer (reference)
│   └── generation.py     # Standalone generation script
├── app/
│   ├── server.py         # FastAPI inference server
│   ├── crud.py           # DB read/write
│   ├── database.py       # SQLAlchemy engine/session
│   └── models_db.py      # generations table
├── saved_models/
│   ├── tinystories_tokenizer.json
│   ├── tinystories_model.weights.h5
│   └── ckpt_tinystories/        # resumable training checkpoints
├── index.html            # Frontend UI
├── requirements.txt          # full deps (training + serving, GPU)
├── requirements-serve.txt    # slim serving-only deps (CPU)
├── Dockerfile
├── .dockerignore
└── docker-compose.yml

Docker

The container serves the model on CPU (no GPU needed for inference).

# build + run app and PostgreSQL together
POSTGRES_PASSWORD=yourpassword docker compose up --build

  • App: http://localhost:8000/
  • The image installs only requirements-serve.txt (CPU TensorFlow, no CUDA/training packages) and copies only the TinyStories tokenizer + weights; training checkpoints are excluded via .dockerignore.
  • DB credentials come from env vars: POSTGRES_USER, POSTGRES_PASSWORD, POSTGRES_DB (defaults: avi / changeme / tiny_gpt).

Setup & Installation

Prerequisites

  • Python 3.10
  • (Optional) CUDA-capable GPU. Tested on an RTX 3050 6GB under WSL2.
  • PostgreSQL 14

Install

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

For GPU training/inference, requirements.txt pins tensorflow[and-cuda]==2.21.0, which bundles the matching CUDA/cuDNN wheels.

PostgreSQL

sudo apt install postgresql postgresql-contrib
sudo service postgresql start
sudo -u postgres psql -c "CREATE DATABASE tiny_gpt;"

Set the connection string via env var:

export DATABASE_URL="postgresql://user:password@localhost:5432/tiny_gpt"

Running the Server

source .venv/bin/activate
uvicorn app.server:app --port 8000

  • UI: http://localhost:8000/
  • Interactive API docs: http://localhost:8000/docs

To force CPU inference: export CUDA_VISIBLE_DEVICES=-1 before launching.


Training From Scratch

cd transformer_model
python3 model.py

On the first run, the script trains and saves the tokenizer, then begins streaming TinyStoriesV2 and training. Progress is printed every 50 updates, with validation every 1000 updates. Best weights and resumable checkpoints are written to saved_models/. Re-running resumes from the last checkpoint.

Keep the process alive for long runs (tmux/nohup) and ensure the machine does not sleep.


Results

Trained on TinyStoriesV2 with the ~50M configuration, validation cross-entropy descends into the ~1.4 range and below, producing coherent short stories with consistent characters and a beginning/middle/end. Example (balanced creativity):

Prompt: "once upon a time there was a donkey named avi"

...She was three years old and loved to explore the world around her. One day she found a dull, old box in the garage. She was curious and wanted to open it... a voice said, "Don't worry, I can help you." It was her brother, Sam... Inside was a big, shiny ball... The moral of this story is that sometimes it's important to be curious.

Generation quality continues to improve as validation loss decreases over training.