title: TinyGPT
emoji: π
colorFrom: red
colorTo: pink
sdk: docker
app_port: 7860
pinned: false
TinyGPT
A small GPT-style language model built and trained from scratch to write short children's stories. Implemented in TensorFlow/Keras with a custom transformer (RoPE attention, weight tying, KV-cache), a byte-level BPE tokenizer, and a FastAPI inference server backed by PostgreSQL.
Table of Contents
- Overview
- Model Architecture
- Tokenizer
- Positional Encoding β RoPE
- Training
- Dataset
- Inference & Sampling
- KV-Cache
- API Server
- Creativity Levels
- Database
- Project Structure
- Setup & Installation
- Running the Server
- Training From Scratch
- Results
Overview
TinyGPT is a decoder-only transformer trained from scratch, without pretrained weights. Every core component β the tokenizer, attention with rotary embeddings, the training loop, the sampling code, and the KV-cache used for fast generation β is implemented directly.
The model is trained on the TinyStories dataset and learns to generate coherent, grammatically correct short stories from a prompt. The narrow, uniform story domain is what lets a model this small produce genuinely readable output.
Parameters: ~50M
Architecture: GPT-style decoder-only transformer (GPT-3 inspired)
Positional: RoPE (rotary position embeddings)
Tokenizer: Byte-level BPE (HuggingFace tokenizers), 10k vocab
Framework: TensorFlow 2.x / Keras (mixed bfloat16)
Dataset: TinyStoriesV2 (noanabeshima/TinyStoriesV2)
Model Architecture
A decoder-only transformer following GPT-2/3 design, with RoPE instead of learned absolute positions.
Hyperparameters
| Parameter | Value |
|---|---|
d_model |
640 |
num_heads |
10 (head_dim = 64) |
dff (feed-forward dim) |
2560 |
num_layers |
10 |
seq_len |
256 |
dropout_rate |
0.1 |
vocab_size |
10000 |
Components
Token Embedding
- Learned embedding matrix of shape
(vocab_size, d_model), initializedN(0, 0.02). - Position is handled by RoPE inside attention, so there is no separate positional embedding table.
Transformer Blocks (Γ10) β Pre-LayerNorm (GPT-3 style):
x β LayerNorm β MultiHeadAttention β x + residual
β LayerNorm β FFN (GELU) β x + residual
Pre-LN keeps the residual stream clean and stabilizes deep training.
Multi-Head Attention
- 10 heads,
head_dim = 64. - Causal masking; RoPE applied to Q and K before the attention scores.
- Attention-probability dropout (GPT-3 style).
- KV-cache for fast autoregressive generation.
Feed-Forward Network
d_model β dff β d_model, GELU activation.
Initialization (GPT-2/3 scheme)
- Weights
N(0, 0.02); biases zero; LayerNorm Ξ³=1, Ξ²=0. - Residual output projections (attention
denseand FFNdense2) scaled by0.02 / sqrt(2 Β· num_layers)for stability.
Weight Tying
- The output projection reuses the token-embedding matrix, saving
vocab_size Γ d_modelparameters.
Final LayerNorm before the output projection; logits are cast to float32.
Parameter Count (~50M)
With d_model=640, dff=2560, num_layers=10, vocab=10000:
| Component | Params |
|---|---|
| Token embedding (tied) | 6.4M |
| 10 Γ transformer block (~4.9M each) | ~49M |
| Final LayerNorm | ~0.001M |
| Total | ~50.5M |
Per block β 12 Β· d_modelΒ² (4Β·dΒ² attention + 8Β·dΒ² FFN) plus biases and LayerNorm.
Tokenizer
A byte-level BPE tokenizer (HuggingFace tokenizers, Rust-backed) with a 10k vocabulary, trained on a sample of TinyStories.
- Byte-level: any unicode/whitespace round-trips cleanly; no
KeyErroron unseen characters. - Special tokens:
<|endoftext|>(id 0, EOS) and<|unk|>(id 1), reserved at the lowest ids. - Fast: encoding the corpus is fast enough that CPU tokenization never starves the GPU during training.
- Persistence: saved to
saved_models/tinystories_tokenizer.json.
A from-scratch pure-Python BPE (BPE_tokenizer) and a CharTokenizer also live in tokenizer.py for reference; the trained model uses HFTokenizer.
Positional Encoding β RoPE
TinyGPT uses Rotary Positional Embeddings (RoPE) instead of learned or sinusoidal absolute encodings.
RoPE rotates Q and K by an angle proportional to token position:
q_rotated = q Β· cos(mΞΈ) + rotate_half(q) Β· sin(mΞΈ)
k_rotated = k Β· cos(mΞΈ) + rotate_half(k) Β· sin(mΞΈ)
When q Β· k is computed, the rotations combine to encode the relative distance between tokens. The sin/cos tables are precomputed once for max_len. During cached generation, each new token is rotated at its correct absolute position via a position offset equal to the current cache length.
Advantages: relative position directly in the attention dot product, better length behavior, no extra parameters. Used in LLaMA, Mistral, Qwen, and most modern LLMs.
Training
Optimizer
Adam with GPT-3 hyperparameters:
| Parameter | Value |
|---|---|
beta_1 |
0.9 |
beta_2 |
0.95 |
epsilon |
1e-8 |
clipnorm |
1.0 |
Learning Rate β Warmup + Cosine Decay
Linear warmup over the first 10% of updates to peak_lr = 3e-4, then cosine decay to 0.1 Γ peak_lr. The schedule is driven by the number of optimizer updates, not micro-batches.
Mixed Precision (bfloat16)
mixed_bfloat16 is used. bf16 has the same exponent range as float32, so it does not require loss scaling (unlike float16). This avoids the LossScaleOptimizer machinery entirely and is stable on Ampere GPUs (RTX 3050+).
Gradient Accumulation
To fit the model on a 6GB GPU while keeping a useful effective batch size:
micro_batch_size = 4
accum_steps = 4
effective_batch = 16
Gradients are summed over accum_steps micro-batches, then applied as a single update.
Streaming Data Pipeline
Data is streamed and tokenized on the fly (no giant pre-tokenized array on disk):
HF streaming dataset β text docs β BPE tokens + <|endoftext|> boundaries
β packed into seq_len chunks β tf.data β batched β prefetched
The streamer retries on transient network errors and can loop the dataset to meet a token budget larger than the dataset itself.
Checkpointing & Resume
- Validation loss is checked every 1000 updates on an in-memory held-out set; best weights are saved to
saved_models/tinystories_model.weights.h5. - A
tf.train.Checkpoint(model + optimizer + step counter) is saved tosaved_models/ckpt_tinystories/, so a crashed run resumes from the last checkpoint.
Dataset
TinyStoriesV2 β short children's stories generated by GPT-4, designed for training and evaluating small language models.
| Property | Value |
|---|---|
| Source | noanabeshima/TinyStoriesV2 |
| Stories | ~2.75M |
| Approx tokens | ~520β580M |
| Sequence length | 256 |
Why TinyStories?
Small, uniform vocabulary; clear narrative structure; short sentences. A ~50M model can actually master this narrow distribution, which is why the output is coherent. Trained on broad web text instead, a model this size produces fluent but meaningless text β the narrow domain is the point.
Inference & Sampling
Three sampling strategies are implemented:
- Temperature β divides logits before softmax. Lower = safer/more repetitive; higher = more random/creative.
- Top-K β keep only the K highest-probability tokens.
- Top-P (nucleus) β keep the smallest set of tokens whose cumulative probability exceeds
p.
Generation stops early when the model emits <|endoftext|>.
KV-Cache
Autoregressive generation uses a KV-cache so each new token reuses the Keys/Values computed for all previous tokens instead of recomputing the whole sequence.
Flow:
- Prefill: the prompt is processed once; each layer returns its K/V (rotated by RoPE at absolute positions).
- Decode: each new token is processed alone; its Q/K is rotated at
offset = current_cache_length, concatenated onto the cached K/V, and attends over the full history. - Caches are passed in and returned from each
call(not mutated in place), so they persist correctly across steps.
This makes generation O(n) per step instead of O(nΒ²), and cached vs full-recompute outputs match exactly.
API Server
FastAPI server exposing the model as a REST API.
Endpoints
| Method | Route | Description |
|---|---|---|
GET |
/ |
Serves the frontend UI |
GET |
/health |
Model status, parameter count, vocab size |
GET |
/creativity-levels |
The named creativity presets with explanations |
POST |
/generate |
Generate a story from a prompt |
GET |
/history |
Retrieve past generations |
DELETE |
/history/{id} |
Delete a stored generation |
Generate Request
{
"prompt": "Once upon a time there was a little dragon",
"creativity": "balanced",
"max_new_tokens": 200
}
creativity is one of predictable | balanced | creative | wild (see below). Advanced callers may instead pass raw temperature and top_p, which override the preset.
Generate Response
{
"prompt": "Once upon a time there was a little dragon",
"generated_text": "Once upon a time there was a little dragon ...",
"creativity": "balanced",
"creativity_description": "A good mix of sense and surprise ...",
"temperature": 0.8,
"top_p": 0.9,
"max_new_tokens": 200,
"response_time_ms": 812.4
}
Creativity Levels
Rather than asking users to guess what "temperature" means, the API exposes named levels, each with a plain-language description of how it changes the story. Fetch them from /creativity-levels:
| Level | temperature | top_p | What it does to your stories |
|---|---|---|---|
| Predictable | 0.6 | 0.85 | Safe and focused. Simple, calm, easy-to-follow tales. |
| Balanced (default) | 0.8 | 0.9 | A good mix of sense and surprise. Recommended for most prompts. |
| Creative | 1.0 | 0.95 | More imaginative and varied, with the odd unexpected twist. |
| Wild | 1.3 | 1.0 | Unpredictable and quirky. Fun, but may wander or stop making sense. |
In short: lower = safer and more repetitive, higher = more creative and more random.
Database
PostgreSQL stores every generation for monitoring and future data collection.
Schema β generations
| Column | Type | Description |
|---|---|---|
id |
Integer (PK) | Auto-increment |
prompt |
Text | Input prompt |
generated_text |
Text | Model output |
temperature |
Float | Effective sampling temperature |
top_p |
Float | Effective nucleus threshold |
max_new_tokens |
Integer | Generation length |
response_time_ms |
Float | Latency in ms |
created_at |
DateTime (indexed) | Set by PostgreSQL |
Stack: FastAPI β SQLAlchemy ORM β psycopg2 β PostgreSQL. Tables are created on startup.
Project Structure
TinyGPT/
βββ transformer_model/
β βββ model.py # GPT model, training loop, sampling, generation, data pipeline
β βββ layers.py # Attention, RoPE, TransformerBlock, KV-cache
β βββ tokenizer.py # HFTokenizer (used) + BPE_tokenizer / CharTokenizer (reference)
β βββ generation.py # Standalone generation script
βββ app/
β βββ server.py # FastAPI inference server
β βββ crud.py # DB read/write
β βββ database.py # SQLAlchemy engine/session
β βββ models_db.py # generations table
βββ saved_models/
β βββ tinystories_tokenizer.json
β βββ tinystories_model.weights.h5
β βββ ckpt_tinystories/ # resumable training checkpoints
βββ index.html # Frontend UI
βββ requirements.txt # full deps (training + serving, GPU)
βββ requirements-serve.txt # slim serving-only deps (CPU)
βββ Dockerfile
βββ .dockerignore
βββ docker-compose.yml
Docker
The container serves the model on CPU (no GPU needed for inference).
# build + run app and PostgreSQL together
POSTGRES_PASSWORD=yourpassword docker compose up --build
- App:
http://localhost:8000/ - The image installs only
requirements-serve.txt(CPU TensorFlow, no CUDA/training packages) and copies only the TinyStories tokenizer + weights β training checkpoints are excluded via.dockerignore. - DB credentials come from env vars:
POSTGRES_USER,POSTGRES_PASSWORD,POSTGRES_DB(defaults:avi/changeme/tiny_gpt).
Setup & Installation
Prerequisites
- Python 3.10
- (Optional) CUDA-capable GPU. Tested on an RTX 3050 6GB under WSL2.
- PostgreSQL 14
Install
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
For GPU training/inference, requirements.txt pins tensorflow[and-cuda]==2.21.0, which bundles the matching CUDA/cuDNN wheels.
PostgreSQL
sudo apt install postgresql postgresql-contrib
sudo service postgresql start
sudo -u postgres psql -c "CREATE DATABASE tiny_gpt;"
Set the connection string via env var:
export DATABASE_URL="postgresql://user:password@localhost:5432/tiny_gpt"
Running the Server
source .venv/bin/activate
uvicorn app.server:app --port 8000
- UI:
http://localhost:8000/ - Interactive API docs:
http://localhost:8000/docs
To force CPU inference: export CUDA_VISIBLE_DEVICES=-1 before launching.
Training From Scratch
cd transformer_model
python3 model.py
On the first run it trains and saves the tokenizer, then begins streaming TinyStoriesV2 and training. Progress prints per 50 updates, with validation every 1000. Best weights and resumable checkpoints are written to saved_models/. Re-running resumes from the last checkpoint.
Keep the process alive for long runs (tmux/nohup) and ensure the machine does not sleep.
Results
Trained on TinyStoriesV2 with the ~50M configuration, validation cross-entropy descends into the ~1.4 range and below, producing coherent short stories with consistent characters and a beginning/middle/end. Example (balanced creativity):
Prompt: "once upon a time there was a donkey named avi"
...She was three years old and loved to explore the world around her. One day she found a dull, old box in the garage. She was curious and wanted to open it... a voice said, "Don't worry, I can help you." It was her brother, Sam... Inside was a big, shiny ball... The moral of this story is that sometimes it's important to be curious.
Generation quality continues to improve as validation loss decreases over training.