# Transformer from Scratch
A complete, annotated implementation of the Transformer architecture from "Attention Is All You Need" (Vaswani et al., 2017) in pure PyTorch. No external dependencies beyond torch.
## Files
| File | Description |
|---|---|
| `transformer.py` | Complete Transformer implementation (~900 lines, heavily commented) |
| `train_copy_task.py` | Training demo on a copy task (proves the model learns) |
## Architecture
Every component is implemented from scratch following the original paper (arXiv: 1706.03762):
The model is the standard encoder-decoder stack:

- **Encoder** (×N layers): self-attention + LayerNorm, then a position-wise FFN + LayerNorm. Source tokens enter as embeddings scaled by √d_model plus positional encodings.
- **Decoder** (×N layers): masked self-attention + LayerNorm, cross-attention over the encoder output + LayerNorm, then FFN + LayerNorm. Target embeddings get the same √d_model scaling and positional encodings, and a final output projection maps decoder states to vocabulary logits.
## Components
### 1. Scaled Dot-Product Attention
Attention(Q, K, V) = softmax(Q·Kᵀ / √d_k) · V
- Additive masking (`-inf` before softmax) for padding and causal constraints
- Scaling the scores by 1/√d_k prevents softmax saturation
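A minimal, standalone sketch of this function (the version in `transformer.py` may differ in details such as dropout on the attention weights):

```python
import math
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k, v: [batch, heads, seq_len, d_k]; mask is additive (0 or -inf),
    # broadcastable to the score shape.
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))  # [batch, heads, q_len, k_len]
    if mask is not None:
        scores = scores + mask           # masked positions become -inf
    weights = F.softmax(scores, dim=-1)  # ... and receive ~0 probability here
    return weights @ v, weights
```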
### 2. Multi-Head Attention
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) · W_O
- Single big projection + reshape (efficient, equivalent to h separate projections)
- Three uses: encoder self-attention, decoder masked self-attention, cross-attention
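A sketch of the "single big projection + reshape" trick (a hypothetical module, not necessarily the exact class in `transformer.py`; it reuses the attention sketch above):

```python
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_k = n_heads, d_model // n_heads
        # One d_model -> d_model projection per input is equivalent to
        # h separate d_model -> d_k projections, just batched together.
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, q, k, v, mask=None):
        b = q.size(0)
        def split(x, proj):  # [b, len, d_model] -> [b, heads, len, d_k]
            return proj(x).view(b, -1, self.n_heads, self.d_k).transpose(1, 2)
        q, k, v = split(q, self.w_q), split(k, self.w_k), split(v, self.w_v)
        out, _ = scaled_dot_product_attention(q, k, v, mask)
        out = out.transpose(1, 2).contiguous().view(b, -1, self.n_heads * self.d_k)
        return self.w_o(out)
```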
### 3. Positional Encoding
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
- Fixed sinusoidal (not learned), registered as buffer
- Added to scaled embeddings
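A sketch of how the fixed table can be built and registered as a buffer (`max_len` and the shapes are assumptions; the repo's class may differ in detail):

```python
import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    def __init__(self, d_model=512, max_len=5000, dropout=0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        pos = torch.arange(max_len).unsqueeze(1)                  # [max_len, 1]
        div = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(pos * div)   # even dimensions
        pe[:, 1::2] = torch.cos(pos * div)   # odd dimensions
        self.register_buffer("pe", pe.unsqueeze(0))               # buffer, not a parameter

    def forward(self, x):  # x: [batch, seq_len, d_model], already scaled by sqrt(d_model)
        return self.dropout(x + self.pe[:, : x.size(1)])
```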
### 4. Feed-Forward Network
FFN(x) = ReLU(x·W₁ + b₁)·W₂ + b₂
- Applied identically at each position
- d_model → d_ff (4×) → d_model
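Because the two linear layers act only on the last dimension, every position is transformed identically; a sketch:

```python
import torch.nn as nn

class PositionwiseFFN(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),   # d_model -> d_ff (4x expansion)
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),   # back down to d_model
        )

    def forward(self, x):               # x: [batch, seq_len, d_model]
        return self.net(x)
```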
### 5. Masking
- Padding mask: `-inf` for PAD tokens → zero attention weight after softmax
- Causal mask: upper-triangular `-inf` prevents attending to future tokens
- Combined: both masks are added together for decoder self-attention
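A sketch of how the two additive masks can be constructed and combined (the PAD index 0 and the broadcast shapes are assumptions; `transformer.py` may use different conventions):

```python
import torch

def make_masks(src, tgt, pad_idx=0):
    # Padding mask: -inf where the key position is PAD, 0 elsewhere.
    src_pad = torch.zeros_like(src, dtype=torch.float).masked_fill(src == pad_idx, float("-inf"))
    src_mask = src_pad[:, None, None, :]                                 # [batch, 1, 1, src_len]

    # Causal mask: -inf strictly above the diagonal blocks future positions.
    t = tgt.size(1)
    causal = torch.triu(torch.full((t, t), float("-inf")), diagonal=1)   # [tgt_len, tgt_len]

    # Decoder self-attention uses padding + causal masks added together.
    tgt_pad = torch.zeros_like(tgt, dtype=torch.float).masked_fill(tgt == pad_idx, float("-inf"))
    tgt_mask = tgt_pad[:, None, None, :] + causal                        # [batch, 1, tgt_len, tgt_len]
    return src_mask, tgt_mask
```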
### 6. Training Utilities
- Noam LR Schedule: warmup + inverse sqrt decay
- Greedy Decoding: autoregressive inference with encoder caching
- Weight Tying: shared target embedding and output projection
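The Noam schedule can be expressed as a plain `LambdaLR` multiplier; a sketch using the paper's constants (the wrapper below is an assumption about usage, not the repo's actual API; `model` is the Transformer built in the Quick Start section):

```python
import torch

def noam_lambda(d_model=512, warmup=4000):
    # lr(step) = d_model^-0.5 * min(step^-0.5, step * warmup^-1.5)
    def fn(step):
        step = max(step, 1)  # LambdaLR calls this with step 0 at construction
        return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
    return fn

optimizer = torch.optim.Adam(model.parameters(), lr=1.0, betas=(0.9, 0.98), eps=1e-9)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=noam_lambda())
# Call scheduler.step() after every optimizer.step().
```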
## Default Configuration (Paper Base Model)
| Parameter | Value | Description |
|---|---|---|
| `d_model` | 512 | Model dimension |
| `n_heads` | 8 | Attention heads |
| `n_layers` | 6 | Encoder/decoder layers |
| `d_ff` | 2048 | FFN hidden dim (4×d_model) |
| `d_k = d_v` | 64 | Per-head dimension (d_model/n_heads) |
| `dropout` | 0.1 | Residual + embedding dropout |
| Total params | ~54M | With vocab=10K |
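The ~54M figure can be sanity-checked with a quick back-of-the-envelope count of the weight matrices (biases and LayerNorm parameters add well under 1% and are ignored; assumes the tied target embedding / output projection):

```python
d_model, d_ff, n_layers, vocab = 512, 2048, 6, 10_000
attn = 4 * d_model * d_model          # W_Q, W_K, W_V, W_O
ffn  = 2 * d_model * d_ff             # W_1, W_2
enc  = n_layers * (attn + ffn)        # self-attention + FFN per encoder layer
dec  = n_layers * (2 * attn + ffn)    # masked self-attn + cross-attn + FFN per decoder layer
emb  = 2 * vocab * d_model            # source + target embeddings (output projection tied)
print(f"{(enc + dec + emb) / 1e6:.1f}M")   # ~54.3M
```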
## Quick Start
```python
import torch
import torch.nn as nn

from transformer import Transformer, greedy_decode

# Build model with paper defaults
model = Transformer(
    src_vocab_size=32000,
    tgt_vocab_size=32000,
    d_model=512,
    n_heads=8,
    n_layers=6,
    d_ff=2048,
    dropout=0.1,
)

# Forward pass (training with teacher forcing)
batch_size, src_len, tgt_len = 8, 10, 12          # example sizes
src = torch.randint(1, 32000, (batch_size, src_len))
tgt = torch.randint(1, 32000, (batch_size, tgt_len))
logits = model(src, tgt[:, :-1])                  # [batch, tgt_len-1, vocab]

# Loss computation
criterion = nn.CrossEntropyLoss()
loss = criterion(
    logits.reshape(-1, logits.size(-1)),
    tgt[:, 1:].reshape(-1),
)

# Inference (greedy decoding)
output = greedy_decode(model, src, max_len=100, bos_idx=1, eos_idx=2)
```
## Copy Task Demo
The copy task is a classic smoke test: the model must learn to reproduce its input.

```bash
python train_copy_task.py
```
Results (3000 steps, ~2 min on CPU):

```
Step    1 | Loss: 3.7956 | Acc:   6.9%
Step  300 | Loss: 0.1384 | Acc:  95.9%
Step  600 | Loss: 0.0192 | Acc:  99.5%
Step  900 | Loss: 0.0027 | Acc: 100.0%

EVALUATION: 10/10 (100%) copy accuracy ✓
```
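A copy-task batch is just a random token sequence used as both source and target; a minimal sketch of how such batches could be generated (not necessarily how `train_copy_task.py` builds them):

```python
import torch

def make_copy_batch(batch_size=32, seq_len=10, vocab_size=64, bos_idx=1):
    # Random token ids, reserving 0/1 for PAD/BOS; the target is the source itself.
    src = torch.randint(2, vocab_size, (batch_size, seq_len))
    tgt = torch.cat([torch.full((batch_size, 1), bos_idx), src], dim=1)  # BOS-prefixed copy
    return src, tgt
```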
## Key Design Decisions
| Decision | This Implementation | Alternative |
|---|---|---|
| LayerNorm | Post-LN (original paper) | Pre-LN (more stable, used in modern models) |
| Activation | ReLU (original paper) | GELU (BERT), SwiGLU (LLaMA) |
| Positional Encoding | Sinusoidal (fixed) | Learned (similar results), RoPE (modern) |
| Weight Init | Xavier Uniform | Kaiming, custom per-layer |
| Weight Tying | Enabled (paper §3.4) | Separate embeddings |
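To make the LayerNorm row concrete, the two placements differ only in where the norm sits relative to the residual connection; a sketch (where `sublayer` is attention or the FFN):

```python
def post_ln_block(x, sublayer, norm, dropout):
    # Post-LN (this implementation, following the original paper)
    return norm(x + dropout(sublayer(x)))

def pre_ln_block(x, sublayer, norm, dropout):
    # Pre-LN (used in most modern models; typically more stable to train)
    return x + dropout(sublayer(norm(x)))
```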
## Reference
```bibtex
@inproceedings{vaswani2017attention,
  title={Attention is all you need},
  author={Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and
          Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N and
          Kaiser, {\L}ukasz and Polosukhin, Illia},
  booktitle={Advances in Neural Information Processing Systems},
  volume={30},
  year={2017}
}
```