# Transformer from Scratch
A complete, annotated implementation of the Transformer architecture from "Attention Is All You Need" (Vaswani et al., 2017) in pure PyTorch. No external dependencies beyond torch.
## Files
| File | Description |
|---|---|
| `transformer.py` | Complete Transformer implementation (~900 lines, heavily commented) |
| `train_copy_task.py` | Training demo on a copy task (proves the model learns) |
## Architecture
Every component is implemented from scratch following the original paper (arXiv: 1706.03762):
The model is the standard encoder-decoder stack:

- **Encoder** (×N layers): self-attention + LayerNorm, then a position-wise FFN + LayerNorm. Source tokens enter as embeddings scaled by √d_model plus positional encodings.
- **Decoder** (×N layers): masked self-attention + LayerNorm, cross-attention over the encoder output + LayerNorm, then FFN + LayerNorm. Target embeddings get the same √d_model scaling and positional encodings, and a final output projection maps decoder states to vocabulary logits.
## Components
### 1. Scaled Dot-Product Attention
Attention(Q, K, V) = softmax(Q·Kᵀ / √d_k) · V
- Additive masking (`-inf` before softmax) for padding and causal constraints
- Scaling the scores by 1/√d_k prevents softmax saturation
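A minimal, standalone sketch of this function (the version in `transformer.py` may differ in details such as dropout on the attention weights):

```python
import math
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k, v: [batch, heads, seq_len, d_k]; mask is additive (0 or -inf),
    # broadcastable to the score shape.
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))  # [batch, heads, q_len, k_len]
    if mask is not None:
        scores = scores + mask           # masked positions become -inf
    weights = F.softmax(scores, dim=-1)  # ... and receive ~0 probability here
    return weights @ v, weights
```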
### 2. Multi-Head Attention
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) · W_O
- Single big projection + reshape (efficient, equivalent to h separate projections)
- Three uses: encoder self-attention, decoder masked self-attention, cross-attention
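A sketch of the "single big projection + reshape" trick (a hypothetical module, not necessarily the exact class in `transformer.py`; it reuses the attention sketch above):

```python
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_k = n_heads, d_model // n_heads
        # One d_model -> d_model projection per input is equivalent to
        # h separate d_model -> d_k projections, just batched together.
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, q, k, v, mask=None):
        b = q.size(0)
        def split(x, proj):  # [b, len, d_model] -> [b, heads, len, d_k]
            return proj(x).view(b, -1, self.n_heads, self.d_k).transpose(1, 2)
        q, k, v = split(q, self.w_q), split(k, self.w_k), split(v, self.w_v)
        out, _ = scaled_dot_product_attention(q, k, v, mask)
        out = out.transpose(1, 2).contiguous().view(b, -1, self.n_heads * self.d_k)
        return self.w_o(out)
```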
### 3. Positional Encoding
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
- Fixed sinusoidal (not learned), registered as buffer
- Added to scaled embeddings
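A sketch of how the fixed table can be built and registered as a buffer (`max_len` and the shapes are assumptions; the repo's class may differ in detail):

```python
import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    def __init__(self, d_model=512, max_len=5000, dropout=0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        pos = torch.arange(max_len).unsqueeze(1)                  # [max_len, 1]
        div = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(pos * div)   # even dimensions
        pe[:, 1::2] = torch.cos(pos * div)   # odd dimensions
        self.register_buffer("pe", pe.unsqueeze(0))               # buffer, not a parameter

    def forward(self, x):  # x: [batch, seq_len, d_model], already scaled by sqrt(d_model)
        return self.dropout(x + self.pe[:, : x.size(1)])
```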
### 4. Feed-Forward Network
FFN(x) = ReLU(x·W₁ + b₁)·W₂ + b₂
- Applied identically at each position
- d_model → d_ff (4×) → d_model
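Because the two linear layers act only on the last dimension, every position is transformed identically; a sketch:

```python
import torch.nn as nn

class PositionwiseFFN(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),   # d_model -> d_ff (4x expansion)
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),   # back down to d_model
        )

    def forward(self, x):               # x: [batch, seq_len, d_model]
        return self.net(x)
```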
### 5. Masking
- Padding mask: `-inf` for PAD tokens → zero attention weight after softmax
- Causal mask: upper-triangular `-inf` prevents attending to future tokens
- Combined: both masks are added together for decoder self-attention
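A sketch of how the two additive masks can be constructed and combined (the PAD index 0 and the broadcast shapes are assumptions; `transformer.py` may use different conventions):

```python
import torch

def make_masks(src, tgt, pad_idx=0):
    # Padding mask: -inf where the key position is PAD, 0 elsewhere.
    src_pad = torch.zeros_like(src, dtype=torch.float).masked_fill(src == pad_idx, float("-inf"))
    src_mask = src_pad[:, None, None, :]                                 # [batch, 1, 1, src_len]

    # Causal mask: -inf strictly above the diagonal blocks future positions.
    t = tgt.size(1)
    causal = torch.triu(torch.full((t, t), float("-inf")), diagonal=1)   # [tgt_len, tgt_len]

    # Decoder self-attention uses padding + causal masks added together.
    tgt_pad = torch.zeros_like(tgt, dtype=torch.float).masked_fill(tgt == pad_idx, float("-inf"))
    tgt_mask = tgt_pad[:, None, None, :] + causal                        # [batch, 1, tgt_len, tgt_len]
    return src_mask, tgt_mask
```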
### 6. Training Utilities
- Noam LR Schedule: warmup + inverse sqrt decay
- Greedy Decoding: autoregressive inference with encoder caching
- Weight Tying: shared target embedding and output projection
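The Noam schedule can be expressed as a plain `LambdaLR` multiplier; a sketch using the paper's constants (the wrapper below is an assumption about usage, not the repo's actual API; `model` is the Transformer built in the Quick Start section):

```python
import torch

def noam_lambda(d_model=512, warmup=4000):
    # lr(step) = d_model^-0.5 * min(step^-0.5, step * warmup^-1.5)
    def fn(step):
        step = max(step, 1)  # LambdaLR calls this with step 0 at construction
        return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
    return fn

optimizer = torch.optim.Adam(model.parameters(), lr=1.0, betas=(0.9, 0.98), eps=1e-9)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=noam_lambda())
# Call scheduler.step() after every optimizer.step().
```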
## Default Configuration (Paper Base Model)
| Parameter | Value | Description |
|---|---|---|
| `d_model` | 512 | Model dimension |
| `n_heads` | 8 | Attention heads |
| `n_layers` | 6 | Encoder/decoder layers |
| `d_ff` | 2048 | FFN hidden dim (4×d_model) |
| `d_k = d_v` | 64 | Per-head dimension (d_model/n_heads) |
| `dropout` | 0.1 | Residual + embedding dropout |
| Total params | ~54M | With vocab=10K |
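The ~54M figure can be sanity-checked with a quick back-of-the-envelope count of the weight matrices (biases and LayerNorm parameters add well under 1% and are ignored; assumes the tied target embedding / output projection):

```python
d_model, d_ff, n_layers, vocab = 512, 2048, 6, 10_000
attn = 4 * d_model * d_model          # W_Q, W_K, W_V, W_O
ffn  = 2 * d_model * d_ff             # W_1, W_2
enc  = n_layers * (attn + ffn)        # self-attention + FFN per encoder layer
dec  = n_layers * (2 * attn + ffn)    # masked self-attn + cross-attn + FFN per decoder layer
emb  = 2 * vocab * d_model            # source + target embeddings (output projection tied)
print(f"{(enc + dec + emb) / 1e6:.1f}M")   # ~54.3M
```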
## Quick Start
```python
import torch
import torch.nn as nn

from transformer import Transformer, greedy_decode

# Build model with paper defaults
model = Transformer(
    src_vocab_size=32000,
    tgt_vocab_size=32000,
    d_model=512,
    n_heads=8,
    n_layers=6,
    d_ff=2048,
    dropout=0.1,
)

# Forward pass (training with teacher forcing)
batch_size, src_len, tgt_len = 8, 10, 12          # example sizes
src = torch.randint(1, 32000, (batch_size, src_len))
tgt = torch.randint(1, 32000, (batch_size, tgt_len))
logits = model(src, tgt[:, :-1])                  # [batch, tgt_len-1, vocab]

# Loss computation
criterion = nn.CrossEntropyLoss()
loss = criterion(
    logits.reshape(-1, logits.size(-1)),
    tgt[:, 1:].reshape(-1),
)

# Inference (greedy decoding)
output = greedy_decode(model, src, max_len=100, bos_idx=1, eos_idx=2)
```
## Copy Task Demo
The copy task is a classic smoke test: the model must learn to reproduce its input.

```bash
python train_copy_task.py
```
Results (3000 steps, ~2 min on CPU):

```
Step    1 | Loss: 3.7956 | Acc:   6.9%
Step  300 | Loss: 0.1384 | Acc:  95.9%
Step  600 | Loss: 0.0192 | Acc:  99.5%
Step  900 | Loss: 0.0027 | Acc: 100.0%

EVALUATION: 10/10 (100%) copy accuracy ✓
```
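A copy-task batch is just a random token sequence used as both source and target; a minimal sketch of how such batches could be generated (not necessarily how `train_copy_task.py` builds them):

```python
import torch

def make_copy_batch(batch_size=32, seq_len=10, vocab_size=64, bos_idx=1):
    # Random token ids, reserving 0/1 for PAD/BOS; the target is the source itself.
    src = torch.randint(2, vocab_size, (batch_size, seq_len))
    tgt = torch.cat([torch.full((batch_size, 1), bos_idx), src], dim=1)  # BOS-prefixed copy
    return src, tgt
```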
## Key Design Decisions
| Decision | This Implementation | Alternative |
|---|---|---|
| LayerNorm | Post-LN (original paper) | Pre-LN (more stable, used in modern models) |
| Activation | ReLU (original paper) | GELU (BERT), SwiGLU (LLaMA) |
| Positional Encoding | Sinusoidal (fixed) | Learned (similar results), RoPE (modern) |
| Weight Init | Xavier Uniform | Kaiming, custom per-layer |
| Weight Tying | Enabled (paper §3.4) | Separate embeddings |
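To make the LayerNorm row concrete, the two placements differ only in where the norm sits relative to the residual connection; a sketch (where `sublayer` is attention or the FFN):

```python
def post_ln_block(x, sublayer, norm, dropout):
    # Post-LN (this implementation, following the original paper)
    return norm(x + dropout(sublayer(x)))

def pre_ln_block(x, sublayer, norm, dropout):
    # Pre-LN (used in most modern models; typically more stable to train)
    return x + dropout(sublayer(norm(x)))
```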
## Reference
```bibtex
@inproceedings{vaswani2017attention,
  title={Attention is all you need},
  author={Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and
          Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N and
          Kaiser, {\L}ukasz and Polosukhin, Illia},
  booktitle={Advances in Neural Information Processing Systems},
  volume={30},
  year={2017}
}
```