🔄 Transformer from Scratch

A complete, annotated implementation of the Transformer architecture from "Attention Is All You Need" (Vaswani et al., 2017) in pure PyTorch. No external dependencies beyond torch.

πŸ“ Files

File                 Description
transformer.py       Complete Transformer implementation (~900 lines, heavily commented)
train_copy_task.py   Training demo on a copy task (proves the model learns)

πŸ—οΈ Architecture

Every component is implemented from scratch following the original paper (arXiv: 1706.03762):

┌──────────────────────────────────────────────────────────────────┐
│                           TRANSFORMER                            │
│                                                                  │
│  ┌──────────────────┐          ┌──────────────────────┐          │
│  │     ENCODER      │          │       DECODER        │          │
│  │                  │          │                      │          │
│  │ ┌──────────────┐ │          │ ┌──────────────────┐ │          │
│  │ │Encoder Layer │ │  ×N      │ │ Decoder Layer    │ │ ×N       │
│  │ │              │ │          │ │                  │ │          │
│  │ │ Self-Attn    │ │          │ │ Masked Self-Attn │ │          │
│  │ │ + LayerNorm  │ │          │ │ + LayerNorm      │ │          │
│  │ │              │ │          │ │                  │ │          │
│  │ │ FFN          │ │─────────▶│ │ Cross-Attn       │ │          │
│  │ │ + LayerNorm  │ │   enc    │ │ + LayerNorm      │ │          │
│  │ └──────────────┘ │  output  │ │                  │ │          │
│  │                  │          │ │ FFN              │ │          │
│  │ Positional Enc.  │          │ │ + LayerNorm      │ │          │
│  │ + Embedding ×√d  │          │ └──────────────────┘ │          │
│  └──────────────────┘          │                      │          │
│        Source                  │ Positional Enc.      │          │
│        Tokens                  │ + Embedding ×√d      │          │
│                                │                      │          │
│                                │ Output Projection    │          │
│                                │ → Vocabulary Logits  │          │
│                                └──────────────────────┘          │
└──────────────────────────────────────────────────────────────────┘

🧩 Components

1. Scaled Dot-Product Attention

Attention(Q, K, V) = softmax(Q·Kᵀ / √d_k) · V
  • Additive masking (-inf before softmax) for padding and causal constraints
  • Scaling by √d_k prevents softmax saturation
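
A minimal sketch of this computation in PyTorch (the function name is illustrative; transformer.py may organize it differently):

import math
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k, v: [batch, heads, seq_len, d_k]
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # [batch, heads, q_len, k_len]
    if mask is not None:
        scores = scores + mask                          # additive -inf mask
    weights = F.softmax(scores, dim=-1)
    return weights @ v, weights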

2. Multi-Head Attention

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) · W_O
  • Single big projection + reshape (efficient, equivalent to h separate projections)
  • Three uses: encoder self-attention, decoder masked self-attention, cross-attention
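
A sketch of the single-projection-plus-reshape pattern (illustrative, not the exact transformer.py code):

import math
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.n_heads, self.d_k = n_heads, d_model // n_heads
        # one d_model x d_model projection per Q/K/V, equivalent to h separate d_model x d_k ones
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, q, k, v, mask=None):
        b = q.size(0)
        # project, split the last dim into (n_heads, d_k), move heads in front of the sequence dim
        q, k, v = [w(x).view(b, -1, self.n_heads, self.d_k).transpose(1, 2)
                   for w, x in ((self.w_q, q), (self.w_k, k), (self.w_v, v))]
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_k)
        if mask is not None:
            scores = scores + mask
        out = F.softmax(scores, dim=-1) @ v
        # merge heads back: Concat(head_1..head_h) · W_O
        out = out.transpose(1, 2).contiguous().view(b, -1, self.n_heads * self.d_k)
        return self.w_o(out)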

3. Positional Encoding

PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
  • Fixed sinusoidal (not learned), registered as buffer
  • Added to scaled embeddings
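
A sketch of how the sinusoidal table can be built and registered as a buffer (illustrative; transformer.py may differ in details such as max_len):

import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    def __init__(self, d_model=512, max_len=5000):
        super().__init__()
        pos = torch.arange(max_len).unsqueeze(1)                               # [max_len, 1]
        div = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(pos * div)                                     # even dims
        pe[:, 1::2] = torch.cos(pos * div)                                     # odd dims
        self.register_buffer("pe", pe)                                         # fixed, not a parameter

    def forward(self, x):                                                      # x: [batch, seq_len, d_model]
        return x + self.pe[: x.size(1)]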

4. Feed-Forward Network

FFN(x) = ReLU(x·W₁ + b₁)·W₂ + b₂
  • Applied identically at each position
  • d_model → d_ff (4×) → d_model
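
A sketch of the position-wise FFN matching the formula above (illustrative class name):

import torch.nn as nn

class FeedForward(nn.Module):
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        # the same two linear layers are applied independently at every position
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x):  # x: [batch, seq_len, d_model]
        return self.net(x)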

5. Masking

  • Padding mask: -inf for PAD tokens → zero attention weight after softmax
  • Causal mask: upper-triangular -inf → prevents attending to future tokens
  • Combined: added together for decoder self-attention
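
A sketch of how such additive masks can be built and combined, assuming masks that broadcast over heads (function name and pad index are illustrative):

import torch

def make_masks(src, tgt, pad_idx=0):
    # padding mask: -inf wherever the key position is PAD -> zero weight after softmax
    src_mask = torch.zeros(src.size(0), 1, 1, src.size(1))
    src_mask = src_mask.masked_fill((src == pad_idx).view(src.size(0), 1, 1, -1), float("-inf"))
    # causal mask: -inf strictly above the diagonal -> no attending to future tokens
    t = tgt.size(1)
    causal = torch.triu(torch.full((t, t), float("-inf")), diagonal=1)
    # decoder self-attention: padding and causal masks are simply added together
    tgt_pad = torch.zeros(tgt.size(0), 1, 1, t).masked_fill(
        (tgt == pad_idx).view(tgt.size(0), 1, 1, -1), float("-inf"))
    tgt_mask = tgt_pad + causal
    return src_mask, tgt_mask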

6. Training Utilities

  • Noam LR Schedule: warmup + inverse sqrt decay
  • Greedy Decoding: autoregressive inference with encoder caching
  • Weight Tying: shared target embedding and output projection
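
The Noam schedule from the paper (§5.3) is lr = d_model^-0.5 · min(step^-0.5, step · warmup^-1.5); a minimal sketch:

def noam_lr(step, d_model=512, warmup=4000):
    # warmup: lr grows linearly; afterwards: inverse square-root decay
    step = max(step, 1)  # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

# usage sketch: set the optimizer's base lr to 1.0 and wrap with
#   torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=noam_lr)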

βš™οΈ Default Configuration (Paper Base Model)

Parameter      Value   Description
d_model        512     Model dimension
n_heads        8       Attention heads
n_layers       6       Encoder/decoder layers
d_ff           2048    FFN hidden dim (4×d_model)
d_k = d_v      64      Per-head dimension (d_model/n_heads)
dropout        0.1     Residual + embedding dropout
Total params   ~54M    With vocab=10K

🚀 Quick Start

import torch
from transformer import Transformer, greedy_decode

# Build model with paper defaults
model = Transformer(
    src_vocab_size=32000,
    tgt_vocab_size=32000,
    d_model=512,
    n_heads=8,
    n_layers=6,
    d_ff=2048,
    dropout=0.1,
)

# Forward pass (training with teacher forcing)
batch_size, src_len, tgt_len = 16, 20, 20  # example shapes
src = torch.randint(1, 32000, (batch_size, src_len))
tgt = torch.randint(1, 32000, (batch_size, tgt_len))
logits = model(src, tgt[:, :-1])  # [batch, tgt_len-1, vocab]

# Loss computation (ignore_index=0 assumes PAD id 0)
criterion = torch.nn.CrossEntropyLoss(ignore_index=0)
loss = criterion(
    logits.reshape(-1, logits.size(-1)),
    tgt[:, 1:].reshape(-1),
)

# Inference (greedy decoding)
output = greedy_decode(model, src, max_len=100, bos_idx=1, eos_idx=2)

🧪 Copy Task Demo

The copy task is a classic smoke test, where the model must learn to reproduce its input:

python train_copy_task.py
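
For reference, a copy-task batch can be generated along these lines (purely illustrative; train_copy_task.py may use different sizes and special-token ids):

import torch

def make_copy_batch(batch_size=32, seq_len=10, vocab=11, bos=1, eos=2):
    # random token ids, reserving 0/1/2 for PAD/BOS/EOS
    body = torch.randint(3, vocab, (batch_size, seq_len))
    src = body
    tgt = torch.cat([torch.full((batch_size, 1), bos), body,
                     torch.full((batch_size, 1), eos)], dim=1)
    return src, tgt  # target = BOS + source + EOS; the model must copy src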

Results (3000 steps, ~2 min on CPU):

Step     1 | Loss: 3.7956 | Acc: 6.9%
Step   300 | Loss: 0.1384 | Acc: 95.9%
Step   600 | Loss: 0.0192 | Acc: 99.5%
Step   900 | Loss: 0.0027 | Acc: 100.0%

EVALUATION: 10/10 (100%) copy accuracy ✅

📚 Key Design Decisions

Decision              This Implementation         Alternative
LayerNorm             Post-LN (original paper)    Pre-LN (more stable, used in modern models)
Activation            ReLU (original paper)       GELU (BERT), SwiGLU (LLaMA)
Positional Encoding   Sinusoidal (fixed)          Learned (similar results), RoPE (modern)
Weight Init           Xavier Uniform              Kaiming, custom per-layer
Weight Tying          Enabled (paper §3.4)        Separate embeddings
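
The Post-LN vs. Pre-LN choice only changes where the normalization sits relative to the residual connection; a sketch of both orderings (class names are illustrative):

import torch.nn as nn

class PostLNBlock(nn.Module):
    def __init__(self, d_model, sublayer):
        super().__init__()
        self.sublayer, self.norm = sublayer, nn.LayerNorm(d_model)

    def forward(self, x):
        return self.norm(x + self.sublayer(x))  # Post-LN: normalize after the residual add

class PreLNBlock(nn.Module):
    def __init__(self, d_model, sublayer):
        super().__init__()
        self.sublayer, self.norm = sublayer, nn.LayerNorm(d_model)

    def forward(self, x):
        return x + self.sublayer(self.norm(x))  # Pre-LN: normalize before the sublayer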

📖 Reference

@inproceedings{vaswani2017attention,
  title={Attention is all you need},
  author={Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and 
          Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N and 
          Kaiser, {\L}ukasz and Polosukhin, Illia},
  booktitle={Advances in Neural Information Processing Systems},
  volume={30},
  year={2017}
}