---
license: mit
language:
- en
---

## πŸ‘Ό Simple Transformer

```
Author: Eshan Jayasundara
Last Updated: 2nd of March 2025
Created: 28th of February 2025
```

```
About:
└── Single-head transformer (transformer with self-attention, trained with teacher forcing)
```

```
Training:
└── Teacher Forcing (Baseline)
    β”œβ”€β”€ During training, the actual ground-truth tokens (from the dataset) are fed as input to the decoder instead of the model's own predictions.
    β”œβ”€β”€ This makes training faster and ensures the model learns accurate token-to-token mappings.
    └── Drawback: At inference time, the model doesn't see ground-truth inputs, so errors can accumulate (called exposure bias).
```

```
Vocabulary dataset (from Hugging Face):
└── "yukiarimo/english-vocabulary"
```

```
Simple Transformer Architecture:
```

![Simple Transformer Architecture](all_simple_transformer_corrected.drawio.png)

```
Encoder
β”œβ”€β”€ Input text
β”‚   └── E.g.: "Hello, how are you?"
β”œβ”€β”€ Remove punctuation from input text
β”œβ”€β”€ Input tokenization
β”œβ”€β”€ Embedding lookup with torch.nn.Embedding
β”œβ”€β”€ Positional encoding (sin, cosine)
β”œβ”€β”€ Self-attention
β”‚   β”œβ”€β”€ single-head
β”‚   β”œβ”€β”€ Q = Wq @ Embedding
β”‚   β”œβ”€β”€ K = Wk @ Embedding
β”‚   └── V = Wv @ Embedding
β”œβ”€β”€ Add and norm
β”œβ”€β”€ Feed-forward layer
β”‚   β”œβ”€β”€ 2 linear layers (one hidden layer of size d_ff)
β”‚   β”œβ”€β”€ ReLU as the activation in the hidden layer
β”‚   β”œβ”€β”€ No activation at the output layer
β”‚   └── nn.Linear(in_features=embedding_dim, out_features=d_ff), nn.ReLU(), nn.Linear(in_features=d_ff, out_features=embedding_dim)
β”œβ”€β”€ Add and norm (again)
└── Save encoder output to be used in cross-attention

Decoder
β”œβ”€β”€ Decoder teacher text (same as the target text but shifted right)
β”‚   β”œβ”€β”€ E.g.: Decoder teacher text - "<sos> hello, I'm fine."
β”‚   └── E.g.: Target text - "hello, I'm fine. <eos>"
β”œβ”€β”€ Remove punctuation from input text
β”œβ”€β”€ Input tokenization
β”œβ”€β”€ Embedding lookup with torch.nn.Embedding
β”œβ”€β”€ Positional encoding (sin, cosine)
β”œβ”€β”€ Masked self-attention (single-head; a new class signature is introduced for masked self-attention)
β”‚   β”œβ”€β”€ single-head
β”‚   β”œβ”€β”€ causal mask with a triangular matrix
β”‚   β”œβ”€β”€ Q = Wq @ Embedding
β”‚   β”œβ”€β”€ K = Wk @ Embedding
β”‚   └── V = Wv @ Embedding
β”œβ”€β”€ Add and norm
β”œβ”€β”€ Cross-attention (the same class used for the encoder self-attention can be reused)
β”‚   β”œβ”€β”€ single-head
β”‚   β”œβ”€β”€ Q = Wq @ add-and-normalized output of the masked self-attention
β”‚   β”œβ”€β”€ K = Wk @ Encoder output
β”‚   └── V = Wv @ Encoder output
β”œβ”€β”€ Add and norm
β”œβ”€β”€ Feed-forward layer
β”‚   β”œβ”€β”€ 2 linear layers (one hidden layer of size d_ff)
β”‚   β”œβ”€β”€ ReLU as the activation in the hidden layer
β”‚   β”œβ”€β”€ No activation at the output layer
β”‚   └── nn.Linear(in_features=embedding_dim, out_features=d_ff), nn.ReLU(), nn.Linear(in_features=d_ff, out_features=embedding_dim)
β”œβ”€β”€ Add and norm (again)
└── Linear layer (no activation; the final softmax from 'Attention Is All You Need' is not applied here)

Optimization
β”œβ”€β”€ Initialize the Adam optimizer with the model's parameters and a specified learning rate.
β”‚   └── self.optimizer = torch.optim.Adam(params=self.parameters(), lr=learning_rate)
β”œβ”€β”€ Before computing gradients for the current batch, reset any existing gradients from the previous iteration.
β”‚   └── self.optimizer.zero_grad()
β”œβ”€β”€ The model takes `input_tokens` and `decoder_teacher_tokens` and performs a forward pass to compute `logits`.
β”‚   └── logits = self.forward(input_tokens, decoder_teacher_tokens)
β”œβ”€β”€ The cross-entropy loss
β”‚   β”œβ”€β”€ Measures the difference between the predicted token distribution (logits) and the actual target tokens (decoder_target_tokens).
β”‚   β”œβ”€β”€ Expects raw scores (not probabilities) as logits and applies softmax internally.
β”‚   └── loss = F.cross_entropy(logits, decoder_target_tokens)
β”œβ”€β”€ Compute the gradients of the loss with respect to all trainable parameters using automatic differentiation (backpropagation).
β”‚   └── loss.backward()
└── The optimizer updates the model's weights using the computed gradients.
    └── self.optimizer.step()

After training, 'autoregressive text generation' is used to turn output tokens into text (one token at a time)
β”œβ”€β”€ Start with <sos> as the initial input to the decoder; the input to the encoder is the `prompt`.
β”œβ”€β”€ The model predicts the next token.
β”œβ”€β”€ Append the predicted token to the sequence.
β”œβ”€β”€ Repeat until an <eos> token or the maximum length is reached.
└── For illustration, let's use words instead of tokens (their numerical representation):
        <sos> hello
        <sos> hello I'm
        <sos> hello I'm good
        <sos> hello I'm good <eos>
```
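As a companion to the outline above, here is a minimal sketch of the sinusoidal (sin, cosine) positional encoding step. The function name and the assumption of an even `embedding_dim` are illustrative, not taken from the actual implementation.

```python
import math
import torch

def sinusoidal_positional_encoding(max_len: int, embedding_dim: int) -> torch.Tensor:
    """Return a (max_len, embedding_dim) matrix of sin/cos positional encodings."""
    position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)        # (max_len, 1)
    div_term = torch.exp(
        torch.arange(0, embedding_dim, 2, dtype=torch.float32)
        * (-math.log(10000.0) / embedding_dim)
    )                                                                         # (embedding_dim // 2,)
    pe = torch.zeros(max_len, embedding_dim)
    pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions use sin
    pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions use cos
    return pe

# Added to the embedding lookup before attention, e.g.:
# x = embedding(input_tokens) + sinusoidal_positional_encoding(input_tokens.size(0), embedding_dim)
```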
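The three attention blocks in the diagram (encoder self-attention, decoder masked self-attention, cross-attention) differ only in where Q, K, V come from and whether a causal mask is applied. The sketch below shows one way to write the single-head version; the class name `SingleHeadAttention` and its arguments are illustrative assumptions, not the repo's exact signatures.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class SingleHeadAttention(nn.Module):
    """Single-head scaled dot-product attention with an optional causal mask."""

    def __init__(self, embedding_dim: int, masked: bool = False):
        super().__init__()
        self.Wq = nn.Linear(embedding_dim, embedding_dim, bias=False)  # Q = Wq @ Embedding
        self.Wk = nn.Linear(embedding_dim, embedding_dim, bias=False)  # K = Wk @ Embedding
        self.Wv = nn.Linear(embedding_dim, embedding_dim, bias=False)  # V = Wv @ Embedding
        self.masked = masked

    def forward(self, query_in: torch.Tensor, kv_in: torch.Tensor) -> torch.Tensor:
        # query_in: (seq_q, embedding_dim), kv_in: (seq_k, embedding_dim)
        Q, K, V = self.Wq(query_in), self.Wk(kv_in), self.Wv(kv_in)
        scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1))       # scaled dot-product
        if self.masked:
            # Causal mask: an upper-triangular matrix blocks attention to future positions.
            seq_q, seq_k = scores.shape[-2], scores.shape[-1]
            causal = torch.triu(torch.ones(seq_q, seq_k, dtype=torch.bool), diagonal=1)
            scores = scores.masked_fill(causal, float("-inf"))
        return F.softmax(scores, dim=-1) @ V
```

Self-attention would pass the same tensor for `query_in` and `kv_in`; masked self-attention does the same with `masked=True`; cross-attention would pass the add-and-norm output as `query_in` and the saved encoder output as `kv_in`.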
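The 'Optimization' list maps almost one-to-one onto a single training step. A minimal sketch, assuming the model's forward pass takes `(input_tokens, decoder_teacher_tokens)` and returns logits with the vocabulary as the last dimension; the names and shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, input_tokens, decoder_teacher_tokens, decoder_target_tokens):
    optimizer.zero_grad()                                  # reset gradients from the previous batch
    logits = model(input_tokens, decoder_teacher_tokens)   # forward pass with teacher forcing
    # F.cross_entropy expects raw logits (softmax is applied internally)
    # and integer class indices as targets.
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)),
                           decoder_target_tokens.view(-1))
    loss.backward()                                        # backpropagation
    optimizer.step()                                       # apply the gradient update
    return loss.item()

# optimizer = torch.optim.Adam(params=model.parameters(), lr=learning_rate)
```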
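And a sketch of the greedy autoregressive loop used after training. `sos_id`, `eos_id`, `max_len`, and the model interface are assumptions for illustration.

```python
import torch

@torch.no_grad()
def generate(model, input_tokens, sos_id, eos_id, max_len=50):
    generated = [sos_id]                                   # start with the <sos> token
    for _ in range(max_len):
        decoder_tokens = torch.tensor(generated)
        logits = model(input_tokens, decoder_tokens)       # (seq_len, vocab_size)
        next_token = int(logits[-1].argmax())              # greedy: pick the most likely next token
        generated.append(next_token)
        if next_token == eos_id:                           # stop at <eos>
            break
    return generated
```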
```
Future Improvements:
β”œβ”€β”€ Multi-head attention instead of single-head attention.
β”œβ”€β”€ Layer normalization instead of simple mean-variance normalization.
└── Dropout layers for better generalization.
```
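For reference, these improvements correspond to standard PyTorch building blocks; a hedged sketch of what a future revision might swap in (the hyperparameters here are placeholders).

```python
import torch.nn as nn

multi_head_attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
layer_norm = nn.LayerNorm(normalized_shape=64)   # instead of simple mean-variance normalization
dropout = nn.Dropout(p=0.1)                      # regularization for better generalization
```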