---
license: mit
language:
- en
---

## Simple Transformer

```
Author: Eshan Jayasundara
Last Updated: 2nd of March 2025
Created: 28th of February 2025
```

```
About:
└── Single-head transformer (a transformer with self-attention, trained with teacher forcing)
```

```
Training:
└── Teacher Forcing (Baseline)
    ├── During training, the actual ground-truth tokens (from the dataset) are fed as input to the decoder instead of the model's own predictions.
    ├── This speeds up training and helps the model learn accurate token-to-token mappings.
    └── Drawback: at inference time the model does not see ground-truth inputs, so errors can accumulate (known as exposure bias).
```
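
To make the teacher-forcing setup concrete, here is a minimal sketch of how the decoder input and the training target can be derived from one tokenized sentence. The helper name `make_teacher_forcing_pair`, the word-level tokens, and the lowercased example are illustrative assumptions, not taken from this repository.

```python
# Minimal sketch of teacher-forcing inputs; all names here are hypothetical.
SOS, EOS = "<SOS>", "<EOS>"


def make_teacher_forcing_pair(target_tokens):
    """Return (decoder_input, decoder_target) for one target sequence.

    decoder_input  = <SOS> + target   (what the decoder is fed)
    decoder_target = target + <EOS>   (what the decoder must predict)
    """
    decoder_input = [SOS] + list(target_tokens)
    decoder_target = list(target_tokens) + [EOS]
    return decoder_input, decoder_target


# Assumed to be already stripped of punctuation and split into words.
tokens = ["hello", "im", "fine"]
decoder_input, decoder_target = make_teacher_forcing_pair(tokens)

# At step t the decoder sees the ground-truth prefix decoder_input[: t + 1]
# and is trained to predict decoder_target[t], regardless of what the model
# itself would have predicted at earlier steps.
for t, expected in enumerate(decoder_target):
    print(decoder_input[: t + 1], "->", expected)
```

Both sequences have the same length, so every decoder position has exactly one target token.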

```
Vocabulary dataset (from Hugging Face):
└── "yukiarimo/english-vocabulary"
```

```
Simple Transformer Architecture:
```
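
![image/png](https://cdn-uploads.huggingface.co/production/uploads/6612b8d743e0607d4e2f1659/RjnSaF72Zp_jqT_2ydeCo.png)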

```
Encoder
├── Input text
│   └── E.g.: "Hello, how are you?"
├── Remove punctuation from input text
├── Input tokenization
├── Embedding lookup with torch.nn.Embedding
├── Positional encoding (sine, cosine)
├── Self-attention
│   ├── single-head
│   ├── Q = Wq @ Embedding
│   ├── K = Wk @ Embedding
│   └── V = Wv @ Embedding
├── Add and norm
├── Feed forward layer
│   ├── 2 linear layers (one hidden layer of size d_ff)
│   ├── ReLU as the activation in the hidden layer
│   ├── No activation at the output layer
│   └── nn.Linear(in_features=embedding_dim, out_features=d_ff), nn.ReLU(), nn.Linear(in_features=d_ff, out_features=embedding_dim)
├── Add and norm (again)
└── Save the encoder output for use in cross-attention

Decoder
├── Decoder teacher text (same as the target text but shifted right)
│   ├── E.g.: decoder teacher text - "<SOS> hello, I'm fine."
│   └── E.g.: target text - "hello, I'm fine. <EOS>"
├── Remove punctuation from input text
├── Input tokenization
├── Embedding lookup with torch.nn.Embedding
├── Positional encoding (sine, cosine)
├── Masked self-attention (single-head; a separate class is introduced for masked self-attention)
│   ├── single-head
│   ├── causal mask built from a triangular matrix
│   ├── Q = Wq @ Embedding
│   ├── K = Wk @ Embedding
│   └── V = Wv @ Embedding
├── Add and norm
├── Cross-attention (the class used for the encoder self-attention can be reused)
│   ├── single-head
│   ├── Q = Wq @ (add-and-norm output of the masked self-attention)
│   ├── K = Wk @ Encoder output
│   └── V = Wv @ Encoder output
├── Add and norm
├── Feed forward layer
│   ├── 2 linear layers (one hidden layer of size d_ff)
│   ├── ReLU as the activation in the hidden layer
│   ├── No activation at the output layer
│   └── nn.Linear(in_features=embedding_dim, out_features=d_ff), nn.ReLU(), nn.Linear(in_features=d_ff, out_features=embedding_dim)
├── Add and norm (again)
└── Linear layer (unlike 'Attention Is All You Need', no activation or softmax is applied here; the output is raw logits)

Optimization
├── Initialize the Adam optimizer with the model's parameters and a specified learning rate.
│   └── self.optimizer = torch.optim.Adam(params=self.parameters(), lr=learning_rate)
├── Before computing gradients for the current batch, reset any gradients left over from the previous iteration.
│   └── self.optimizer.zero_grad()
├── The model takes in `input_tokens` and `decoder_teacher_tokens` and performs a forward pass to compute `logits`.
│   └── logits = self.forward(input_tokens, decoder_teacher_tokens)
├── The cross-entropy loss
│   ├── Measures the difference between the predicted token distribution (logits) and the actual target tokens (decoder_target_tokens).
│   ├── It expects raw scores (logits), not probabilities, because it applies log-softmax internally.
│   └── loss = F.cross_entropy(logits, decoder_target_tokens)
├── Compute the gradients of the loss with respect to all trainable parameters in the model using automatic differentiation (backpropagation).
│   └── loss.backward()
└── The optimizer updates the model's weights using the computed gradients.
    └── self.optimizer.step()

After training, autoregressive text generation is used to turn the predicted tokens into text, one word at a time.
├── Start with <SOS> as the initial decoder input; the encoder input is the `prompt`.
├── The model predicts the next token.
├── Append the predicted token to the sequence.
├── Repeat until an <EOS> token is produced or the maximum length is reached.
└── For illustration, words are shown below instead of tokens (their numerical representation).
    <SOS>
    <SOS> hello
    <SOS> hello I'm
    <SOS> hello I'm good
    <SOS> hello I'm good <EOS>
```
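
The generation loop described above can be sketched as follows. This is a minimal illustration of greedy decoding under stated assumptions: `generate`, `predict_next`, and `toy_predict_next` are hypothetical names, and the toy predictor simply replays a fixed reply in place of a trained model's forward pass and argmax.

```python
# Minimal sketch of greedy autoregressive decoding; all names are hypothetical.
SOS, EOS = "<SOS>", "<EOS>"


def generate(predict_next, prompt, max_length=10):
    """Grow the decoder sequence one token at a time.

    predict_next(prompt, decoder_tokens) stands in for a forward pass of the
    trained model followed by an argmax over the last position's logits.
    """
    decoder_tokens = [SOS]
    while len(decoder_tokens) < max_length:
        next_token = predict_next(prompt, decoder_tokens)
        decoder_tokens.append(next_token)
        if next_token == EOS:
            break  # the model signalled the end of the reply
    return decoder_tokens


# Toy stand-in for the model: replays a fixed reply and ignores the prompt.
CANNED_REPLY = ["hello", "I'm", "good", EOS]


def toy_predict_next(prompt, decoder_tokens):
    # The number of tokens generated so far selects the next canned word.
    return CANNED_REPLY[len(decoder_tokens) - 1]


result = generate(toy_predict_next, "Hello, how are you?")
print(" ".join(result))  # <SOS> hello I'm good <EOS>
```

Unlike training, each step here consumes the model's own previous outputs, which is where the exposure bias mentioned in the training notes can show up.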

```
Future Improvements:
├── Multi-head attention instead of single-head attention.
├── Layer normalization instead of simple mean-variance normalization.
└── Dropout layers for better generalization.
```