---
license: mit
language:
- en
---
```
Author: Eshan Jayasundara
Last Updated: 2nd of March 2025
Created: 28th of February 2025
```
```
About:
└── Single-head transformer (a transformer with single-head self-attention, trained with teacher forcing)
```
```
Training:
├── Teacher Forcing (baseline; see the sketch after this block)
├── During training, the actual ground-truth tokens (from the dataset) are fed to the decoder as input instead of the model's own predictions.
├── This makes training faster and ensures the model learns accurate token-to-token mappings.
└── Drawback: at inference time the model doesn't see ground-truth inputs, so errors can accumulate (known as exposure bias).
```
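
A minimal sketch of what teacher forcing means in practice: the decoder input is the target sequence shifted right behind an `<SOS>` token, while the loss is computed against the target followed by `<EOS>`. The token ids and tensor names below are illustrative, not taken from the repository code.

```python
import torch

SOS_ID, EOS_ID = 1, 2  # assumed special-token ids, for illustration only

# Target sentence "hello I'm fine" already mapped to (hypothetical) token ids.
target_ids = torch.tensor([37, 912, 455])

# Decoder teacher input: the target shifted right, starting with <SOS>.
decoder_teacher_tokens = torch.cat([torch.tensor([SOS_ID]), target_ids])  # <SOS> hello I'm fine
# Decoder target: the target followed by <EOS>; this is what the loss compares against.
decoder_target_tokens = torch.cat([target_ids, torch.tensor([EOS_ID])])   # hello I'm fine <EOS>

# During training the decoder always sees `decoder_teacher_tokens` (ground truth),
# never its own previous predictions - that is teacher forcing.
print(decoder_teacher_tokens.tolist(), decoder_target_tokens.tolist())
```
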
```
Vocabulary dataset (from Hugging Face):
└── "yukiarimo/english-vocabulary"
```
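
The vocabulary data can be pulled with the `datasets` library. The split name and column layout below are assumptions; check the dataset card before relying on them.

```python
from datasets import load_dataset

# Load the vocabulary dataset named above (assumes a "train" split exists).
dataset = load_dataset("yukiarimo/english-vocabulary", split="train")

# Inspect the first record to see which columns are available.
print(dataset[0])
```
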
```
Architecture:

Encoder
├── Input text
│   └── e.g. "Hello, how are you?"
├── Remove punctuation from the input text
├── Input tokenization
├── Embedding lookup with torch.nn.Embedding
├── Positional encoding (sine, cosine; sketched after this block)
├── Self-attention
│   ├── single-head
│   ├── Q = Wq @ Embedding
│   ├── K = Wk @ Embedding
│   └── V = Wv @ Embedding
├── Add and norm
├── Feed-forward layer
│   ├── two linear layers (a single hidden layer of size d_ff)
│   ├── ReLU as the activation in the hidden layer
│   ├── no activation at the output layer
│   └── nn.Linear(in_features=embedding_dim, out_features=d_ff), nn.ReLU(), nn.Linear(in_features=d_ff, out_features=embedding_dim)
├── Add and norm (again)
└── Save the encoder output to be used in cross-attention
```
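
Both the encoder above and the decoder below add sinusoidal positional encodings to the token embeddings. Here is a minimal sketch following the formulation in 'Attention Is All You Need'; the function and variable names are illustrative and an even `embedding_dim` is assumed.

```python
import torch

def sinusoidal_positional_encoding(seq_len: int, embedding_dim: int) -> torch.Tensor:
    """Return a (seq_len, embedding_dim) matrix of sine/cosine positional encodings."""
    positions = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)       # (seq_len, 1)
    div_term = torch.exp(
        torch.arange(0, embedding_dim, 2, dtype=torch.float32)
        * (-torch.log(torch.tensor(10000.0)) / embedding_dim)
    )                                                                          # (embedding_dim / 2,)
    pe = torch.zeros(seq_len, embedding_dim)
    pe[:, 0::2] = torch.sin(positions * div_term)   # even dimensions use sine
    pe[:, 1::2] = torch.cos(positions * div_term)   # odd dimensions use cosine
    return pe

# Added to the embedding lookup before the attention blocks, e.g.:
# x = embedding(input_tokens) + sinusoidal_positional_encoding(input_tokens.size(-1), embedding_dim)
```
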
```
Decoder
├── Decoder teacher text (same as the target text but shifted right)
│   ├── e.g. decoder teacher text: "<SOS> hello, I'm fine."
│   └── e.g. target text: "hello, I'm fine. <EOS>"
├── Remove punctuation from the input text
├── Input tokenization
├── Embedding lookup with torch.nn.Embedding
├── Positional encoding (sine, cosine)
├── Masked self-attention (single-head; a separate class is introduced for the masked variant)
│   ├── single-head
│   ├── causal mask built from a triangular matrix
│   ├── Q = Wq @ Embedding
│   ├── K = Wk @ Embedding
│   └── V = Wv @ Embedding
├── Add and norm
├── Cross-attention (the same class used for the encoder self-attention can be reused; see the attention sketch after this block)
│   ├── single-head
│   ├── Q = Wq @ (add-and-normed output of the masked self-attention)
│   ├── K = Wk @ Encoder output
│   └── V = Wv @ Encoder output
├── Add and norm
├── Feed-forward layer
│   ├── two linear layers (a single hidden layer of size d_ff)
│   ├── ReLU as the activation in the hidden layer
│   ├── no activation at the output layer
│   └── nn.Linear(in_features=embedding_dim, out_features=d_ff), nn.ReLU(), nn.Linear(in_features=d_ff, out_features=embedding_dim)
├── Add and norm (again)
└── Linear output layer (no activation; the final softmax from 'Attention Is All You Need' is not applied here)
```
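
The encoder self-attention, the decoder masked self-attention, and the cross-attention above all reduce to the same single-head scaled dot-product computation. The sketch below uses one class with a `causal` flag for brevity, whereas the repository apparently uses a separate class for the masked variant; the module and argument names are illustrative, not the actual class signatures.

```python
import math
import torch
import torch.nn as nn

class SingleHeadAttention(nn.Module):
    """Single-head scaled dot-product attention with an optional causal mask."""

    def __init__(self, embedding_dim: int, causal: bool = False):
        super().__init__()
        self.Wq = nn.Linear(embedding_dim, embedding_dim, bias=False)
        self.Wk = nn.Linear(embedding_dim, embedding_dim, bias=False)
        self.Wv = nn.Linear(embedding_dim, embedding_dim, bias=False)
        self.causal = causal

    def forward(self, query_input: torch.Tensor, kv_input: torch.Tensor) -> torch.Tensor:
        # query_input: (batch, tgt_len, dim); kv_input: (batch, src_len, dim).
        # Self-attention passes the same tensor twice; cross-attention passes the
        # decoder state as query_input and the encoder output as kv_input.
        Q, K, V = self.Wq(query_input), self.Wk(kv_input), self.Wv(kv_input)
        scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1))   # (batch, tgt_len, src_len)
        if self.causal:
            # Upper-triangular mask blocks attention to future positions.
            mask = torch.triu(torch.ones(scores.shape[-2:], dtype=torch.bool), diagonal=1)
            scores = scores.masked_fill(mask, float("-inf"))
        return torch.softmax(scores, dim=-1) @ V

# Usage mirroring the blocks above (batch=2, embedding_dim=16):
x_enc = torch.randn(2, 5, 16)
x_dec = torch.randn(2, 4, 16)
self_attn = SingleHeadAttention(16)                  # encoder self-attention
masked_attn = SingleHeadAttention(16, causal=True)   # decoder masked self-attention
cross_attn = SingleHeadAttention(16)                 # decoder cross-attention
enc_out = self_attn(x_enc, x_enc)
dec_hidden = masked_attn(x_dec, x_dec)
dec_out = cross_attn(dec_hidden, enc_out)
print(dec_out.shape)  # torch.Size([2, 4, 16])
```
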
```
Optimization (see the training-step sketch after this block)
├── Initialize the Adam optimizer with the model's parameters and a specified learning rate.
│   └── self.optimizer = torch.optim.Adam(params=self.parameters(), lr=learning_rate)
├── Before computing gradients for the current batch, reset any gradients left over from the previous iteration.
│   └── self.optimizer.zero_grad()
├── The model takes `input_tokens` and `decoder_teacher_tokens` and performs a forward pass to compute `logits`.
│   └── logits = self.forward(input_tokens, decoder_teacher_tokens)
├── The cross-entropy loss
│   ├── measures the difference between the predicted token distribution (logits) and the actual target tokens (decoder_target_tokens)
│   ├── expects raw scores (not probabilities) and applies softmax internally
│   └── loss = F.cross_entropy(logits, decoder_target_tokens)
├── Compute the gradients of the loss with respect to all trainable parameters using automatic differentiation (backpropagation).
│   └── loss.backward()
└── The optimizer updates the model's weights using the computed gradients.
    └── self.optimizer.step()
```
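
Putting the steps above together, one teacher-forced training iteration looks roughly like this. The `model` object, its forward signature, and the flattening of the sequence dimension are assumptions based on the description above, not the repository's exact code.

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, input_tokens, decoder_teacher_tokens, decoder_target_tokens):
    """One teacher-forced training iteration, mirroring the steps listed above."""
    optimizer.zero_grad()                                  # reset gradients from the previous batch
    logits = model(input_tokens, decoder_teacher_tokens)   # forward pass -> (batch, seq_len, vocab_size)
    # F.cross_entropy expects (N, C) logits and (N,) class indices, so flatten the
    # sequence dimension; softmax is applied internally.
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           decoder_target_tokens.reshape(-1))
    loss.backward()                                        # backpropagate
    optimizer.step()                                       # update the weights
    return loss.item()

# optimizer = torch.optim.Adam(params=model.parameters(), lr=learning_rate)
```
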
```
After training, output tokens are turned back into text with autoregressive generation, one token at a time (see the sketch after this block):
├── Start with <SOS> as the initial input to the decoder; the input to the encoder is the `prompt`.
├── The model predicts the next token.
├── Append the predicted token to the sequence.
├── Repeat until an <EOS> token or the maximum length is reached.
└── For illustration, using words instead of tokens (their numerical representation):
    <SOS>
    <SOS> hello
    <SOS> hello I'm
    <SOS> hello I'm good
    <SOS> hello I'm good <EOS>
```
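
A sketch of the greedy autoregressive loop described above. `model`, `sos_id`, `eos_id`, and `max_len` are illustrative names; beam search or sampling could replace the `argmax` step.

```python
import torch

@torch.no_grad()
def generate(model, input_tokens, sos_id, eos_id, max_len=50):
    """Greedy decoding: feed the growing sequence back into the decoder one step at a time."""
    generated = torch.tensor([[sos_id]])                           # decoder starts with <SOS>
    for _ in range(max_len):
        logits = model(input_tokens, generated)                    # (1, cur_len, vocab_size)
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True) # most likely next token
        generated = torch.cat([generated, next_token], dim=1)      # append and repeat
        if next_token.item() == eos_id:                            # stop at <EOS>
            break
    return generated.squeeze(0).tolist()
```
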
```
Feature Improvements:
├── Multi-head attention instead of single-head attention.
├── Layer normalization instead of simple mean-variance normalization.
└── Dropout layers for better generalization.
```
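
PyTorch ships drop-in modules for each of the improvements listed above; a hedged sketch of how they might be wired in (the hyperparameters are placeholders):

```python
import torch.nn as nn

embedding_dim, num_heads, dropout_p = 64, 4, 0.1  # placeholder hyperparameters

multi_head_attention = nn.MultiheadAttention(embed_dim=embedding_dim,
                                             num_heads=num_heads,
                                             batch_first=True)  # replaces the single-head attention class
layer_norm = nn.LayerNorm(embedding_dim)                        # replaces mean-variance normalization
dropout = nn.Dropout(p=dropout_p)                               # regularization between sub-layers
```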