eshangj committed 0a0585d (verified) · 1 parent: 33402f4

Create README.md

Files changed (1): README.md (+112 −0)
---
license: mit
language:
- en
---
```
Author: Eshan Jayasundara
Last Updated: 2nd of March 2025
Created: 28th of February 2025
```
```
About:
└── Single-head transformer (a Transformer with self-attention, trained with teacher forcing)
```
```
Training:
└── Teacher Forcing (Baseline)
    ├── During training, the ground-truth tokens from the dataset are fed as input to the decoder instead of the model's own predictions.
    ├── This makes training faster and ensures the model learns accurate token-to-token mappings.
    └── Drawback: at inference time the model doesn't see ground-truth inputs, so errors can accumulate (known as exposure bias).
```
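The right shift that teacher forcing relies on can be sketched in plain Python. This is an illustration, not the repository's code; the function name and token strings are chosen here for clarity:

```python
def make_teacher_pair(tokens):
    """Build a teacher-forcing pair from a raw token sequence.

    The decoder input is the target shifted right by one position:
    it starts with <SOS>, while the target ends with <EOS>.
    """
    decoder_input = ["<SOS>"] + tokens   # what the decoder sees during training
    target = tokens + ["<EOS>"]          # what the loss compares against
    return decoder_input, target

inp, tgt = make_teacher_pair(["hello", "i'm", "fine"])
# inp == ["<SOS>", "hello", "i'm", "fine"]
# tgt == ["hello", "i'm", "fine", "<EOS>"]
```

At every position the decoder input token is the target token one step earlier, which is exactly the alignment the loss below assumes.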
```
Vocabulary dataset (from Hugging Face):
└── "yukiarimo/english-vocabulary"
```
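The punctuation-removal and tokenization steps that both encoder and decoder apply can be sketched as follows. The tiny inline vocabulary and the `<UNK>` fallback are toy stand-ins for the Hugging Face dataset above, used only to make the example self-contained:

```python
import string

def preprocess(text):
    # Strip punctuation, lowercase, and split on whitespace.
    text = text.translate(str.maketrans("", "", string.punctuation))
    return text.lower().split()

def tokenize(words, vocab):
    # Map each word to its vocabulary index; unknown words fall back to <UNK>.
    return [vocab.get(w, vocab["<UNK>"]) for w in words]

vocab = {"<UNK>": 0, "hello": 1, "how": 2, "are": 3, "you": 4}
tokens = tokenize(preprocess("Hello, how are you?"), vocab)  # [1, 2, 3, 4]
```

These token IDs are what the `torch.nn.Embedding` lookup in the architecture below consumes.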
```
Architecture:

Encoder
├── Input text
│   └── E.g.: "Hello, how are you?"
├── Remove punctuation from input text
├── Input tokenization
├── Embedding lookup with torch.nn.Embedding
├── Positional encoding (sine, cosine)
├── Self-attention
│   ├── single-head
│   ├── Q = Wq @ Embedding
│   ├── K = Wk @ Embedding
│   └── V = Wv @ Embedding
├── Add and norm
├── Feed-forward layer
│   ├── 2 hidden layers
│   ├── ReLU as the activation in the hidden layer
│   ├── No activation at the output layer
│   └── nn.Linear(in_features=embedding_dim, out_features=d_ff), nn.ReLU(), nn.Linear(in_features=d_ff, out_features=embedding_dim)
├── Add and norm (again)
└── Save encoder output to be used in cross-attention

Decoder
├── Decoder teacher text (same as the target text but shifted right)
│   ├── E.g.: decoder teacher text - "<SOS> hello, I'm fine."
│   └── E.g.: target text - "hello, I'm fine. <EOS>"
├── Remove punctuation from input text
├── Input tokenization
├── Embedding lookup with torch.nn.Embedding
├── Positional encoding (sine, cosine)
├── Masked self-attention (single-head; a new class is introduced for masked self-attention)
│   ├── single-head
│   ├── causal mask built from a triangular matrix
│   ├── Q = Wq @ Embedding
│   ├── K = Wk @ Embedding
│   └── V = Wv @ Embedding
├── Add and norm
├── Cross-attention (the same class used for encoder self-attention can be reused)
│   ├── single-head
│   ├── Q = Wq @ add-and-normalized output from masked self-attention
│   ├── K = Wk @ Encoder output
│   └── V = Wv @ Encoder output
├── Add and norm
├── Feed-forward layer
│   ├── 2 hidden layers
│   ├── ReLU as the activation in the hidden layer
│   ├── No activation at the output layer
│   └── nn.Linear(in_features=embedding_dim, out_features=d_ff), nn.ReLU(), nn.Linear(in_features=d_ff, out_features=embedding_dim)
├── Add and norm (again)
└── Linear layer (no activation; the final softmax from 'Attention Is All You Need' is omitted here because F.cross_entropy applies it internally)

Optimization
├── Initialize the Adam optimizer with the model's parameters and a specified learning rate.
│   └── self.optimizer = torch.optim.Adam(params=self.parameters(), lr=learning_rate)
├── Before computing gradients for the current batch, reset any existing gradients from the previous iteration.
│   └── self.optimizer.zero_grad()
├── The model takes `input_tokens` and `decoder_teacher_tokens` and performs a forward pass to compute `logits`.
│   └── logits = self.forward(input_tokens, decoder_teacher_tokens)
├── The cross-entropy loss
│   ├── Measures the difference between the predicted token distribution (logits) and the actual target tokens (decoder_target_tokens).
│   ├── It expects raw scores (not probabilities) as logits and applies softmax internally.
│   └── loss = F.cross_entropy(logits, decoder_target_tokens)
├── Compute the gradients of the loss with respect to all trainable parameters using automatic differentiation (backpropagation).
│   └── loss.backward()
└── The optimizer updates the model's weights using the computed gradients.
    └── self.optimizer.step()

After training, to turn output tokens into text, autoregressive text generation is used (one token at a time):
├── Start with <SOS> as the initial input to the decoder; the input to the encoder is the prompt.
├── The model predicts the next token.
├── Append the predicted token to the sequence.
├── Repeat until an <EOS> token or the maximum length is reached.
└── For illustration, words are shown instead of tokens (their numerical representations):
    <SOS>
    <SOS> hello
    <SOS> hello I'm
    <SOS> hello I'm good
    <SOS> hello I'm good <EOS>
```
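The attention step shared by encoder and decoder (with the decoder's causal mask) and the sinusoidal positional encoding can be sketched framework-free. This NumPy version mirrors the Q/K/V projections described above but is an illustration under assumed shapes, not the repository's PyTorch code:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    # Sinusoidal encoding from 'Attention Is All You Need':
    # even dimensions use sine, odd dimensions use cosine.
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))  # (seq_len, d_model)

def single_head_attention(x_q, x_kv, Wq, Wk, Wv, causal=False):
    # Self-attention when x_q is x_kv; cross-attention when x_kv is
    # the encoder output (as in the decoder's cross-attention step).
    Q, K, V = x_q @ Wq, x_kv @ Wk, x_kv @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])          # scaled dot product
    if causal:
        # Triangular mask: position t may only attend to positions <= t.
        mask = np.triu(np.ones_like(scores), k=1).astype(bool)
        scores = np.where(mask, -1e9, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V

rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=(5, d)) + positional_encoding(5, d)
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = single_head_attention(x, x, Wq, Wk, Wv, causal=True)  # shape (5, 8)
```

With the causal mask, the first position can only attend to itself, so its output is exactly its own value vector — a quick sanity check for the mask.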
```
Future Improvements:
├── Multi-head attention instead of single-head attention.
├── Layer normalization instead of simple mean-variance normalization.
└── Dropout layers for better generalization.
```
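Of these, multi-head attention is largely a reshaping exercise over the single-head version: the model dimension is split into several heads that attend independently, then the results are concatenated and projected. A minimal NumPy sketch, with all names and shapes illustrative rather than taken from the repository:

```python
import numpy as np

def multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads):
    # Split d_model into n_heads heads of size d_head, attend per head,
    # then concatenate the heads and apply the output projection Wo.
    seq_len, d_model = x.shape
    d_head = d_model // n_heads

    def split(W):  # (seq, d_model) -> (heads, seq, d_head)
        return (x @ W).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split(Wq), split(Wk), split(Wv)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # (heads, seq, seq)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)                    # softmax per head
    heads = w @ V                                         # (heads, seq, d_head)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ Wo

rng = np.random.default_rng(1)
d = 8
x = rng.normal(size=(6, d))
Wq, Wk, Wv, Wo = (rng.normal(size=(d, d)) for _ in range(4))
out = multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads=2)  # shape (6, 8)
```

Setting `n_heads=1` recovers the single-head behaviour (up to the extra output projection), which is why this is a natural drop-in upgrade.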