# Decoder
In this section we will be implementing the decoder.
<img src="assets/definitions/decoder.png" width=60% >
The **decoder** also has its own stack of multiple layers that builds up context for the target language (French in my example).
Unlike the encoder, each decoder layer adds an extra sub-layer that performs multi-head attention ([see the encoder README for a definition of attention](/encoder_transformer/README.md)) over the output of the encoder stack.
Similar to the encoder, we employ residual connections
around each of the sub-layers, followed by layer normalization. We also modify the self-attention
sub-layer in the decoder stack to prevent positions from attending to subsequent positions. This
masking, combined with the fact that the output embeddings are offset by one position, ensures that the
predictions for position i can depend only on the known outputs at positions less than i.
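To make that masking concrete, here is a minimal NumPy sketch (names and values are my own, not from this repo) showing how a causal mask forces each position to attend only to earlier positions:

```python
import numpy as np

def causal_mask(seq_len):
    """Lower-triangular mask: position i may only see positions <= i."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

mask = causal_mask(4)
# Raw (unnormalized) attention scores for a 4-token sequence.
scores = np.random.randn(4, 4)
# Masked positions get -inf so softmax assigns them zero weight.
scores = np.where(mask, scores, -np.inf)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))  # row i has zero weight after column i
```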
# Transformer Background:
[Simple video](https://www.youtube.com/watch?v=wjZofJX0v4M)
**Transformers** are a type of deep neural network architecture that takes a sequence of data (such as text), understands how each element in the sequence relates to the others and predicts what comes next.
### How it works
In large language models (LLMs), a transformer understands how words in a sentence relate to each other so it can capture meaning and generate the next word.
> **"The cat sat ..."**
The model does this by splitting the sentence into small pieces called **tokens** (often a full word, but sometimes only part of a word). Then we embed each input token into a vector (a list of numbers) that captures its meaning.
$$
E =
\begin{bmatrix}
0.2 & 0.4 & 0.1 \\\\
0.6 & 0.1 & 0.8 \\\\
0.9 & 0.7 & 0.3
\end{bmatrix}
\qquad
\text{The} =
\begin{bmatrix}
0.2 & 0.4 & 0.1
\end{bmatrix}
\quad
\text{cat} =
\begin{bmatrix}
0.6 & 0.1 & 0.8
\end{bmatrix}
\quad
\text{sat} =
\begin{bmatrix}
0.9 & 0.7 & 0.3
\end{bmatrix}
$$
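A tiny sketch of this embedding step (the vocabulary and numbers are just the toy example values above, not a real tokenizer):

```python
import numpy as np

# Toy vocabulary and the example embedding matrix E from above.
vocab = {"The": 0, "cat": 1, "sat": 2}
E = np.array([[0.2, 0.4, 0.1],
              [0.6, 0.1, 0.8],
              [0.9, 0.7, 0.3]])

def embed(sentence):
    """Split into tokens, then look up each token's row in E."""
    ids = [vocab[tok] for tok in sentence.split()]
    return E[ids]  # shape: (num_tokens, embedding_dim)

X = embed("The cat sat")
print(X.shape)  # (3, 3)
```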
#### Transformer Blocks
The transformer block decides which tokens are most relevant to each other (the **self-attention** layer) and then refines and transforms that information (the **feedforward neural network**).
<img src="assets/definitions/transformer.png" width="50%" alt="Transformers definition">
##### Self Attention
Each token creates three versions of itself:
- **Query (Q):** What it’s looking for
- **Key (K):** What it offers
- **Value (V):** Its meaning
$$
Q = XW_Q, \quad K = XW_K, \quad V = XW_V
$$
Then, every word compares its **Query** with every other **Key** (scaled down by the key dimension $d_k$ to keep the numbers stable):
$$
\text{Scores} = \frac{QK^T}{\sqrt{d_k}}
$$
Higher scores mean the words are more related. A **softmax** turns each row of scores into attention weights that sum to 1, and finally each word mixes the information from all others using these weights:
$$
\text{Output} = \text{softmax}(\text{Scores}) \times V
$$
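Putting the three equations together, a minimal single-head self-attention sketch in NumPy (weights are random here; in a real model they are learned):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3                               # embedding dimension
X = rng.standard_normal((3, d))     # 3 token vectors, e.g. "The cat sat"

# Learned projection matrices (random stand-ins here).
W_Q, W_K, W_V = (rng.standard_normal((d, d)) for _ in range(3))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V
scores = Q @ K.T / np.sqrt(d)       # compare every Query with every Key
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # softmax -> attention weights
output = weights @ V                # each token mixes the others' Values
print(output.shape)  # (3, 3)
```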
#### Feedforward Neural Network + Normalize
After **Self-Attention**, each word’s vector now contains context.
The **Feedforward Layer** helps the model process that information more deeply using a small two-layer network:
$$
\text{FFN}(x) = \text{ReLU}(xW_1 + b_1)W_2 + b_2
$$
Then, the model adds the original input back to the output (a **residual connection**) and **normalizes** it for stability:
$$
\text{Output} = \text{LayerNorm}(x + \text{FFN}(x))
$$
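The same two formulas as a NumPy sketch (dimensions and weights are illustrative; real models learn scale/shift parameters inside LayerNorm as well):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token's vector to zero mean, unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def ffn(x, W1, b1, W2, b2):
    """Two-layer network: expand, ReLU, project back down."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
d, d_ff = 4, 16                         # hidden size is typically ~4x d
x = rng.standard_normal((3, d))         # 3 token vectors with context
W1, b1 = rng.standard_normal((d, d_ff)), np.zeros(d_ff)
W2, b2 = rng.standard_normal((d_ff, d)), np.zeros(d)

out = layer_norm(x + ffn(x, W1, b1, W2, b2))   # residual + normalize
print(out.mean(axis=-1))  # ~0 for every token
```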
### Language Modeling Head
After the transformer finishes processing,
each word has a final **vector** that captures its full meaning and context.
Now the model needs to turn those vectors into actual **predicted words**.
Each word is now just a list of numbers (a vector) — for example:
| Word | Vector (example) |
|:-----|:----------------:|
| The | [0.23, -0.11, 0.77, 0.52, ...] |
| king | [0.45, 0.84, -0.31, 0.09, ...] |
These vectors are the **input** to the language modeling head.
---
#### ⚙️ What happens next
The model multiplies these vectors by a large **vocabulary matrix**
(one row for every possible word it can predict).
This gives a **score** for each possible next word.
Then it uses **softmax** to turn those scores into **probabilities**
that add up to 1.
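A minimal sketch of this step (toy vocabulary and random weights; in the code above it stores scores one column per word rather than one row):

```python
import numpy as np

rng = np.random.default_rng(0)
d, vocab_size = 4, 5
vocab = ["king", "cat", "apple", "sat", "the"]   # toy vocabulary

h = rng.standard_normal(d)                       # final vector of last token
W_vocab = rng.standard_normal((d, vocab_size))   # one column per word here

scores = h @ W_vocab                  # a score for each candidate next word
probs = np.exp(scores - scores.max())
probs /= probs.sum()                  # softmax: probabilities sum to 1

print(vocab[int(np.argmax(probs))])   # pick the highest-probability word
```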
---
#### 🗣️ Output Example
If the model has just seen the word “The”, it might predict:
| Next Word | Probability |
|:-----------|:-------------:|
| king | 0.65 |
| cat | 0.18 |
| apple | 0.06 |
| sat | 0.02 |
The word with the **highest probability** ("king") is chosen as the next word.
<img src="assets/definitions/transformers.jpeg" width="50%" alt="Transformers definition">
# Applying:
**Input**: Tiny Shakespeare
**Train**
- Run: `python3 training.py`
- Device auto-detects `mps` on Apple Silicon; falls back to `cpu`.
- The script now saves a checkpoint at `assets/checkpoints/gpt-YYYYmmdd-HHMMSS.pt` and a convenient copy at `assets/checkpoints/latest.pt`.
**Generate**
- After training, sample text with:
- `python3 sample.py --prompt "ROMEO:" --max_new_tokens 300`
- If needed, specify device: `--device cpu` or `--device mps`.
- To use a specific checkpoint: `--ckpt assets/checkpoints/gpt-20241017-153800.pt`.
**Notes**
- Prompts should use characters seen during training (Tiny Shakespeare) for best results.
- Checkpoint includes model hyperparameters and the character vocabulary, so generation does not require the training data.
**Resume + Checkpoints**
- Training now saves periodic checkpoints every `--save_interval` steps (default = `eval_interval`).
- Latest checkpoint path: `assets/checkpoints/latest.pt`.
- Resume training:
- `python3 training.py --resume` (auto picks `latest.pt` if present)
- Or specify: `python3 training.py --resume --ckpt assets/checkpoints/gpt-YYYYmmdd-HHMMSS-step3000.pt`
- Safe interrupt: Press Ctrl+C; the script saves a checkpoint at the next safe point and exits.
## Steps and connections
- Implement `decoder.py` with embeddings, masked self-attn, cross-attn, FFN blocks.
- Inputs: `tgt_input_ids`, `encoder_hidden_states`, masks → logits.
- Tie decoder token embedding with LM head for efficiency.
- Use causal mask internally; mask PAD tokens in self-attention too.
- Connect in `machine_translation/model.py` as the generation component.
Decoder (French):
- Stack of N decoder blocks; each block has:
  - Masked self-attention on target tokens (causal mask).
  - Cross-attention over `encoder_hidden_states` (K, V from encoder; Q from decoder).
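A rough NumPy sketch of one decoder block following those rules (single head, random weights, projection matrices and the FFN omitted for brevity; the real `decoder.py` uses learned parameters and multiple heads):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V, causal=False):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    if causal:  # block attention to future target positions
        keep = np.tril(np.ones(scores.shape, dtype=bool))
        scores = np.where(keep, scores, -np.inf)
    return softmax(scores) @ V

rng = np.random.default_rng(0)
d, T_tgt, T_src = 4, 5, 6
tgt = rng.standard_normal((T_tgt, d))   # target-token vectors
enc = rng.standard_normal((T_src, d))   # encoder_hidden_states

# 1) Masked self-attention: Q, K, V all from the target (causal mask).
x = tgt + attention(tgt, tgt, tgt, causal=True)
# 2) Cross-attention: Q from the decoder, K and V from the encoder.
x = x + attention(x, enc, enc)
print(x.shape)  # (5, 4)
```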
**Goal:** Build a decoder (GPT) from scratch
Extra resources to help out: |