AashishAIHub committed
Commit 797e92e · 1 Parent(s): 3ccb99b

feat: integrate PDF book notes into DL modules


- Optimizers: Added AdaGrad, comparison table, 10 interview questions
- Activation Functions: Added Swish, GELU, Dead Neurons, How-to-Choose, 10 IQs
- Loss Functions: Added Huber Loss, Hinge Loss, comparison table, 10 IQs
- Backpropagation: Added Forward Prop, Training Pipeline, Training Terms, 10 IQs
- Regularization: Added Weight Init (Xavier/He), Dropout vs BatchNorm, 10 IQs

Total: 50 interview questions from DeppLEarning.pdf integrated

Files changed (1)
  1. DeepLearning/index.html +588 -50
DeepLearning/index.html CHANGED
@@ -943,6 +943,18 @@
943
  <td>Multi-class output</td>
944
  <td>Computationally expensive</td>
945
  </tr>
946
  </table>
947
  `,
948
  concepts: `
@@ -972,6 +984,49 @@
972
  • Try <strong>Leaky ReLU</strong> or <strong>ELU</strong> if ReLU neurons are dying<br>
973
  • Avoid Sigmoid/Tanh in deep networks (gradient vanishing)
974
  </div>
975
  `,
976
  applications: `
977
  <div class="info-box">
@@ -986,6 +1041,20 @@
986
  Different tasks need different outputs: Sigmoid for binary, Softmax for multi-class, Linear for regression
987
  </div>
988
  </div>
989
  `,
990
  math: `
991
  <h3>Derivatives: The Backprop Fuel</h3>
@@ -2068,11 +2137,27 @@ output, attn_weights = mha(x, x, x) <span style="color: #6c7086;"># Self-attent
2068
  <h3>Common Loss Functions</h3>
2069
  <div class="list-item">
2070
  <div class="list-num">01</div>
2071
- <div><strong>MSE:</strong> (1/n)Σ(y - ŷ)² - Penalizes large errors</div>
2072
  </div>
2073
  <div class="list-item">
2074
  <div class="list-num">02</div>
2075
- <div><strong>Cross-Entropy:</strong> (y·log(ŷ)) - For classification</div>
2076
  </div>
2077
  `,
2078
  applications: `
@@ -2088,6 +2173,31 @@ output, attn_weights = mha(x, x, x) <span style="color: #6c7086;"># Self-attent
2088
  Business-specific objectives: Focal Loss (imbalanced data), Dice Loss (segmentation), Contrastive Loss (similarity learning)
2089
  </div>
2090
  </div>
2091
  `,
2092
  math: `
2093
  <h3>Binary Cross-Entropy (BCE) Derivation</h3>
@@ -2097,6 +2207,25 @@ output, attn_weights = mha(x, x, x) <span style="color: #6c7086;"># Self-attent
2097
  L(ŷ, y) = -(y log(ŷ) + (1-y) log(1-ŷ))
2098
  </div>
2099
 
2100
  <h3>Paper & Pain: Why not MSE for Classification?</h3>
2101
  <p>If we use MSE for sigmoid output, the gradient is:</p>
2102
  <div class="formula">
@@ -2171,6 +2300,30 @@ output, attn_weights = mha(x, x, x) <span style="color: #6c7086;"># Self-attent
2171
  CNNs: SGD+Momentum | Transformers: AdamW | RNNs: RMSprop | Default: Adam
2172
  </div>
2173
  </div>
2174
  `,
2175
  math: `
2176
  <h3>Gradient Descent: The Foundation</h3>
@@ -2210,8 +2363,27 @@ output, attn_weights = mha(x, x, x) <span style="color: #6c7086;"># Self-attent
2210
  Momentum accumulates past gradients for faster convergence.
2211
  </div>
2212
 
2213
- <h3>3. RMSprop</h3>
2214
- <p>Adapts learning rate per-parameter using running average of squared gradients.</p>
2215
 
2216
  <div class="formula">
2217
  v_t = β × v_{t-1} + (1-β) × (∇L)²<br>
@@ -2219,8 +2391,8 @@ output, attn_weights = mha(x, x, x) <span style="color: #6c7086;"># Self-attent
2219
  β = 0.9, ε = 1e-8 (numerical stability)
2220
  </div>
2221
 
2222
- <h3>4. Adam (Adaptive Moment Estimation)</h3>
2223
- <p>Combines momentum AND adaptive learning rates. The most popular optimizer.</p>
2224
 
2225
  <div class="formula" style="background: rgba(255, 107, 53, 0.08); padding: 20px; border-radius: 8px;">
2226
  <strong>Step 1 - First Moment (Momentum):</strong><br>
@@ -2254,15 +2426,40 @@ output, attn_weights = mha(x, x, x) <span style="color: #6c7086;"># Self-attent
2254
  },
2255
  "backprop": {
2256
  overview: `
2257
- <h3>Backpropagation Algorithm</h3>
2258
- <p>Backprop efficiently computes gradients by applying the chain rule from output to input, enabling training of deep networks.</p>
2259
 
2260
- <h3>Why Backpropagation?</h3>
2261
- <ul>
2262
- <li><strong>Efficient:</strong> Computes all gradients in single backward pass</li>
2263
- <li><strong>Scalable:</strong> Works for networks of any depth</li>
2264
- <li><strong>Automatic:</strong> Modern frameworks do it automatically</li>
2265
- </ul>
2266
  `,
2267
  concepts: `
2268
  <div class="formula">
@@ -2287,6 +2484,20 @@ output, attn_weights = mha(x, x, x) <span style="color: #6c7086;"># Self-attent
2287
  PyTorch, TensorFlow implement automatic backprop - you define forward pass, framework does backward
2288
  </div>
2289
  </div>
2290
  `,
2291
  math: `
2292
  <h3>The 4 Fundamental Equations of Backprop</h3>
@@ -2364,6 +2575,16 @@ output, attn_weights = mha(x, x, x) <span style="color: #6c7086;"># Self-attent
2364
  <td>Computer vision (rotations, flips, crops)</td>
2365
  </tr>
2366
  </table>
2367
  `,
2368
  applications: `
2369
  <div class="info-box">
@@ -2375,6 +2596,29 @@ output, attn_weights = mha(x, x, x) <span style="color: #6c7086;"># Self-attent
2375
  • Data Augmentation for images
2376
  </div>
2377
  </div>
2378
  `,
2379
  math: `
2380
  <h3>L2 Regularization (Weight Decay)</h3>
@@ -2442,6 +2686,30 @@ output, attn_weights = mha(x, x, x) <span style="color: #6c7086;"># Self-attent
2442
  (where n = number of neurons that can be dropped)<br><br>
2443
  Each forward pass is a different architecture!
2444
  </div>
2445
  `
2446
  },
2447
  "batch-norm": {
@@ -4128,82 +4396,352 @@ output, attn_weights = mha(x, x, x) <span style="color: #6c7086;"># Self-attent
4128
  },
4129
  "bert": {
4130
  overview: `
4131
- <h3>BERT (Bidirectional Encoder Representations from Transformers)</h3>
4132
- <p>Pre-trained encoder-only Transformer for understanding language (not generation).</p>
4133
 
4134
- <h3>Key Innovation: Bidirectional Context</h3>
4135
- <p>Unlike GPT (left-to-right), BERT sees both left AND right context simultaneously.</p>
4136
 
4137
- <h3>Pre-training Tasks</h3>
4138
  <ul>
4139
- <li><strong>Masked Language Modeling:</strong> Mask 15% of tokens, predict them (e.g., "The cat [MASK] on the mat" → predict "sat")</li>
4140
- <li><strong>Next Sentence Prediction:</strong> Predict if sentence B follows A</li>
4141
  </ul>
4142
 
4143
  <div class="callout tip">
4144
- <div class="callout-title">💡 Fine-tuning BERT</div>
4145
- 1. Start with pre-trained BERT (trained on billions of words)<br>
4146
- 2. Add task-specific head (classification, QA, NER)<br>
4147
- 3. Fine-tune on your dataset (10K-100K examples)<br>
4148
- 4. Achieves SOTA with minimal data!
4149
  </div>
4150
  `,
4151
  concepts: `
4152
  <h3>BERT Architecture</h3>
4153
  <div class="list-item">
4154
  <div class="list-num">01</div>
4155
- <div><strong>Encoder Only:</strong> 12/24 Transformer encoder layers (BERT-base/large)</div>
4156
  </div>
4157
  <div class="list-item">
4158
  <div class="list-num">02</div>
4159
- <div><strong>Token Embedding:</strong> WordPiece tokenization (30K vocab)</div>
4160
  </div>
4161
  <div class="list-item">
4162
  <div class="list-num">03</div>
4163
- <div><strong>Segment Embedding:</strong> Distinguish sentence A from sentence B</div>
4164
  </div>
4165
- <div class="list-item">
4166
- <div class="list-num">04</div>
4167
- <div><strong>[CLS] Token:</strong> Aggregated representation for classification tasks</div>
4168
  </div>
4169
 
4170
- <h3>Model Sizes</h3>
4171
  <table>
4172
- <tr><th>Model</th><th>Layers</th><th>Hidden</th><th>Params</th></tr>
4173
- <tr><td>BERT-base</td><td>12</td><td>768</td><td>110M</td></tr>
4174
- <tr><td>BERT-large</td><td>24</td><td>1024</td><td>340M</td></tr>
4175
  </table>
4176
  `,
4177
  math: `
4178
  <h3>Masked Language Modeling (MLM)</h3>
4179
- <p>BERT's main pre-training objective:</p>
4180
 
4181
  <div class="formula">
4182
- L_MLM = log P(x_masked | x_visible)<br>
4183
  <br>
4184
- For each masked token, predict using cross-entropy loss
4185
  </div>
4186
 
4187
- <div class="callout insight">
4188
- <div class="callout-title">📝 Paper & Pain: Masking Strategy</div>
4189
- Of the 15% tokens selected for masking:<br>
4190
- • 80% → [MASK] token<br>
4191
- • 10% → Random token<br>
4192
- • 10% → Keep original<br>
4193
- This prevents over-reliance on [MASK] during fine-tuning!
4194
  </div>
4195
  `,
4196
  applications: `
4197
  <div class="info-box">
4198
- <div class="box-title">🔍 Search & QA</div>
4199
  <div class="box-content">
4200
- <strong>Google Search:</strong> Uses BERT for understanding queries<br>
4201
- Question answering systems, document retrieval
4202
  </div>
4203
  </div>
4204
  <div class="info-box">
4205
  <div class="box-title">📊 Text Classification</div>
4206
- <div class="box-content">Sentiment analysis, topic classification, spam detection</div>
4207
  </div>
4208
  `
4209
  },
 
943
  <td>Multi-class output</td>
944
  <td>Computationally expensive</td>
945
  </tr>
946
+ <tr>
947
+ <td>GELU</td>
948
+ <td>(-0.17, ∞)</td>
949
+ <td>Transformers (BERT, GPT)</td>
950
+ <td>Computationally expensive</td>
951
+ </tr>
952
+ <tr>
953
+ <td>Swish</td>
954
+ <td>(-0.28, ∞)</td>
955
+ <td>Deep networks (40+ layers)</td>
956
+ <td>Slightly slower than ReLU</td>
957
+ </tr>
958
  </table>
959
  `,
960
  concepts: `
 
984
  • Try <strong>Leaky ReLU</strong> or <strong>ELU</strong> if ReLU neurons are dying<br>
985
  • Avoid Sigmoid/Tanh in deep networks (gradient vanishing)
986
  </div>
987
+
988
+ <div class="callout warning">
989
+ <div class="callout-title">⚠️ Dead Neurons (Dying ReLU Problem)</div>
990
+ When a neuron's input is always negative, ReLU outputs 0 and its gradient is 0.<br>
991
+ The neuron <strong>never updates</strong> — it's permanently "dead".<br><br>
992
+ <strong>Solutions:</strong><br>
993
+ • Use <strong>Leaky ReLU</strong> (small slope for negative values)<br>
994
+ • Use <strong>ELU</strong> (exponential for negative values)<br>
995
+ • Careful weight initialization (He Initialization)
996
+ </div>
997
+
998
+ <h3>GELU (Gaussian Error Linear Unit)</h3>
999
+ <p>Used in <strong>Transformers, BERT, and GPT</strong>. GELU multiplies the input by the probability that it's positive under a Gaussian distribution.</p>
1000
+ <div class="formula">
1001
+ GELU(x) = x × Φ(x) ≈ 0.5x(1 + tanh(√(2/π)(x + 0.044715x³)))
1002
+ </div>
1003
+
1004
+ <h3>Swish (Self-Gated Activation)</h3>
1005
+ <p>Developed by Google researchers. Consistently matches or outperforms ReLU on deep networks.</p>
1006
+ <div class="formula">
1007
+ Swish(x) = x × σ(βx) where σ = Sigmoid, β = learnable parameter
1008
+ </div>
1009
+ <div class="callout tip">
1010
+ <div class="callout-title">💡 Why Swish is Better</div>
1011
+ • <strong>Smooth:</strong> Doesn't abruptly change direction like ReLU at x=0<br>
1012
+ • <strong>Non-monotonic:</strong> Small negative values preserved (not zeroed like ReLU)<br>
1013
+ • <strong>Unbounded above, bounded below:</strong> Best of both worlds<br>
1014
+ • Best for networks with depth > 40 layers
1015
+ </div>
1016
+
1017
+ <h3>How to Choose Activation Functions</h3>
1018
+ <table>
1019
+ <tr><th>Layer / Task</th><th>Recommended</th></tr>
1020
+ <tr><td>Hidden layers (default)</td><td>ReLU</td></tr>
1021
+ <tr><td>Regression output</td><td>Linear (no activation)</td></tr>
1022
+ <tr><td>Binary classification output</td><td>Sigmoid</td></tr>
1023
+ <tr><td>Multi-class classification</td><td>Softmax</td></tr>
1024
+ <tr><td>Multi-label classification</td><td>Sigmoid</td></tr>
1025
+ <tr><td>CNN hidden layers</td><td>ReLU</td></tr>
1026
+ <tr><td>RNN hidden layers</td><td>Tanh / Sigmoid</td></tr>
1027
+ <tr><td>Transformers</td><td>GELU</td></tr>
1028
+ <tr><td>Deep networks (40+ layers)</td><td>Swish</td></tr>
1029
+ </table>
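A minimal plain-Python sketch of the GELU and Swish formulas above (β fixed at 1 rather than learned; with β = 1, Swish is also known as SiLU):

```python
import math

def gelu(x):
    # Tanh approximation: GELU(x) ≈ 0.5x(1 + tanh(√(2/π)(x + 0.044715x³)))
    return 0.5 * x * (1 + math.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * x ** 3)))

def gelu_exact(x):
    # Exact form: x · Φ(x), where Φ is the standard normal CDF
    return x * 0.5 * (1 + math.erf(x / math.sqrt(2)))

def swish(x, beta=1.0):
    # Swish(x) = x · σ(βx)
    return x / (1 + math.exp(-beta * x))

# The tanh approximation tracks the exact GELU closely:
for v in (-3.0, -1.0, 0.0, 1.0, 3.0):
    assert abs(gelu(v) - gelu_exact(v)) < 1e-3
```

Note how both functions keep small negative inputs slightly negative instead of zeroing them outright like ReLU.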
1030
  `,
1031
  applications: `
1032
  <div class="info-box">
 
1041
  Different tasks need different outputs: Sigmoid for binary, Softmax for multi-class, Linear for regression
1042
  </div>
1043
  </div>
1044
+
1045
+ <div class="callout tip">
1046
+ <div class="callout-title">🎤 Probable Interview Questions</div>
1047
+ 1. Why do we need activation functions?<br>
1048
+ 2. What is vanishing gradient?<br>
1049
+ 3. Why is ReLU preferred over sigmoid?<br>
1050
+ 4. What are dead neurons?<br>
1051
+ 5. Difference between ReLU and Leaky ReLU?<br>
1052
+ 6. Why softmax instead of sigmoid for multiclass?<br>
1053
+ 7. Why linear activation for regression output?<br>
1054
+ 8. Why GELU is used in transformers?<br>
1055
+ 9. Can activation function affect convergence speed?<br>
1056
+ 10. What happens if we remove activation functions?
1057
+ </div>
1058
  `,
1059
  math: `
1060
  <h3>Derivatives: The Backprop Fuel</h3>
 
2137
  <h3>Common Loss Functions</h3>
2138
  <div class="list-item">
2139
  <div class="list-num">01</div>
2140
+ <div><strong>MSE:</strong> (1/n)Σ(y - ŷ)² - Penalizes large errors, sensitive to outliers</div>
2141
  </div>
2142
  <div class="list-item">
2143
  <div class="list-num">02</div>
2144
+ <div><strong>MAE:</strong> (1/n)Σ|y - ŷ| - Robust to outliers, constant gradient, slower convergence</div>
2145
+ </div>
2146
+ <div class="list-item">
2147
+ <div class="list-num">03</div>
2148
+ <div><strong>Huber Loss:</strong> MSE when |error| ≤ δ, MAE otherwise. Best of both — smooth + robust to outliers</div>
2149
+ </div>
2150
+ <div class="list-item">
2151
+ <div class="list-num">04</div>
2152
+ <div><strong>BCE (Binary Cross-Entropy):</strong> -[y·log(ŷ) + (1-y)·log(1-ŷ)] - Used with Sigmoid</div>
2153
+ </div>
2154
+ <div class="list-item">
2155
+ <div class="list-num">05</div>
2156
+ <div><strong>CCE (Categorical Cross-Entropy):</strong> -Σ y·log(ŷ) - Used with Softmax for multi-class</div>
2157
+ </div>
2158
+ <div class="list-item">
2159
+ <div class="list-num">06</div>
2160
+ <div><strong>Hinge Loss:</strong> max(0, 1 - y·ŷ) where y ∈ {-1, +1} - Margin-based, SVM-style</div>
2161
  </div>
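The six losses above can be sketched as scalar Python functions (for intuition only; framework versions are batched and numerically stabilized):

```python
import math

def mse(y, y_hat):                     # quadratic penalty — outlier-sensitive
    return (y - y_hat) ** 2

def mae(y, y_hat):                     # linear penalty — outlier-robust
    return abs(y - y_hat)

def huber(y, y_hat, delta=1.0):        # MSE inside the δ band, shifted MAE outside
    err = abs(y - y_hat)
    return 0.5 * err ** 2 if err <= delta else delta * err - 0.5 * delta ** 2

def bce(y, y_hat, eps=1e-12):          # y ∈ {0,1}, y_hat ∈ (0,1); eps guards log(0)
    return -(y * math.log(y_hat + eps) + (1 - y) * math.log(1 - y_hat + eps))

def hinge(y, y_hat):                   # y ∈ {-1,+1}; zero loss beyond the margin
    return max(0.0, 1 - y * y_hat)

assert huber(0, 0.5) == 0.125          # |err| ≤ δ → MSE regime: ½·0.5²
assert huber(0, 2.0) == 1.5            # |err| > δ → MAE regime: 1·2 − ½
assert hinge(1, 2.0) == 0.0            # confidently correct → no penalty
```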
2162
  `,
2163
  applications: `
 
2173
  Business-specific objectives: Focal Loss (imbalanced data), Dice Loss (segmentation), Contrastive Loss (similarity learning)
2174
  </div>
2175
  </div>
2176
+
2177
+ <h3>Loss Function Comparison</h3>
2178
+ <table>
2179
+ <tr><th>Loss</th><th>Type</th><th>Outlier Sensitivity</th><th>Key Property</th></tr>
2180
+ <tr><td>MSE</td><td>Regression</td><td>High</td><td>Penalizes large errors heavily</td></tr>
2181
+ <tr><td>MAE</td><td>Regression</td><td>Low</td><td>Robust, constant gradient</td></tr>
2182
+ <tr><td>Huber</td><td>Regression</td><td>Medium</td><td>Smooth + robust (MSE+MAE combo)</td></tr>
2183
+ <tr><td>BCE</td><td>Binary Class.</td><td>High</td><td>Strong gradients for wrong predictions</td></tr>
2184
+ <tr><td>CCE</td><td>Multi-class</td><td>High</td><td>Outputs probabilities via Softmax</td></tr>
2185
+ <tr><td>Hinge</td><td>Binary Class.</td><td>Medium</td><td>Margin-based, less probabilistic</td></tr>
2186
+ </table>
2187
+
2188
+ <div class="callout tip">
2189
+ <div class="callout-title">🎤 Probable Interview Questions</div>
2190
+ 1. Difference between MSE and MAE?<br>
2191
+ 2. Why Huber loss is preferred sometimes?<br>
2192
+ 3. Why BCE with sigmoid?<br>
2193
+ 4. Why softmax with CCE?<br>
2194
+ 5. Why can't we use MSE for classification?<br>
2195
+ 6. What is Hinge loss and where is it used?<br>
2196
+ 7. Difference between loss function and evaluation metric?<br>
2197
+ 8. How does loss choice affect gradients?<br>
2198
+ 9. What is Focal Loss and when to use it?<br>
2199
+ 10. Can we design custom loss functions?
2200
+ </div>
2201
  `,
2202
  math: `
2203
  <h3>Binary Cross-Entropy (BCE) Derivation</h3>
 
2207
  L(ŷ, y) = -(y log(ŷ) + (1-y) log(1-ŷ))
2208
  </div>
2209
 
2210
+ <h3>Huber Loss (Smooth MAE)</h3>
2211
+ <p>Combines MSE for small errors and MAE for large errors using threshold δ:</p>
2212
+ <div class="formula">
2213
+ L = ½(y - ŷ)² &nbsp;&nbsp;&nbsp; when |y - ŷ| ≤ δ<br>
2214
+ L = δ|y - ŷ| - ½δ² &nbsp;&nbsp; otherwise
2215
+ </div>
2216
+ <div class="callout insight">
2217
+ <div class="callout-title">📝 Paper & Pain: Huber Intuition</div>
2218
+ <strong>Small error (|error| ≤ δ):</strong> Behaves like MSE — smooth, differentiable<br>
2219
+ <strong>Large error (|error| > δ):</strong> Behaves like MAE — doesn't blow up for outliers<br><br>
2220
+ Best of both worlds! Used when data contains mild outliers.
2221
+ </div>
2222
+
2223
+ <h3>Hinge Loss (SVM-style)</h3>
2224
+ <div class="formula">
2225
+ L = (1/n) Σ max(0, 1 - y·ŷ) &nbsp;&nbsp; where y ∈ {-1, +1}
2226
+ </div>
2227
+ <p>Margin-based loss: only penalizes predictions within the margin boundary. Used in SVMs and some neural network classifiers.</p>
2228
+
2229
  <h3>Paper & Pain: Why not MSE for Classification?</h3>
2230
  <p>If we use MSE for sigmoid output, the gradient is:</p>
2231
  <div class="formula">
 
2300
  CNNs: SGD+Momentum | Transformers: AdamW | RNNs: RMSprop | Default: Adam
2301
  </div>
2302
  </div>
2303
+
2304
+ <h3>Optimizer Comparison</h3>
2305
+ <table>
2306
+ <tr><th>Optimizer</th><th>Key Idea</th><th>Problem</th></tr>
2307
+ <tr><td>SGD</td><td>Simple, fast</td><td>Noisy convergence</td></tr>
2308
+ <tr><td>Momentum</td><td>Smooths updates</td><td>Needs tuning</td></tr>
2309
+ <tr><td>AdaGrad</td><td>Adaptive LR</td><td>LR shrinks too much</td></tr>
2310
+ <tr><td>RMSProp</td><td>Fixes AdaGrad</td><td>No momentum</td></tr>
2311
+ <tr><td><strong>Adam</strong></td><td><strong>Best of all</strong></td><td>Slightly more computation</td></tr>
2312
+ </table>
2313
+
2314
+ <div class="callout tip">
2315
+ <div class="callout-title">🎤 Probable Interview Questions</div>
2316
+ 1. Difference between optimizer and gradient descent?<br>
2317
+ 2. Why does SGD oscillate?<br>
2318
+ 3. Why does AdaGrad fail in deep networks?<br>
2319
+ 4. How does RMSProp fix AdaGrad?<br>
2320
+ 5. Why is bias correction needed in Adam?<br>
2321
+ 6. What happens if learning rate is too high?<br>
2322
+ 7. When would you prefer SGD over Adam?<br>
2323
+ 8. What is momentum intuitively?<br>
2324
+ 9. Why is Adam the default choice?<br>
2325
+ 10. Can Adam overfit?
2326
+ </div>
2327
  `,
2328
  math: `
2329
  <h3>Gradient Descent: The Foundation</h3>
 
2363
  Momentum accumulates past gradients for faster convergence.
2364
  </div>
2365
 
2366
+ <h3>3. AdaGrad (Adaptive Gradient)</h3>
2367
+ <p>Adapts learning rate per-parameter based on how frequently each parameter is updated.</p>
2368
+
2369
+ <div class="formula">
2370
+ <strong>Accumulated Gradient:</strong><br>
2371
+ G_t = G_{t-1} + (∇L)²<br><br>
2372
+ <strong>Update Rule:</strong><br>
2373
+ w_{t+1} = w_t - η / √(G_t + ε) × ∇L<br><br>
2374
+ Where ε = 1e-8 (numerical stability)
2375
+ </div>
2376
+
2377
+ <div class="callout insight">
2378
+ <div class="callout-title">📝 Paper & Pain: AdaGrad Intuition</div>
2379
+ <strong>Frequent parameters</strong> → G_t grows fast → learning rate shrinks<br>
2380
+ <strong>Rare parameters</strong> → G_t stays small → learning rate stays large<br><br>
2381
+ <strong>Problem:</strong> G_t only accumulates (never forgets), so learning rate keeps shrinking and training may stop early!<br>
2382
+ <strong>This is exactly why RMSprop was invented →</strong>
2383
+ </div>
2384
+
2385
+ <h3>4. RMSprop (Root Mean Square Propagation)</h3>
2386
+ <p>Fixes AdaGrad's shrinking problem by using a <strong>decaying average</strong> of recent squared gradients instead of summing all.</p>
2387
 
2388
  <div class="formula">
2389
  v_t = β × v_{t-1} + (1-β) × (∇L)²<br>
 
2391
  β = 0.9, ε = 1e-8 (numerical stability)
2392
  </div>
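The difference between the two update rules is easy to see numerically. A toy single-parameter sketch with a constant gradient of 1 (illustrative values, not a real training run):

```python
def adagrad_steps(n, lr=0.1, eps=1e-8):
    G, steps = 0.0, []
    for _ in range(n):
        g = 1.0                              # constant gradient for illustration
        G += g ** 2                          # accumulates forever — never forgets
        steps.append(lr / (G + eps) ** 0.5 * g)
    return steps

def rmsprop_steps(n, lr=0.1, beta=0.9, eps=1e-8):
    v, steps = 0.0, []
    for _ in range(n):
        g = 1.0
        v = beta * v + (1 - beta) * g ** 2   # decaying average of squared gradients
        steps.append(lr / (v + eps) ** 0.5 * g)
    return steps

ada, rms = adagrad_steps(100), rmsprop_steps(100)
assert ada[-1] < ada[0] / 5       # AdaGrad's effective step keeps shrinking
assert abs(rms[-1] - 0.1) < 1e-3  # RMSprop's step stabilizes near lr as v → 1
```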
2393
 
2394
+ <h3>5. Adam (Adaptive Moment Estimation)</h3>
2395
+ <p>Combines momentum (from SGD) AND adaptive learning rates (from RMSprop). The most popular optimizer.</p>
2396
 
2397
  <div class="formula" style="background: rgba(255, 107, 53, 0.08); padding: 20px; border-radius: 8px;">
2398
  <strong>Step 1 - First Moment (Momentum):</strong><br>
 
2426
  },
2427
  "backprop": {
2428
  overview: `
2429
+ <h3>Forward & Backpropagation</h3>
2430
+ <p>The neural network training loop consists of two passes: <strong>forward propagation</strong> (compute predictions) and <strong>backpropagation</strong> (compute gradients for updates).</p>
2431
 
2432
+ <h3>Forward Propagation</h3>
2433
+ <p>The process of moving inputs through the network to produce an output:</p>
2434
+ <div class="formula">
2435
+ Input → Weighted Sum → Activation → Output
2436
+ </div>
2437
+ <p>This happens: for every batch, in every epoch, before computing loss.</p>
2438
+
2439
+ <h3>Training Pipeline</h3>
2440
+ <table>
2441
+ <tr><th>Component</th><th>Role</th></tr>
2442
+ <tr><td>Forward Propagation</td><td>Computes predictions</td></tr>
2443
+ <tr><td>Loss Function</td><td>Computes error</td></tr>
2444
+ <tr><td>Backpropagation</td><td>Computes gradients</td></tr>
2445
+ <tr><td>Gradient Descent</td><td>Updates weights</td></tr>
2446
+ </table>
2447
+
2448
+ <div class="callout warning">
2449
+ <div class="callout-title">⚠️ Key Distinction</div>
2450
+ Backpropagation does <strong>NOT</strong> update weights — it only computes gradients.<br>
2451
+ <strong>Gradient Descent</strong> (or any optimizer) does the actual weight update!
2452
+ </div>
2453
+
2454
+ <h3>Training Terminologies</h3>
2455
+ <table>
2456
+ <tr><th>Term</th><th>Meaning</th><th>Example (1000 samples, batch=100)</th></tr>
2457
+ <tr><td>Batch</td><td>Subset of data</td><td>100 samples</td></tr>
2458
+ <tr><td>Batch Size</td><td>Samples per batch</td><td>100</td></tr>
2459
+ <tr><td>Steps per Epoch</td><td>Total / Batch Size</td><td>1000/100 = 10</td></tr>
2460
+ <tr><td>Iteration</td><td>One batch update</td><td>1 step</td></tr>
2461
+ <tr><td>Epoch</td><td>One full pass of dataset</td><td>10 iterations</td></tr>
2462
+ </table>
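The example column above can be verified with a skeletal loop that only counts passes (no actual model; the numbers mirror 1000 samples with batch size 100):

```python
def count_passes(num_samples=1000, batch_size=100, epochs=3):
    steps_per_epoch = num_samples // batch_size
    forward_passes = 0
    for _ in range(epochs):               # one epoch = one full pass over the data
        for _ in range(steps_per_epoch):  # one iteration = one batch update
            forward_passes += 1           # forward → loss → backprop → optimizer step
    return steps_per_epoch, forward_passes

steps, total = count_passes()
assert steps == 10    # 1000 / 100 steps per epoch
assert total == 30    # 10 iterations × 3 epochs
```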
2463
  `,
2464
  concepts: `
2465
  <div class="formula">
 
2484
  PyTorch, TensorFlow implement automatic backprop - you define forward pass, framework does backward
2485
  </div>
2486
  </div>
2487
+
2488
+ <div class="callout tip">
2489
+ <div class="callout-title">🎤 Probable Interview Questions</div>
2490
+ 1. What is the role of bias in a perceptron?<br>
2491
+ 2. Why can't we use MSE for classification?<br>
2492
+ 3. Difference between loss function and evaluation metric?<br>
2493
+ 4. Why is mini-batch GD preferred?<br>
2494
+ 5. Does backpropagation update weights?<br>
2495
+ 6. Can gradient descent work without backpropagation?<br>
2496
+ 7. What happens if learning rate is too high?<br>
2497
+ 8. How many times does forward propagation occur per epoch?<br>
2498
+ 9. What happens if we remove bias?<br>
2499
+ 10. What is the chain rule and why is it essential for backprop?
2500
+ </div>
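Question 10 can be answered by experiment: the chain-rule gradient of a one-neuron model should agree with a finite-difference estimate (a minimal sketch; the numeric values are arbitrary):

```python
import math

def forward(w, b, x):
    z = w * x + b                          # weighted sum
    return 1 / (1 + math.exp(-z))          # sigmoid activation

def analytic_grad_w(w, b, x, y):
    # Chain rule for L = (ŷ − y)²:  dL/dw = dL/dŷ · dŷ/dz · dz/dw
    y_hat = forward(w, b, x)
    return 2 * (y_hat - y) * y_hat * (1 - y_hat) * x

def numeric_grad_w(w, b, x, y, h=1e-6):
    loss = lambda w_: (forward(w_, b, x) - y) ** 2
    return (loss(w + h) - loss(w - h)) / (2 * h)

w, b, x, y = 0.5, -0.2, 1.5, 1.0
assert abs(analytic_grad_w(w, b, x, y) - numeric_grad_w(w, b, x, y)) < 1e-6
```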
2501
  `,
2502
  math: `
2503
  <h3>The 4 Fundamental Equations of Backprop</h3>
 
2575
  <td>Computer vision (rotations, flips, crops)</td>
2576
  </tr>
2577
  </table>
2578
+
2579
+ <h3>Weight Initialization</h3>
2580
+ <p>Proper initialization prevents vanishing/exploding gradients from the very first step.</p>
2581
+ <table>
2582
+ <tr><th>Method</th><th>Formula</th><th>Best For</th></tr>
2583
+ <tr><td>Zero Init</td><td>All w = 0</td><td>❌ Never use! Breaks symmetry</td></tr>
2584
+ <tr><td>Random</td><td>w ~ N(0, 0.01)</td><td>⚠️ Vanishes in deep nets</td></tr>
2585
+ <tr><td><strong>Xavier (Glorot)</strong></td><td>w ~ N(0, 2/(n_in + n_out))</td><td>✅ Sigmoid, Tanh</td></tr>
2586
+ <tr><td><strong>He (Kaiming)</strong></td><td>w ~ N(0, 2/n_in)</td><td>✅ ReLU (default)</td></tr>
2587
+ </table>
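The table's verdicts can be reproduced by pushing a signal through 20 ReLU layers (a numpy sketch with arbitrary layer sizes; He initialization keeps the activation scale alive while N(0, 0.01) collapses it):

```python
import numpy as np

def activation_scale(init_std, depth=20, width=256, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(width)
    for _ in range(depth):
        W = rng.standard_normal((width, width)) * init_std(width)
        x = np.maximum(0, W @ x)                 # ReLU layer
    return float(x.std())

he_std    = activation_scale(lambda n_in: np.sqrt(2.0 / n_in))  # He: σ² = 2/n_in
naive_std = activation_scale(lambda n_in: 0.01)                 # small random init

assert he_std > 0.1       # signal survives all 20 layers
assert naive_std < 1e-6   # signal has effectively vanished
```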
2588
  `,
2589
  applications: `
2590
  <div class="info-box">
 
2596
  • Data Augmentation for images
2597
  </div>
2598
  </div>
2599
+
2600
+ <h3>Dropout vs Batch Normalization</h3>
2601
+ <table>
2602
+ <tr><th>Feature</th><th>Dropout</th><th>Batch Normalization</th></tr>
2603
+ <tr><td>Purpose</td><td>Regularization</td><td>Faster training + mild regularization</td></tr>
2604
+ <tr><td>Mechanism</td><td>Randomly drops neurons</td><td>Normalizes layer inputs</td></tr>
2605
+ <tr><td>Training vs Test</td><td>Different behavior</td><td>Different behavior</td></tr>
2606
+ <tr><td>Combined?</td><td colspan="2">Yes, use BatchNorm <em>before</em> Dropout</td></tr>
2607
+ </table>
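The "different behavior at train vs test time" row is the crux of Dropout. A numpy sketch of inverted dropout (the variant modern frameworks use, so inference needs no rescaling):

```python
import numpy as np

def dropout(x, p=0.5, training=True, seed=None):
    if not training:
        return x                      # inference: identity, no rescaling needed
    rng = np.random.default_rng(seed)
    mask = rng.random(x.shape) >= p   # keep each unit with probability 1 − p
    return x * mask / (1 - p)         # scale survivors to preserve the expectation

x = np.ones(10_000)
out = dropout(x, p=0.5, seed=0)
assert np.array_equal(dropout(x, training=False), x)  # test time: unchanged
assert 0.45 < (out == 0).mean() < 0.55                # ≈ p of units dropped
assert abs(out.mean() - 1.0) < 0.05                   # expectation preserved
```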
2608
+
2609
+ <div class="callout tip">
2610
+ <div class="callout-title">🎤 Probable Interview Questions</div>
2611
+ 1. Why can't we initialize all weights to zero?<br>
2612
+ 2. Difference between Xavier and He initialization?<br>
2613
+ 3. What is the vanishing gradient problem?<br>
2614
+ 4. How does Dropout prevent overfitting?<br>
2615
+ 5. Can we use Dropout at test time?<br>
2616
+ 6. Why is He initialization used with ReLU?<br>
2617
+ 7. What happens if weights are too large initially?<br>
2618
+ 8. Does Batch Normalization eliminate the need for Dropout?<br>
2619
+ 9. L1 vs L2 regularization — when to use each?<br>
2620
+ 10. What is the exploding gradient problem and how to fix it?
2621
+ </div>
2622
  `,
2623
  math: `
2624
  <h3>L2 Regularization (Weight Decay)</h3>
 
2686
  (where n = number of neurons that can be dropped)<br><br>
2687
  Each forward pass is a different architecture!
2688
  </div>
2689
+
2690
+ <h3>Weight Initialization Mathematics</h3>
2691
+
2692
+ <h4>Xavier Initialization (for Sigmoid/Tanh)</h4>
2693
+ <div class="formula">
2694
+ w ~ N(0, σ²) where σ² = 2 / (n_in + n_out)<br><br>
2695
+ Goal: Keep Var(output) ≈ Var(input) across layers
2696
+ </div>
2697
+
2698
+ <h4>He Initialization (for ReLU)</h4>
2699
+ <div class="formula">
2700
+ w ~ N(0, σ²) where σ² = 2 / n_in<br><br>
2701
+ ReLU zeros out ~50% of activations, so variance is halved → multiply by 2 to compensate!
2702
+ </div>
2703
+
2704
+ <div class="callout insight">
2705
+ <div class="callout-title">📝 Paper & Pain: Why Zero Init Fails</div>
2706
+ If all weights = 0, every neuron computes the <strong>same output</strong>.<br>
2707
+ All gradients are <strong>identical</strong> → All weights update the same way.<br>
2708
+ Result: All neurons stay identical forever! The network is as good as <strong>1 neuron</strong>.<br><br>
2709
+ <strong>Random Init:</strong> w ~ N(0, 0.01) works for shallow networks but gradients shrink exponentially in deep ones.<br>
2710
+ <strong>Xavier:</strong> Calibrates variance based on layer width → stable gradients for Sigmoid/Tanh.<br>
2711
+ <strong>He:</strong> Accounts for ReLU zeroing out negative half → default for modern networks.
2712
+ </div>
2713
  `
2714
  },
2715
  "batch-norm": {
 
4396
  },
4397
  "bert": {
4398
  overview: `
4399
+ <h3>BERT: Bidirectional Encoder Representations from Transformers</h3>
4400
+ <p><strong>Paper:</strong> "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" (Devlin et al., 2018)</p>
4401
+ <p><strong>arXiv:</strong> <a href="https://arxiv.org/abs/1810.04805" target="_blank">1810.04805</a></p>
4402
 
4403
+ <div class="callout insight">
4404
+ <div class="callout-title">🎯 Key Innovation</div>
4405
+ BERT revolutionized NLP by introducing <strong>bidirectional pre-training</strong>. Unlike previous models (GPT, ELMo) that processed text left-to-right or combined shallow bidirectional representations, BERT deeply integrates left AND right context in all layers simultaneously.
4406
+ </div>
4407
 
4408
+ <h3>Why BERT Matters</h3>
4409
  <ul>
4410
+ <li><strong>Transfer Learning for NLP:</strong> Pre-train once on massive unlabeled data, fine-tune on many tasks</li>
4411
+ <li><strong>State-of-the-Art Results:</strong> Set new records on 11 NLP tasks including SQuAD, GLUE, and SWAG</li>
4412
+ <li><strong>Efficiency:</strong> Fine-tuning requires minimal task-specific architecture changes</li>
4413
+ <li><strong>Accessibility:</strong> Google released pre-trained models publicly</li>
4414
  </ul>
4415
 
4416
+ <h3>Pre-training Corpus</h3>
4417
+ <div class="info-box">
4418
+ <div class="box-title">📚 Training Data</div>
4419
+ <div class="box-content">
4420
+ <strong>BooksCorpus:</strong> 800M words from 11,038 unpublished books<br>
4421
+ <strong>English Wikipedia:</strong> 2,500M words (text passages only, no lists/tables/headers)<br>
4422
+ <strong>Total:</strong> ~3.3 billion words
4423
+ </div>
4424
+ </div>
4425
+
4426
+ <h3>Pre-training Tasks</h3>
4427
+ <div class="list-item">
4428
+ <div class="list-num">01</div>
4429
+ <div><strong>Masked Language Modeling (MLM):</strong> Randomly mask 15% of tokens and predict them using bidirectional context. Example: "The cat [MASK] on the mat" → predict "sat"</div>
4430
+ </div>
4431
+ <div class="list-item">
4432
+ <div class="list-num">02</div>
4433
+ <div><strong>Next Sentence Prediction (NSP):</strong> Given sentence pairs (A, B), predict if B actually follows A in the corpus. Helps with tasks like QA and NLI that require understanding sentence relationships.</div>
4434
+ </div>
4435
+
4436
  <div class="callout tip">
4437
+ <div class="callout-title">💡 The BERT Fine-tuning Paradigm</div>
4438
+ 1. <strong>Pre-train</strong> BERT on BooksCorpus + Wikipedia (days/weeks on TPUs)<br>
4439
+ 2. <strong>Download</strong> pre-trained weights from Google<br>
4440
+ 3. <strong>Add</strong> task-specific head (1 layer for classification/QA/NER)<br>
4441
+ 4. <strong>Fine-tune</strong> entire model on your dataset (hours on single GPU)<br>
4442
+ 5. <strong>Achieve SOTA</strong> with as few as 3,600 labeled examples!
4443
  </div>
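The MLM corruption step can be sketched in plain Python (toy token IDs; the 80/10/10 mask/random/keep split follows the paper's recipe, −100 is the ignore index PyTorch's cross-entropy uses by convention, and the helper names here are ours):

```python
import random

MASK_ID = 103                                   # [MASK] in the standard BERT vocab

def mlm_corrupt(token_ids, vocab_size=30_000, seed=0):
    rng = random.Random(seed)
    out, labels = list(token_ids), [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() < 0.15:                 # select 15% of positions for prediction
            labels[i] = tok                     # loss is computed against the original
            r = rng.random()
            if r < 0.8:
                out[i] = MASK_ID                # 80% → [MASK]
            elif r < 0.9:
                out[i] = rng.randrange(vocab_size)  # 10% → random token
            # remaining 10% → keep the original token
    return out, labels

tokens = list(range(1000, 3000))
corrupted, labels = mlm_corrupt(tokens)
selected = sum(l != -100 for l in labels)
assert 0.10 < selected / len(tokens) < 0.20     # ≈ 15% of positions selected
```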
4444
  `,
4445
  concepts: `
4446
  <h3>BERT Architecture</h3>
4447
+ <p>BERT uses a multi-layer bidirectional Transformer encoder based on Vaswani et al. (2017).</p>
4448
+
4449
+ <h3>Model Variants</h3>
4450
+ <table>
4451
+ <tr>
4452
+ <th>Model</th>
4453
+ <th>Layers (L)</th>
4454
+ <th>Hidden Size (H)</th>
4455
+ <th>Attention Heads (A)</th>
4456
+ <th>Parameters</th>
4457
+ </tr>
4458
+ <tr>
4459
+ <td>BERT<sub>BASE</sub></td>
4460
+ <td>12</td>
4461
+ <td>768</td>
4462
+ <td>12</td>
4463
+ <td>110M</td>
4464
+ </tr>
4465
+ <tr>
4466
+ <td>BERT<sub>LARGE</sub></td>
4467
+ <td>24</td>
4468
+ <td>1024</td>
4469
+ <td>16</td>
4470
+ <td>340M</td>
4471
+ </tr>
4472
+ </table>
4473
+
4474
+ <p><em>Note: BERT<sub>BASE</sub> was designed to match GPT's size for comparison.</em></p>
4475
+
4476
+ <h3>Input Representation</h3>
4477
+ <p>BERT's input embedding is the sum of three components:</p>
4478
+
4479
  <div class="list-item">
4480
  <div class="list-num">01</div>
4481
+ <div><strong>Token Embeddings:</strong> WordPiece tokenization with 30,000 token vocabulary. Handles unknown words by splitting into subwords (e.g., "playing" → "play" + "##ing")</div>
4482
  </div>
4483
  <div class="list-item">
4484
  <div class="list-num">02</div>
4485
+ <div><strong>Segment Embeddings:</strong> Learned embedding to distinguish sentence A from sentence B (E<sub>A</sub> or E<sub>B</sub>)</div>
4486
  </div>
4487
  <div class="list-item">
4488
  <div class="list-num">03</div>
4489
+ <div><strong>Position Embeddings:</strong> Learned positional encodings (unlike Transformers' sinusoidal), supports sequences up to 512 tokens</div>
4490
  </div>
4491
+
4492
+ <div class="formula">
4493
+ Input = Token_Embedding + Segment_Embedding + Position_Embedding
4494
+ </div>
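The three-way sum above can be sketched in a few lines of PyTorch. This is a minimal illustration, not BERT's actual implementation: the hidden size here is a toy value (64 rather than 768), and the token ids are hypothetical.

```python
import torch
import torch.nn as nn

class BertEmbeddings(nn.Module):
    """Sum of token, segment, and position embeddings (toy hidden size)."""
    def __init__(self, vocab_size=30000, hidden=64, max_len=512, n_segments=2):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, hidden)
        self.seg = nn.Embedding(n_segments, hidden)
        self.pos = nn.Embedding(max_len, hidden)  # learned positions, not sinusoidal

    def forward(self, token_ids, segment_ids):
        # positions 0..seq_len-1, broadcast over the batch dimension
        positions = torch.arange(token_ids.size(1)).unsqueeze(0)
        return self.tok(token_ids) + self.seg(segment_ids) + self.pos(positions)

emb = BertEmbeddings()
token_ids = torch.tensor([[101, 2026, 3899, 102]])  # hypothetical ids for [CLS] my dog [SEP]
segment_ids = torch.zeros_like(token_ids)           # all segment A
out = emb(token_ids, segment_ids)
print(out.shape)  # torch.Size([1, 4, 64])
```

All three tables are added elementwise, so the output keeps the shape (batch, seq_len, hidden).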
+
+ <h3>Special Tokens</h3>
+ <div class="info-box">
+ <div class="box-title">🏷️ Special Token Usage</div>
+ <div class="box-content">
+ <strong>[CLS]:</strong> Prepended to every input. Final hidden state used for classification tasks<br>
+ <strong>[SEP]:</strong> Separates sentence pairs and marks sequence end<br>
+ <strong>[MASK]:</strong> Replaces masked tokens during pre-training (not used during fine-tuning)
+ </div>
  </div>

+ <h4>Example Input Format</h4>
+ <div class="formula">
+ [CLS] My dog is cute [SEP] He likes playing [SEP]<br>
+ <br>
+ Tokens: [CLS] My dog is cute [SEP] He likes play ##ing [SEP]<br>
+ Segments: E_A E_A E_A E_A E_A E_A E_B E_B E_B E_B E_B<br>
+ Positions: 0 1 2 3 4 5 6 7 8 9 10
+ </div>
+
+ <h3>Fine-tuning for Different Tasks</h3>
  <table>
+ <tr><th>Task Type</th><th>Input Format</th><th>Output</th></tr>
+ <tr>
+ <td>Classification</td>
+ <td>[CLS] text [SEP]</td>
+ <td>[CLS] representation → classifier</td>
+ </tr>
+ <tr>
+ <td>Sentence Pair</td>
+ <td>[CLS] sent A [SEP] sent B [SEP]</td>
+ <td>[CLS] representation → classifier</td>
+ </tr>
+ <tr>
+ <td>Question Answering</td>
+ <td>[CLS] question [SEP] passage [SEP]</td>
+ <td>Start/End span vectors over passage tokens</td>
+ </tr>
+ <tr>
+ <td>Token Classification</td>
+ <td>[CLS] text [SEP]</td>
+ <td>Each token representation → label</td>
+ </tr>
  </table>
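For the classification rows of the table, the task-specific head really is just one linear layer over the [CLS] vector. A minimal sketch (the random tensor stands in for BERT's sequence output, purely to show the shapes):

```python
import torch
import torch.nn as nn

class ClsHead(nn.Module):
    """Classification head over the [CLS] vector (hidden size 768 as in BERT_BASE)."""
    def __init__(self, hidden=768, n_classes=2, p_drop=0.1):
        super().__init__()
        self.drop = nn.Dropout(p_drop)
        self.fc = nn.Linear(hidden, n_classes)

    def forward(self, sequence_output):
        cls_vec = sequence_output[:, 0]  # [CLS] is always the first token
        return self.fc(self.drop(cls_vec))

head = ClsHead()
fake_bert_output = torch.randn(4, 128, 768)  # (batch, seq_len, hidden) stand-in for BERT
logits = head(fake_bert_output)
print(logits.shape)  # torch.Size([4, 2])
```

During fine-tuning, gradients flow through this head into all of BERT's layers, which is why the whole model adapts to the task.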
  `,
  math: `
+ <h3>Pre-training Objective</h3>
+ <p>BERT simultaneously optimizes two unsupervised tasks:</p>
+
+ <div class="formula">
+ L = L<sub>MLM</sub> + L<sub>NSP</sub>
+ </div>
+
  <h3>Masked Language Modeling (MLM)</h3>
+
+ <div class="callout insight">
+ <div class="callout-title">📝 Paper & Pain: The Masking Strategy</div>
+ <strong>Problem:</strong> Standard left-to-right language modeling can't capture bidirectional context.<br>
+ <strong>Solution:</strong> Randomly mask 15% of tokens and predict them using full context.<br><br>
+
+ <strong>However:</strong> The [MASK] token never appears during fine-tuning, creating a pre-train/fine-tune mismatch!<br>
+ <strong>Clever Fix:</strong> Of the 15% selected tokens:<br>
+ • 80% → Replace with [MASK]<br>
+ • 10% → Replace with random token<br>
+ • 10% → Keep unchanged<br><br>
+
+ This forces the model to maintain context representations for ALL tokens!
+ </div>
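The 80/10/10 selection rule above can be sketched in plain Python. This is a toy version working over token strings; real BERT applies it to WordPiece ids, and the vocabulary here is made up for illustration.

```python
import random

MASK = "[MASK]"
VOCAB = ["the", "cat", "sat", "on", "mat", "dog", "ran"]  # toy vocabulary

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Select ~mask_prob of tokens, then apply the 80/10/10 rule.
    Returns (corrupted tokens, {position: original token} to predict)."""
    rng = random.Random(seed)
    out, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue
        targets[i] = tok
        r = rng.random()
        if r < 0.8:
            out[i] = MASK               # 80%: replace with [MASK]
        elif r < 0.9:
            out[i] = rng.choice(VOCAB)  # 10%: replace with a random token
        # remaining 10%: keep the token unchanged (but still predict it)
    return out, targets
```

Note that the loss is computed only over positions in `targets`; the unchanged 10% still contribute to the loss, which is what keeps every token's representation honest.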
+
+ <h4>MLM Loss Derivation</h4>
+ <p>Let's work through the MLM objective step by step:</p>

  <div class="formula">
+ Given input sequence: x = [x₁, x₂, ..., x_n]<br>
+ Masked sequence: x̃ = [x̃₁, x̃₂, ..., x̃_n]<br>
+ <br>
+ Let M = {i₁, i₂, ..., i_m} be indices of masked tokens<br>
+ <br>
+ For each masked position i ∈ M:<br>
+ h_i = BERT(x̃)_i (hidden state at position i)<br>
+ logits_i = W · h_i + b (W ∈ ℝ^(V×H), vocab size V)<br>
+ P(x_i | x̃) = softmax(logits_i)<br>
  <br>
+ Cross-entropy loss per token:<br>
+ L_i = -log P(x_i | x̃)<br>
+ <br>
+ Total MLM loss:<br>
+ L_MLM = (1/|M|) Σ_{i∈M} L_i<br>
+ L_MLM = -(1/|M|) Σ_{i∈M} log P(x_i | x̃)
  </div>

+ <div class="callout warning">
+ <div class="callout-title">📊 Worked Example: MLM Calculation</div>
+ <strong>Input:</strong> "The cat sat on the mat"<br>
+ <strong>After masking (15%):</strong> "The [MASK] sat on the mat"<br>
+ <strong>Target:</strong> Predict "cat" at position 2<br><br>
+
+ <strong>Step 1:</strong> Forward pass through BERT<br>
+ h₂ = BERT(x̃)₂ ∈ ℝ^768 (for BERT_BASE)<br><br>
+
+ <strong>Step 2:</strong> Project to vocabulary space<br>
+ logits₂ = W · h₂ + b ∈ ℝ^30000<br><br>
+
+ <strong>Step 3:</strong> Compute probabilities<br>
+ P(w | x̃) = exp(logits₂[w]) / Σ_v exp(logits₂[v])<br><br>
+
+ <strong>Step 4:</strong> Compute loss (assume P("cat"|x̃) = 0.73)<br>
+ L = -log(0.73) = 0.315
+ </div>
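Step 4 of the worked example checks out numerically; the averaged loss from the derivation is a one-liner (the 0.50 in the second call is an invented probability, just to show averaging over multiple masked positions):

```python
import math

def mlm_loss(masked_token_probs):
    """L_MLM = -(1/|M|) * sum over masked positions of log P(x_i | x_tilde)."""
    return -sum(math.log(p) for p in masked_token_probs) / len(masked_token_probs)

# One masked token, model assigns P("cat" | context) = 0.73
print(round(mlm_loss([0.73]), 3))        # 0.315

# Averaging over two masked tokens with probabilities 0.73 and 0.50
print(round(mlm_loss([0.73, 0.50]), 3))  # 0.504
```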
+
+ <h3>Next Sentence Prediction (NSP)</h3>
+ <p>Binary classification task to understand sentence relationships.</p>
+
+ <div class="formula">
+ Input: [CLS] sentence_A [SEP] sentence_B [SEP]<br>
+ <br>
+ Let C = final hidden state of [CLS] token ∈ ℝ^H<br>
+ <br>
+ P(IsNext = True) = σ(W_NSP · C)<br>
+ where σ = sigmoid function, W_NSP ∈ ℝ^(1×H)<br>
+ <br>
+ Binary cross-entropy loss:<br>
+ L_NSP = -[y·log(ŷ) + (1-y)·log(1-ŷ)]<br>
+ where ŷ = P(IsNext = True) and y = 1 if B follows A, else 0
+ </div>
+
+ <h4>NSP Training Data Generation</h4>
+ <ul>
+ <li><strong>50% IsNext:</strong> B actually follows A in corpus</li>
+ <li><strong>50% NotNext:</strong> B sampled randomly from another document</li>
+ </ul>
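The 50/50 generation scheme above can be sketched as follows. This is a simplification: it samples the "random" sentence from any document (which can occasionally coincide with the true next sentence), whereas a careful implementation re-samples to guarantee a genuine NotNext pair.

```python
import random

def make_nsp_examples(docs, n, seed=0):
    """Build (sentence_a, sentence_b, label) triples from a list of documents,
    where each document is a list of sentences. Label 1 = IsNext, 0 = NotNext."""
    rng = random.Random(seed)
    examples = []
    for _ in range(n):
        doc = rng.choice(docs)
        i = rng.randrange(len(doc) - 1)
        if rng.random() < 0.5:
            examples.append((doc[i], doc[i + 1], 1))                    # consecutive pair
        else:
            examples.append((doc[i], rng.choice(rng.choice(docs)), 0))  # random pair
    return examples
```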
+
+ <h3>Fine-tuning Math: Question Answering (SQuAD)</h3>
+
+ <div class="formula">
+ Input: [CLS] question [SEP] paragraph [SEP]<br>
+ <br>
+ Let T_i = final hidden state for token i in paragraph<br>
+ <br>
+ Start position logits: S_i = W_start · T_i<br>
+ End position logits: E_i = W_end · T_i<br>
+ <br>
+ P(start = i) = softmax(S)_i<br>
+ P(end = j) = softmax(E)_j<br>
+ <br>
+ Answer span = tokens from position i to j<br>
+ <br>
+ Training loss:<br>
+ L = -log P(start = i*) - log P(end = j*)<br>
+ where i*, j* are ground truth positions
+ </div>
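At inference time, the usual decoding rule for this objective picks the pair (i, j) with i ≤ j that maximizes S_i + E_j, rather than taking the two argmaxes independently (which could yield an end before the start). A sketch over plain-list logits:

```python
def best_span(start_logits, end_logits, max_answer_len=10):
    """Return (i, j) with i <= j maximizing start_logits[i] + end_logits[j],
    constrained to spans of at most max_answer_len tokens."""
    best_score, best = float("-inf"), (0, 0)
    n = len(start_logits)
    for i in range(n):
        for j in range(i, min(n, i + max_answer_len)):
            score = start_logits[i] + end_logits[j]
            if score > best_score:
                best_score, best = score, (i, j)
    return best

print(best_span([0.1, 5.0, 1.0], [1.0, 0.2, 6.0]))  # (1, 2)
```

The independent argmaxes would also give (1, 2) here, but the constrained search additionally guarantees a well-formed span in adversarial cases.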
  `,
  applications: `
+ <h3>SQuAD Benchmark Performance</h3>
+
+ <div class="info-box">
+ <div class="box-title">🏆 Stanford Question Answering Dataset (SQuAD)</div>
+ <div class="box-content">
+ <strong>SQuAD 1.1:</strong> 100,000+ question-answer pairs on 500+ Wikipedia articles. Every question has an answer span in the passage.<br><br>
+ <strong>SQuAD 2.0:</strong> Adds 50,000+ unanswerable questions. Models must determine when no answer exists.<br><br>
+ <strong>Evaluation Metrics:</strong><br>
+ • <strong>EM (Exact Match):</strong> % of predictions matching ground truth exactly<br>
+ • <strong>F1:</strong> Token-level overlap between prediction and ground truth
+ </div>
+ </div>
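The two metrics are simple to compute. The sketch below is a simplified version: the official SQuAD evaluation script also strips articles and punctuation during normalization, while this one only lowercases and splits on whitespace.

```python
from collections import Counter

def exact_match(pred, gold):
    """EM: 1 if the normalized strings are identical (here: lowercase + strip)."""
    return int(pred.strip().lower() == gold.strip().lower())

def f1_score(pred, gold):
    """Token-level F1: harmonic mean of precision and recall over word overlap."""
    p_toks, g_toks = pred.lower().split(), gold.lower().split()
    overlap = sum((Counter(p_toks) & Counter(g_toks)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(p_toks)
    recall = overlap / len(g_toks)
    return 2 * precision * recall / (precision + recall)

print(exact_match("France", "france"))            # 1
print(round(f1_score("in France", "France"), 3))  # 0.667
```

The second call shows why F1 is the softer metric: the prediction "in France" fails EM but still earns partial credit for containing the gold answer.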
+
+ <h3>SQuAD 1.1 Results</h3>
+ <table>
+ <tr><th>Model</th><th>EM</th><th>F1</th></tr>
+ <tr><td>Human Performance</td><td>82.3</td><td>91.2</td></tr>
+ <tr><td>BERT<sub>BASE</sub></td><td>80.8</td><td>88.5</td></tr>
+ <tr><td>BERT<sub>LARGE</sub></td><td><strong>84.1</strong></td><td><strong>90.9</strong></td></tr>
+ </table>
+ <p><em>BERT<sub>LARGE</sub> surpassed human performance on EM!</em></p>
+
+ <h3>SQuAD 2.0 Results</h3>
+ <table>
+ <tr><th>Model</th><th>EM</th><th>F1</th></tr>
+ <tr><td>Human Performance</td><td>86.9</td><td>89.5</td></tr>
+ <tr><td>BERT<sub>BASE</sub></td><td>73.7</td><td>76.3</td></tr>
+ <tr><td>BERT<sub>LARGE</sub></td><td><strong>78.7</strong></td><td><strong>81.9</strong></td></tr>
+ </table>
+
+ <div class="callout tip">
+ <div class="callout-title">💡 Example SQuAD Question</div>
+ <strong>Passage:</strong> "The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France."<br><br>
+ <strong>Question:</strong> "In what country is Normandy located?"<br><br>
+ <strong>BERT Answer:</strong> "France" ✓<br>
+ <strong>Start Token:</strong> position 32<br>
+ <strong>End Token:</strong> position 32
+ </div>
+
+ <h3>GLUE Benchmark (General Language Understanding Evaluation)</h3>
+ <p>BERT set new state-of-the-art on all 9 GLUE tasks:</p>
+ <table>
+ <tr><th>Task</th><th>Metric</th><th>Previous SOTA</th><th>BERT<sub>LARGE</sub></th></tr>
+ <tr><td>MNLI (NLI)</td><td>Acc</td><td>86.6</td><td><strong>86.7</strong></td></tr>
+ <tr><td>QQP (Paraphrase)</td><td>F1</td><td>66.1</td><td><strong>72.1</strong></td></tr>
+ <tr><td>QNLI (QA/NLI)</td><td>Acc</td><td>87.4</td><td><strong>92.7</strong></td></tr>
+ <tr><td>SST-2 (Sentiment)</td><td>Acc</td><td>93.2</td><td><strong>94.9</strong></td></tr>
+ <tr><td>CoLA (Acceptability)</td><td>Matthews corr.</td><td>35.0</td><td><strong>60.5</strong></td></tr>
+ </table>
+
+ <h3>Additional Applications</h3>
+
  <div class="info-box">
+ <div class="box-title">🔍 Google Search</div>
  <div class="box-content">
+ In October 2019, Google began using BERT for 1 in 10 English search queries, calling it one of the biggest leaps forward in the history of Search. BERT helps understand search intent and context.
  </div>
  </div>
+
+ <div class="info-box">
+ <div class="box-title">🏷️ Named Entity Recognition (NER)</div>
+ <div class="box-content">
+ BERT excels at identifying entities (person, location, organization) in text by treating it as token classification. Each token gets a label (B-PER, I-PER, B-LOC, etc.).
+ </div>
+ </div>
+
  <div class="info-box">
  <div class="box-title">📊 Text Classification</div>
+ <div class="box-content">
+ Sentiment analysis, topic classification, and spam detection all benefit from BERT's contextual understanding. Simply use the [CLS] representation with a classifier.
+ </div>
+ </div>
+
+ <h3>Using BERT: Quick Code Example</h3>
+ <div class="formula">
+ # Using Hugging Face Transformers<br>
+ from transformers import BertTokenizer, BertForQuestionAnswering<br>
+ import torch<br>
+ <br>
+ # Load pre-trained model and tokenizer<br>
+ tokenizer = BertTokenizer.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')<br>
+ model = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')<br>
+ <br>
+ # Example<br>
+ question = "What is BERT?"<br>
+ context = "BERT is a bidirectional Transformer for NLP."<br>
+ <br>
+ # Tokenize and get answer<br>
+ inputs = tokenizer(question, context, return_tensors='pt')<br>
+ outputs = model(**inputs)<br>
+ <br>
+ start_idx = torch.argmax(outputs.start_logits)<br>
+ end_idx = torch.argmax(outputs.end_logits)<br>
+ answer = tokenizer.convert_tokens_to_string(<br>
+ tokenizer.convert_ids_to_tokens(inputs['input_ids'][0][start_idx:end_idx+1])<br>
+ )<br>
+ print(answer) # e.g. "a bidirectional transformer for nlp" (uncased model lowercases)
  </div>
  `
  },