Commit 797e92e · Parent(s): 3ccb99b
feat: integrate PDF book notes into DL modules
- Optimizers: Added AdaGrad, comparison table, 10 interview questions
- Activation Functions: Added Swish, GELU, Dead Neurons, How-to-Choose, 10 IQs
- Loss Functions: Added Huber Loss, Hinge Loss, comparison table, 10 IQs
- Backpropagation: Added Forward Prop, Training Pipeline, Training Terms, 10 IQs
- Regularization: Added Weight Init (Xavier/He), Dropout vs BatchNorm, 10 IQs
Total: 50 interview questions from DeppLEarning.pdf integrated
DeepLearning/index.html CHANGED (+588, −50)
  943  <td>Multi-class output</td>
  944  <td>Computationally expensive</td>
  945  </tr>
+ 946  <tr>
+ 947  <td>GELU</td>
+ 948  <td>(-0.17, ∞)</td>
+ 949  <td>Transformers (BERT, GPT)</td>
+ 950  <td>Computationally expensive</td>
+ 951  </tr>
+ 952  <tr>
+ 953  <td>Swish</td>
+ 954  <td>(-0.28, ∞)</td>
+ 955  <td>Deep networks (40+ layers)</td>
+ 956  <td>Slightly slower than ReLU</td>
+ 957  </tr>
  958  </table>
  959  `,
  960  concepts: `

  984  • Try <strong>Leaky ReLU</strong> or <strong>ELU</strong> if ReLU neurons are dying<br>
  985  • Avoid Sigmoid/Tanh in deep networks (gradient vanishing)
  986  </div>
+ 987
+ 988  <div class="callout warning">
+ 989  <div class="callout-title">⚠️ Dead Neurons (Dying ReLU Problem)</div>
+ 990  When a neuron's input is always negative, ReLU outputs 0 and its gradient is 0.<br>
+ 991  The neuron <strong>never updates</strong> — it's permanently "dead".<br><br>
+ 992  <strong>Solutions:</strong><br>
+ 993  • Use <strong>Leaky ReLU</strong> (small slope for negative values)<br>
+ 994  • Use <strong>ELU</strong> (exponential for negative values)<br>
+ 995  • Careful weight initialization (He Initialization)
+ 996  </div>
+ 997
+ 998  <h3>GELU (Gaussian Error Linear Unit)</h3>
+ 999  <p>Used in <strong>Transformers, BERT, and GPT</strong>. GELU multiplies the input by the probability that it's positive under a Gaussian distribution.</p>
+ 1000  <div class="formula">
+ 1001  GELU(x) = x × Φ(x) ≈ 0.5x(1 + tanh(√(2/π)(x + 0.044715x³)))
+ 1002  </div>
+ 1003
+ 1004  <h3>Swish (Self-Gated Activation)</h3>
+ 1005  <p>Developed by Google researchers. Consistently matches or outperforms ReLU on deep networks.</p>
+ 1006  <div class="formula">
+ 1007  Swish(x) = x × σ(βx) where σ = Sigmoid, β = learnable parameter
+ 1008  </div>
+ 1009  <div class="callout tip">
+ 1010  <div class="callout-title">💡 Why Swish is Better</div>
+ 1011  • <strong>Smooth:</strong> Doesn't abruptly change direction like ReLU at x=0<br>
+ 1012  • <strong>Non-monotonic:</strong> Small negative values preserved (not zeroed like ReLU)<br>
+ 1013  • <strong>Unbounded above, bounded below:</strong> Best of both worlds<br>
+ 1014  • Best for networks with depth > 40 layers
+ 1015  </div>
+ 1016
+ 1017  <h3>How to Choose Activation Functions</h3>
+ 1018  <table>
+ 1019  <tr><th>Layer / Task</th><th>Recommended</th></tr>
+ 1020  <tr><td>Hidden layers (default)</td><td>ReLU</td></tr>
+ 1021  <tr><td>Regression output</td><td>Linear (no activation)</td></tr>
+ 1022  <tr><td>Binary classification output</td><td>Sigmoid</td></tr>
+ 1023  <tr><td>Multi-class classification</td><td>Softmax</td></tr>
+ 1024  <tr><td>Multi-label classification</td><td>Sigmoid</td></tr>
+ 1025  <tr><td>CNN hidden layers</td><td>ReLU</td></tr>
+ 1026  <tr><td>RNN hidden layers</td><td>Tanh / Sigmoid</td></tr>
+ 1027  <tr><td>Transformers</td><td>GELU</td></tr>
+ 1028  <tr><td>Deep networks (40+ layers)</td><td>Swish</td></tr>
+ 1029  </table>
  1030  `,
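As a quick sanity check on the two formulas above, here is a minimal pure-Python sketch (function names are ours, not from the page). `gelu_tanh` is the tanh approximation quoted in the GELU formula:

```python
import math

def gelu(x):
    # Exact GELU: x * Phi(x), where Phi is the standard normal CDF
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    # Tanh approximation from the formula above
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x**3)))

def swish(x, beta=1.0):
    # Swish: x * sigmoid(beta * x); beta can be a learnable parameter
    return x / (1.0 + math.exp(-beta * x))
```

With beta = 1, Swish bottoms out near -0.28 (around x ≈ -1.28), which is where the (-0.28, ∞) range in the table comes from.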
  1031  applications: `
  1032  <div class="info-box">
  ...
  1041  Different tasks need different outputs: Sigmoid for binary, Softmax for multi-class, Linear for regression
  1042  </div>
  1043  </div>
+ 1044
+ 1045  <div class="callout tip">
+ 1046  <div class="callout-title">🎤 Probable Interview Questions</div>
+ 1047  1. Why do we need activation functions?<br>
+ 1048  2. What is vanishing gradient?<br>
+ 1049  3. Why is ReLU preferred over sigmoid?<br>
+ 1050  4. What are dead neurons?<br>
+ 1051  5. Difference between ReLU and Leaky ReLU?<br>
+ 1052  6. Why softmax instead of sigmoid for multiclass?<br>
+ 1053  7. Why linear activation for regression output?<br>
+ 1054  8. Why is GELU used in transformers?<br>
+ 1055  9. Can activation function affect convergence speed?<br>
+ 1056  10. What happens if we remove activation functions?
+ 1057  </div>
  1058  `,
  1059  math: `
  1060  <h3>Derivatives: The Backprop Fuel</h3>

  2137  <h3>Common Loss Functions</h3>
  2138  <div class="list-item">
  2139  <div class="list-num">01</div>
+ 2140  <div><strong>MSE:</strong> (1/n)Σ(y - ŷ)² - Penalizes large errors, sensitive to outliers</div>
  2141  </div>
  2142  <div class="list-item">
  2143  <div class="list-num">02</div>
+ 2144  <div><strong>MAE:</strong> (1/n)Σ|y - ŷ| - Robust to outliers, constant gradient, slower convergence</div>
+ 2145  </div>
+ 2146  <div class="list-item">
+ 2147  <div class="list-num">03</div>
+ 2148  <div><strong>Huber Loss:</strong> MSE when |error| ≤ δ, MAE otherwise. Best of both — smooth + robust to outliers</div>
+ 2149  </div>
+ 2150  <div class="list-item">
+ 2151  <div class="list-num">04</div>
+ 2152  <div><strong>BCE (Binary Cross-Entropy):</strong> -[y·log(ŷ) + (1-y)·log(1-ŷ)] - Used with Sigmoid</div>
+ 2153  </div>
+ 2154  <div class="list-item">
+ 2155  <div class="list-num">05</div>
+ 2156  <div><strong>CCE (Categorical Cross-Entropy):</strong> -Σ y·log(ŷ) - Used with Softmax for multi-class</div>
+ 2157  </div>
+ 2158  <div class="list-item">
+ 2159  <div class="list-num">06</div>
+ 2160  <div><strong>Hinge Loss:</strong> max(0, 1 - y·ŷ) where y ∈ {-1, +1} - Margin-based, SVM-style</div>
  2161  </div>
  2162  `,
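The regression and margin losses in the list above can be sketched in a few lines of plain Python (an illustration; function names are ours):

```python
def mse(y, yhat):
    # Mean squared error: penalizes large errors quadratically
    return sum((t - p) ** 2 for t, p in zip(y, yhat)) / len(y)

def mae(y, yhat):
    # Mean absolute error: robust to outliers, constant gradient
    return sum(abs(t - p) for t, p in zip(y, yhat)) / len(y)

def huber(y, yhat, delta=1.0):
    # Quadratic inside |error| <= delta, linear outside
    total = 0.0
    for t, p in zip(y, yhat):
        e = abs(t - p)
        total += 0.5 * e * e if e <= delta else delta * e - 0.5 * delta * delta
    return total / len(y)

def hinge(y, yhat):
    # y in {-1, +1}: zero loss once the prediction clears the margin y*yhat >= 1
    return sum(max(0.0, 1.0 - t * p) for t, p in zip(y, yhat)) / len(y)
```

Feeding one small error and one large error through `huber` shows the switch: with delta = 1, an error of 1 costs 0.5 (MSE branch) while an error of 3 costs 2.5 (linear branch) instead of MSE's 4.5.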
  2163  applications: `
  ...
  2173  Business-specific objectives: Focal Loss (imbalanced data), Dice Loss (segmentation), Contrastive Loss (similarity learning)
  2174  </div>
  2175  </div>
+ 2176
+ 2177  <h3>Loss Function Comparison</h3>
+ 2178  <table>
+ 2179  <tr><th>Loss</th><th>Type</th><th>Outlier Sensitivity</th><th>Key Property</th></tr>
+ 2180  <tr><td>MSE</td><td>Regression</td><td>High</td><td>Penalizes large errors heavily</td></tr>
+ 2181  <tr><td>MAE</td><td>Regression</td><td>Low</td><td>Robust, constant gradient</td></tr>
+ 2182  <tr><td>Huber</td><td>Regression</td><td>Medium</td><td>Smooth + robust (MSE+MAE combo)</td></tr>
+ 2183  <tr><td>BCE</td><td>Binary Class.</td><td>High</td><td>Strong gradients for wrong predictions</td></tr>
+ 2184  <tr><td>CCE</td><td>Multi-class</td><td>High</td><td>Outputs probabilities via Softmax</td></tr>
+ 2185  <tr><td>Hinge</td><td>Binary Class.</td><td>Medium</td><td>Margin-based, less probabilistic</td></tr>
+ 2186  </table>
+ 2187
+ 2188  <div class="callout tip">
+ 2189  <div class="callout-title">🎤 Probable Interview Questions</div>
+ 2190  1. Difference between MSE and MAE?<br>
+ 2191  2. Why is Huber loss sometimes preferred?<br>
+ 2192  3. Why BCE with sigmoid?<br>
+ 2193  4. Why softmax with CCE?<br>
+ 2194  5. Why can't we use MSE for classification?<br>
+ 2195  6. What is Hinge loss and where is it used?<br>
+ 2196  7. Difference between loss function and evaluation metric?<br>
+ 2197  8. How does loss choice affect gradients?<br>
+ 2198  9. What is Focal Loss and when to use it?<br>
+ 2199  10. Can we design custom loss functions?
+ 2200  </div>
  2201  `,
  2202  math: `
  2203  <h3>Binary Cross-Entropy (BCE) Derivation</h3>

  2207  L(ŷ, y) = -(y log(ŷ) + (1-y) log(1-ŷ))
  2208  </div>
  2209
+ 2210  <h3>Huber Loss (Smooth MAE)</h3>
+ 2211  <p>Combines MSE for small errors and MAE for large errors using threshold δ:</p>
+ 2212  <div class="formula">
+ 2213  L = ½(y - ŷ)² when |y - ŷ| ≤ δ<br>
+ 2214  L = δ|y - ŷ| - ½δ² otherwise
+ 2215  </div>
+ 2216  <div class="callout insight">
+ 2217  <div class="callout-title">📝 Paper & Pain: Huber Intuition</div>
+ 2218  <strong>Small error (|error| ≤ δ):</strong> Behaves like MSE — smooth, differentiable<br>
+ 2219  <strong>Large error (|error| > δ):</strong> Behaves like MAE — doesn't blow up for outliers<br><br>
+ 2220  Best of both worlds! Used when data contains mild outliers.
+ 2221  </div>
+ 2222
+ 2223  <h3>Hinge Loss (SVM-style)</h3>
+ 2224  <div class="formula">
+ 2225  L = (1/n) Σ max(0, 1 - y·ŷ) where y ∈ {-1, +1}
+ 2226  </div>
+ 2227  <p>Margin-based loss: it only penalizes predictions that fall inside the margin or on the wrong side of it. Used in SVMs and some neural network classifiers.</p>
+ 2228
  2229  <h3>Paper & Pain: Why not MSE for Classification?</h3>
  2230  <p>If we use MSE for sigmoid output, the gradient is:</p>
  2231  <div class="formula">

  2300  CNNs: SGD+Momentum | Transformers: AdamW | RNNs: RMSprop | Default: Adam
  2301  </div>
  2302  </div>
+ 2303
+ 2304  <h3>Optimizer Comparison</h3>
+ 2305  <table>
+ 2306  <tr><th>Optimizer</th><th>Key Idea</th><th>Problem</th></tr>
+ 2307  <tr><td>SGD</td><td>Simple, fast</td><td>Noisy convergence</td></tr>
+ 2308  <tr><td>Momentum</td><td>Smooths updates</td><td>Needs tuning</td></tr>
+ 2309  <tr><td>AdaGrad</td><td>Adaptive LR</td><td>LR shrinks too much</td></tr>
+ 2310  <tr><td>RMSProp</td><td>Fixes AdaGrad</td><td>No momentum</td></tr>
+ 2311  <tr><td><strong>Adam</strong></td><td><strong>Best of all</strong></td><td>Slightly more computation</td></tr>
+ 2312  </table>
+ 2313
+ 2314  <div class="callout tip">
+ 2315  <div class="callout-title">🎤 Probable Interview Questions</div>
+ 2316  1. Difference between optimizer and gradient descent?<br>
+ 2317  2. Why does SGD oscillate?<br>
+ 2318  3. Why does AdaGrad fail in deep networks?<br>
+ 2319  4. How does RMSProp fix AdaGrad?<br>
+ 2320  5. Why is bias correction needed in Adam?<br>
+ 2321  6. What happens if learning rate is too high?<br>
+ 2322  7. When would you prefer SGD over Adam?<br>
+ 2323  8. What is momentum intuitively?<br>
+ 2324  9. Why is Adam the default choice?<br>
+ 2325  10. Can Adam overfit?
+ 2326  </div>
  2327  `,
  2328  math: `
  2329  <h3>Gradient Descent: The Foundation</h3>

  2363  Momentum accumulates past gradients for faster convergence.
  2364  </div>
  2365
+ 2366  <h3>3. AdaGrad (Adaptive Gradient)</h3>
+ 2367  <p>Adapts the learning rate per parameter, based on how frequently each parameter is updated.</p>
+ 2368
+ 2369  <div class="formula">
+ 2370  <strong>Accumulated Gradient:</strong><br>
+ 2371  G_t = G_{t-1} + (∇L)²<br><br>
+ 2372  <strong>Update Rule:</strong><br>
+ 2373  w_{t+1} = w_t - η / √(G_t + ε) × ∇L<br><br>
+ 2374  Where ε = 1e-8 (numerical stability)
+ 2375  </div>
+ 2376
+ 2377  <div class="callout insight">
+ 2378  <div class="callout-title">📝 Paper & Pain: AdaGrad Intuition</div>
+ 2379  <strong>Frequently updated parameters</strong> → G_t grows fast → learning rate shrinks<br>
+ 2380  <strong>Rarely updated parameters</strong> → G_t stays small → learning rate stays large<br><br>
+ 2381  <strong>Problem:</strong> G_t only accumulates (never forgets), so the learning rate keeps shrinking and training may stop early!<br>
+ 2382  <strong>This is exactly why RMSprop was invented →</strong>
+ 2383  </div>
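The accumulate-versus-decay difference can be simulated in a few lines (a sketch under the assumption of a constant gradient; this is not code from the diff):

```python
grad = 1.0                     # assume the same gradient arrives at every step
eta, eps, beta = 0.1, 1e-8, 0.9
G, v = 0.0, 0.0

for _ in range(100):
    G = G + grad ** 2                      # AdaGrad: sum of ALL squared gradients
    v = beta * v + (1 - beta) * grad ** 2  # RMSprop: decaying average, forgets old ones

adagrad_step = eta / (G + eps) ** 0.5   # ~ eta / sqrt(100) = 0.01, and still shrinking
rmsprop_step = eta / (v + eps) ** 0.5   # settles near eta = 0.1
```

After 100 identical gradients, AdaGrad's effective step has shrunk by 10x while RMSprop's has stabilized, which is the failure mode the callout above describes.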
+ 2384
+ 2385  <h3>4. RMSprop (Root Mean Square Propagation)</h3>
+ 2386  <p>Fixes AdaGrad's shrinking problem by using a <strong>decaying average</strong> of recent squared gradients instead of a sum over all of them.</p>
  2387
  2388  <div class="formula">
  2389  v_t = β × v_{t-1} + (1-β) × (∇L)²<br>
  ...
  2391  β = 0.9, ε = 1e-8 (numerical stability)
  2392  </div>
  2393
+ 2394  <h3>5. Adam (Adaptive Moment Estimation)</h3>
+ 2395  <p>Combines momentum (from SGD) AND adaptive learning rates (from RMSprop). The most popular optimizer.</p>
  2396
  2397  <div class="formula" style="background: rgba(255, 107, 53, 0.08); padding: 20px; border-radius: 8px;">
  2398  <strong>Step 1 - First Moment (Momentum):</strong><br>

  2426  },
  2427  "backprop": {
  2428  overview: `
+ 2429  <h3>Forward & Backpropagation</h3>
+ 2430  <p>The neural network training loop consists of two passes: <strong>forward propagation</strong> (compute predictions) and <strong>backpropagation</strong> (compute gradients for updates).</p>
  2431
+ 2432  <h3>Forward Propagation</h3>
+ 2433  <p>The process of moving inputs through the network to produce an output:</p>
+ 2434  <div class="formula">
+ 2435  Input → Weighted Sum → Activation → Output
+ 2436  </div>
+ 2437  <p>This happens for every batch, in every epoch, before the loss is computed.</p>
+ 2438
+ 2439  <h3>Training Pipeline</h3>
+ 2440  <table>
+ 2441  <tr><th>Component</th><th>Role</th></tr>
+ 2442  <tr><td>Forward Propagation</td><td>Computes predictions</td></tr>
+ 2443  <tr><td>Loss Function</td><td>Computes error</td></tr>
+ 2444  <tr><td>Backpropagation</td><td>Computes gradients</td></tr>
+ 2445  <tr><td>Gradient Descent</td><td>Updates weights</td></tr>
+ 2446  </table>
+ 2447
+ 2448  <div class="callout warning">
+ 2449  <div class="callout-title">⚠️ Key Distinction</div>
+ 2450  Backpropagation does <strong>NOT</strong> update weights — it only computes gradients.<br>
+ 2451  <strong>Gradient Descent</strong> (or any optimizer) does the actual weight update!
+ 2452  </div>
+ 2453
+ 2454  <h3>Training Terminologies</h3>
+ 2455  <table>
+ 2456  <tr><th>Term</th><th>Meaning</th><th>Example (1000 samples, batch=100)</th></tr>
+ 2457  <tr><td>Batch</td><td>Subset of data</td><td>100 samples</td></tr>
+ 2458  <tr><td>Batch Size</td><td>Samples per batch</td><td>100</td></tr>
+ 2459  <tr><td>Steps per Epoch</td><td>Total / Batch Size</td><td>1000/100 = 10</td></tr>
+ 2460  <tr><td>Iteration</td><td>One batch update</td><td>1 step</td></tr>
+ 2461  <tr><td>Epoch</td><td>One full pass of the dataset</td><td>10 iterations</td></tr>
+ 2462  </table>
  2463  `,
  2464  concepts: `
  2465  <div class="formula">
  ...
  2484  PyTorch, TensorFlow implement automatic backprop - you define the forward pass, the framework does the backward pass
  2485  </div>
  2486  </div>
+ 2487
+ 2488  <div class="callout tip">
+ 2489  <div class="callout-title">🎤 Probable Interview Questions</div>
+ 2490  1. What is the role of bias in a perceptron?<br>
+ 2491  2. Why can't we use MSE for classification?<br>
+ 2492  3. Difference between loss function and evaluation metric?<br>
+ 2493  4. Why is mini-batch GD preferred?<br>
+ 2494  5. Does backpropagation update weights?<br>
+ 2495  6. Can gradient descent work without backpropagation?<br>
+ 2496  7. What happens if learning rate is too high?<br>
+ 2497  8. How many times does forward propagation occur per epoch?<br>
+ 2498  9. What happens if we remove bias?<br>
+ 2499  10. What is the chain rule and why is it essential for backprop?
+ 2500  </div>
  2501  `,
  2502  math: `
  2503  <h3>The 4 Fundamental Equations of Backprop</h3>

  2575  <td>Computer vision (rotations, flips, crops)</td>
  2576  </tr>
  2577  </table>
+ 2578
+ 2579  <h3>Weight Initialization</h3>
+ 2580  <p>Proper initialization prevents vanishing/exploding gradients from the very first step.</p>
+ 2581  <table>
+ 2582  <tr><th>Method</th><th>Formula</th><th>Best For</th></tr>
+ 2583  <tr><td>Zero Init</td><td>All w = 0</td><td>❌ Never use! Fails to break symmetry</td></tr>
+ 2584  <tr><td>Random</td><td>w ~ N(0, 0.01)</td><td>⚠️ Vanishes in deep nets</td></tr>
+ 2585  <tr><td><strong>Xavier (Glorot)</strong></td><td>w ~ N(0, σ²), σ² = 2/(n_in + n_out)</td><td>✅ Sigmoid, Tanh</td></tr>
+ 2586  <tr><td><strong>He (Kaiming)</strong></td><td>w ~ N(0, σ²), σ² = 2/n_in</td><td>✅ ReLU (default)</td></tr>
+ 2587  </table>
  2588  `,
  2589  applications: `
  2590  <div class="info-box">

  2596  • Data Augmentation for images
  2597  </div>
  2598  </div>
+ 2599
+ 2600  <h3>Dropout vs Batch Normalization</h3>
+ 2601  <table>
+ 2602  <tr><th>Feature</th><th>Dropout</th><th>Batch Normalization</th></tr>
+ 2603  <tr><td>Purpose</td><td>Regularization</td><td>Faster training + mild regularization</td></tr>
+ 2604  <tr><td>Mechanism</td><td>Randomly drops neurons</td><td>Normalizes layer inputs</td></tr>
+ 2605  <tr><td>Training vs Test</td><td>Different behavior</td><td>Different behavior</td></tr>
+ 2606  <tr><td>Combined?</td><td colspan="2">Yes, use BatchNorm <em>before</em> Dropout</td></tr>
+ 2607  </table>
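The "Training vs Test" row can be made concrete with inverted dropout, the standard formulation (a sketch; this code is not part of the diff):

```python
import random

def dropout(activations, p=0.5, training=True):
    # Inverted dropout: at train time, drop each unit with probability p
    # and scale survivors by 1/(1-p) so expected activations match test time.
    if not training or p == 0.0:
        return list(activations)   # test time: identity, no rescaling needed
    keep = 1.0 - p
    return [a / keep if random.random() < keep else 0.0 for a in activations]
```

Because of the 1/(1-p) scaling during training, nothing special is done at inference, which is why the two modes behave differently yet produce matching expectations.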
+ 2608
+ 2609  <div class="callout tip">
+ 2610  <div class="callout-title">🎤 Probable Interview Questions</div>
+ 2611  1. Why can't we initialize all weights to zero?<br>
+ 2612  2. Difference between Xavier and He initialization?<br>
+ 2613  3. What is the vanishing gradient problem?<br>
+ 2614  4. How does Dropout prevent overfitting?<br>
+ 2615  5. Can we use Dropout at test time?<br>
+ 2616  6. Why is He initialization used with ReLU?<br>
+ 2617  7. What happens if weights are too large initially?<br>
+ 2618  8. Does Batch Normalization eliminate the need for Dropout?<br>
+ 2619  9. L1 vs L2 regularization — when to use each?<br>
+ 2620  10. What is the exploding gradient problem and how to fix it?
+ 2621  </div>
  2622  `,
  2623  math: `
  2624  <h3>L2 Regularization (Weight Decay)</h3>

  2686  (where n = number of neurons that can be dropped)<br><br>
  2687  Each forward pass is a different architecture!
  2688  </div>
+ 2689
+ 2690  <h3>Weight Initialization Mathematics</h3>
+ 2691
+ 2692  <h4>Xavier Initialization (for Sigmoid/Tanh)</h4>
+ 2693  <div class="formula">
+ 2694  w ~ N(0, σ²) where σ² = 2 / (n_in + n_out)<br><br>
+ 2695  Goal: Keep Var(output) ≈ Var(input) across layers
+ 2696  </div>
+ 2697
+ 2698  <h4>He Initialization (for ReLU)</h4>
+ 2699  <div class="formula">
+ 2700  w ~ N(0, σ²) where σ² = 2 / n_in<br><br>
+ 2701  ReLU zeros out ~50% of activations, so variance is halved → multiply by 2 to compensate!
+ 2702  </div>
+ 2703
+ 2704  <div class="callout insight">
+ 2705  <div class="callout-title">📝 Paper & Pain: Why Zero Init Fails</div>
+ 2706  If all weights = 0, every neuron computes the <strong>same output</strong>.<br>
+ 2707  All gradients are <strong>identical</strong> → All weights update the same way.<br>
+ 2708  Result: All neurons stay identical forever! The network is as good as <strong>1 neuron</strong>.<br><br>
+ 2709  <strong>Random Init:</strong> w ~ N(0, 0.01) works for shallow networks but gradients shrink exponentially in deep ones.<br>
+ 2710  <strong>Xavier:</strong> Calibrates variance based on layer width → stable gradients for Sigmoid/Tanh.<br>
+ 2711  <strong>He:</strong> Accounts for ReLU zeroing out negative half → default for modern networks.
+ 2712  </div>
  2713  `
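A stdlib-only sketch of the two rules above (function names are ours); sampling enough weights lets the variance target be checked empirically:

```python
import math
import random

def xavier_init(n_in, n_out):
    # Var(w) = 2 / (n_in + n_out): keeps activation variance stable for sigmoid/tanh
    std = math.sqrt(2.0 / (n_in + n_out))
    return [[random.gauss(0.0, std) for _ in range(n_out)] for _ in range(n_in)]

def he_init(n_in, n_out):
    # Var(w) = 2 / n_in: the factor of 2 compensates for ReLU zeroing ~half the activations
    std = math.sqrt(2.0 / n_in)
    return [[random.gauss(0.0, std) for _ in range(n_out)] for _ in range(n_in)]
```

For a 512 → 256 layer, He initialization targets Var(w) = 2/512 ≈ 0.0039, and the empirical variance of the sampled matrix lands very close to that.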
  2714  },
  2715  "batch-norm": {
  ...
  4396  },
  4397  "bert": {
  4398  overview: `
+ 4399  <h3>BERT: Bidirectional Encoder Representations from Transformers</h3>
+ 4400  <p><strong>Paper:</strong> "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" (Devlin et al., 2018)</p>
+ 4401  <p><strong>arXiv:</strong> <a href="https://arxiv.org/abs/1810.04805" target="_blank">1810.04805</a></p>
  4402
+ 4403  <div class="callout insight">
+ 4404  <div class="callout-title">🎯 Key Innovation</div>
+ 4405  BERT revolutionized NLP by introducing <strong>bidirectional pre-training</strong>. Unlike previous models (GPT, ELMo) that processed text left-to-right or combined shallow bidirectional representations, BERT deeply integrates left AND right context in all layers simultaneously.
+ 4406  </div>
  4407
+ 4408  <h3>Why BERT Matters</h3>
  4409  <ul>
+ 4410  <li><strong>Transfer Learning for NLP:</strong> Pre-train once on massive unlabeled data, fine-tune on many tasks</li>
+ 4411  <li><strong>State-of-the-Art Results:</strong> Set new records on 11 NLP tasks including SQuAD, GLUE, and SWAG</li>
+ 4412  <li><strong>Efficiency:</strong> Fine-tuning requires minimal task-specific architecture changes</li>
+ 4413  <li><strong>Accessibility:</strong> Google released pre-trained models publicly</li>
  4414  </ul>
  4415
+ 4416  <h3>Pre-training Corpus</h3>
+ 4417  <div class="info-box">
+ 4418  <div class="box-title">📚 Training Data</div>
+ 4419  <div class="box-content">
+ 4420  <strong>BooksCorpus:</strong> 800M words from 11,038 unpublished books<br>
+ 4421  <strong>English Wikipedia:</strong> 2,500M words (text passages only, no lists/tables/headers)<br>
+ 4422  <strong>Total:</strong> ~3.3 billion words
+ 4423  </div>
+ 4424  </div>
+ 4425
+ 4426  <h3>Pre-training Tasks</h3>
+ 4427  <div class="list-item">
+ 4428  <div class="list-num">01</div>
+ 4429  <div><strong>Masked Language Modeling (MLM):</strong> Randomly mask 15% of tokens and predict them using bidirectional context. Example: "The cat [MASK] on the mat" → predict "sat"</div>
+ 4430  </div>
+ 4431  <div class="list-item">
+ 4432  <div class="list-num">02</div>
+ 4433  <div><strong>Next Sentence Prediction (NSP):</strong> Given sentence pairs (A, B), predict if B actually follows A in the corpus. Helps with tasks like QA and NLI that require understanding sentence relationships.</div>
+ 4434  </div>
|
| 4435 |
+
|
| 4436 |
<div class="callout tip">
|
| 4437 |
+
<div class="callout-title">💡 The BERT Fine-tuning Paradigm</div>
|
| 4438 |
+
1. <strong>Pre-train</strong> BERT on BooksCorpus + Wikipedia (days/weeks on TPUs)<br>
|
| 4439 |
+
2. <strong>Download</strong> pre-trained weights from Google<br>
|
| 4440 |
+
3. <strong>Add</strong> task-specific head (1 layer for classification/QA/NER)<br>
|
| 4441 |
+
4. <strong>Fine-tune</strong> entire model on your dataset (hours on single GPU)<br>
|
| 4442 |
+
5. <strong>Achieve SOTA</strong> with as few as 3,600 labeled examples!
|
| 4443 |
</div>
|
| 4444 |
`,
concepts: `
<h3>BERT Architecture</h3>
<p>BERT uses a multi-layer bidirectional Transformer encoder based on Vaswani et al. (2017).</p>

<h3>Model Variants</h3>
<table>
<tr><th>Model</th><th>Layers (L)</th><th>Hidden Size (H)</th><th>Attention Heads (A)</th><th>Parameters</th></tr>
<tr><td>BERT<sub>BASE</sub></td><td>12</td><td>768</td><td>12</td><td>110M</td></tr>
<tr><td>BERT<sub>LARGE</sub></td><td>24</td><td>1024</td><td>16</td><td>340M</td></tr>
</table>
<p><em>Note: BERT<sub>BASE</sub> was designed to match GPT's size for comparison.</em></p>

<h3>Input Representation</h3>
<p>BERT's input embedding is the sum of three components:</p>

<div class="list-item">
<div class="list-num">01</div>
<div><strong>Token Embeddings:</strong> WordPiece tokenization with a 30,000-token vocabulary. Handles unknown words by splitting them into subwords (e.g., "playing" → "play" + "##ing")</div>
</div>
<div class="list-item">
<div class="list-num">02</div>
<div><strong>Segment Embeddings:</strong> Learned embedding to distinguish sentence A from sentence B (E<sub>A</sub> or E<sub>B</sub>)</div>
</div>
<div class="list-item">
<div class="list-num">03</div>
<div><strong>Position Embeddings:</strong> Learned positional encodings (unlike the Transformer's sinusoidal ones), supporting sequences up to 512 tokens</div>
</div>
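The subword splitting in item 01 can be approximated by greedy longest-match-first lookup. A toy sketch with a hypothetical five-entry vocabulary (the real WordPiece vocabulary has 30,000 entries):

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first subword split; '##' marks word continuations."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        cur = None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece     # continuation pieces get the ## prefix
            if piece in vocab:
                cur = piece
                break
            end -= 1                     # shrink the candidate until it matches
        if cur is None:
            return ["[UNK]"]             # no piece matched: unknown token
        tokens.append(cur)
        start = end
    return tokens

vocab = {"play", "##ing", "##ed", "dog", "cute"}
```

With this vocabulary, "playing" splits into ["play", "##ing"], exactly as in the example above.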
<div class="formula">
Input = Token_Embedding + Segment_Embedding + Position_Embedding
</div>
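The sum above is three table lookups added elementwise. A sketch with toy dimensions (real BERT<sub>BASE</sub>: vocab 30,000, hidden 768, max length 512; the token ids are made up):

```python
import numpy as np

# Toy dimensions so the example runs instantly
vocab_size, n_segments, max_len, hidden = 100, 2, 16, 8
rng = np.random.default_rng(0)

tok_emb = rng.normal(scale=0.02, size=(vocab_size, hidden))
seg_emb = rng.normal(scale=0.02, size=(n_segments, hidden))
pos_emb = rng.normal(scale=0.02, size=(max_len, hidden))

token_ids   = np.array([1, 42, 7, 2])    # hypothetical ids for [CLS] my dog [SEP]
segment_ids = np.array([0, 0, 0, 0])     # all sentence A
positions   = np.arange(len(token_ids))  # 0, 1, 2, 3

# One embedding vector per input position: the sum of the three lookups
x = tok_emb[token_ids] + seg_emb[segment_ids] + pos_emb[positions]
```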
<h3>Special Tokens</h3>
<div class="info-box">
<div class="box-title">🏷️ Special Token Usage</div>
<div class="box-content">
<strong>[CLS]:</strong> Prepended to every input. Final hidden state used for classification tasks<br>
<strong>[SEP]:</strong> Separates sentence pairs and marks sequence end<br>
<strong>[MASK]:</strong> Replaces masked tokens during pre-training (not used during fine-tuning)
</div>
</div>

<h4>Example Input Format</h4>
<div class="formula">
[CLS] My dog is cute [SEP] He likes playing [SEP]<br>
<br>
Tokens: [CLS] My dog is cute [SEP] He likes play ##ing [SEP]<br>
Segments: E_A E_A E_A E_A E_A E_A E_B E_B E_B E_B E_B<br>
Positions: 0 1 2 3 4 5 6 7 8 9 10
</div>
<h3>Fine-tuning for Different Tasks</h3>
<table>
<tr><th>Task Type</th><th>Input Format</th><th>Output</th></tr>
<tr><td>Classification</td><td>[CLS] text [SEP]</td><td>[CLS] representation → classifier</td></tr>
<tr><td>Sentence Pair</td><td>[CLS] sent A [SEP] sent B [SEP]</td><td>[CLS] representation → classifier</td></tr>
<tr><td>Question Answering</td><td>[CLS] question [SEP] passage [SEP]</td><td>Start/End span vectors over passage tokens</td></tr>
<tr><td>Token Classification</td><td>[CLS] text [SEP]</td><td>Each token representation → label</td></tr>
</table>
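For the classification rows above, the task head is just the [CLS] vector passed through a tanh pooler and a softmax. A minimal pure-Python sketch with toy dimensions (names are illustrative, not the library's API):

```python
import math

def cls_classify(h_cls, W_pool, W_out):
    """[CLS] hidden state -> tanh pooler -> softmax over class logits."""
    pooled = [math.tanh(sum(w * h for w, h in zip(row, h_cls))) for row in W_pool]
    logits = [sum(w * p for w, p in zip(row, pooled)) for row in W_out]
    m = max(logits)                        # subtract max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]       # class probabilities
```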
`,
math: `
<h3>Pre-training Objective</h3>
<p>BERT simultaneously optimizes two unsupervised tasks:</p>

<div class="formula">
L = L<sub>MLM</sub> + L<sub>NSP</sub>
</div>
<h3>Masked Language Modeling (MLM)</h3>

<div class="callout insight">
<div class="callout-title">📝 Paper & Pain: The Masking Strategy</div>
<strong>Problem:</strong> Standard left-to-right language modeling can't capture bidirectional context.<br>
<strong>Solution:</strong> Randomly mask 15% of tokens and predict them using full context.<br><br>
<strong>However:</strong> The [MASK] token doesn't appear during fine-tuning!<br>
<strong>Clever Fix:</strong> Of the 15% selected tokens:<br>
• 80% → Replace with [MASK]<br>
• 10% → Replace with a random token<br>
• 10% → Keep unchanged<br><br>
This forces the model to maintain context representations for ALL tokens!
</div>
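The 80/10/10 rule can be sketched directly (an illustrative helper, not the official implementation; special tokens are never masked):

```python
import random

def mask_tokens(tokens, vocab, p_select=0.15, seed=0):
    """BERT-style masking: of selected tokens, 80% [MASK], 10% random, 10% kept."""
    rng = random.Random(seed)
    out, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if tok in ("[CLS]", "[SEP]") or rng.random() >= p_select:
            continue                      # skip special tokens and the unselected 85%
        targets[i] = tok                  # the model must predict the original token
        r = rng.random()
        if r < 0.8:
            out[i] = "[MASK]"             # 80%: replace with [MASK]
        elif r < 0.9:
            out[i] = rng.choice(vocab)    # 10%: replace with a random token
        # else 10%: leave the token unchanged
    return out, targets
```

Note the loss is computed only at the positions in `targets`, never over the full sequence.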
<h4>MLM Loss Derivation</h4>
<p>Let's work through the MLM objective step by step:</p>

<div class="formula">
Given input sequence: x = [x₁, x₂, ..., x_n]<br>
Masked sequence: x̃ = [x̃₁, x̃₂, ..., x̃_n]<br>
<br>
Let M = {i₁, i₂, ..., i_m} be the indices of masked tokens<br>
<br>
For each masked position i ∈ M:<br>
h_i = BERT(x̃)_i (hidden state at position i)<br>
logits_i = W · h_i + b (W ∈ ℝ^(V×H), vocab size V)<br>
P(x_i | x̃) = softmax(logits_i)<br>
<br>
Cross-entropy loss per token:<br>
L_i = -log P(x_i | x̃)<br>
<br>
Total MLM loss:<br>
L_MLM = (1/|M|) Σ_{i∈M} L_i = -(1/|M|) Σ_{i∈M} log P(x_i | x̃)
</div>
<div class="callout warning">
<div class="callout-title">📊 Worked Example: MLM Calculation</div>
<strong>Input:</strong> "The cat sat on the mat"<br>
<strong>After masking (15%):</strong> "The [MASK] sat on the mat"<br>
<strong>Target:</strong> Predict "cat" at position 2<br><br>
<strong>Step 1:</strong> Forward pass through BERT<br>
h₂ = BERT(x̃)₂ ∈ ℝ^768 (for BERT_BASE)<br><br>
<strong>Step 2:</strong> Project to vocabulary space<br>
logits₂ = W · h₂ + b ∈ ℝ^30000<br><br>
<strong>Step 3:</strong> Compute probabilities<br>
P(w | x̃) = exp(logits₂[w]) / Σ_v exp(logits₂[v])<br><br>
<strong>Step 4:</strong> Compute loss (assume P("cat" | x̃) = 0.73)<br>
L = -log(0.73) = 0.315
</div>
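The Step 4 arithmetic can be checked in one line:

```python
import math

# Cross-entropy at the single masked position, with P("cat" | context) = 0.73
loss = -math.log(0.73)   # ≈ 0.315
```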
<h3>Next Sentence Prediction (NSP)</h3>
<p>A binary classification task for understanding sentence relationships.</p>

<div class="formula">
Input: [CLS] sentence_A [SEP] sentence_B [SEP]<br>
<br>
Let C = final hidden state of the [CLS] token ∈ ℝ^H<br>
<br>
ŷ = P(IsNext = True) = σ(W_NSP · C)<br>
where σ = sigmoid function, W_NSP ∈ ℝ^(1×H)<br>
<br>
Binary cross-entropy loss:<br>
L_NSP = -[y·log(ŷ) + (1-y)·log(1-ŷ)]<br>
where y = 1 if B follows A, else 0
</div>
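The NSP head reduces to a sigmoid plus binary cross-entropy on a single [CLS] logit. A minimal sketch (function names are illustrative):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def nsp_loss(logit, is_next):
    """Binary cross-entropy on the [CLS] logit; y = 1 when B follows A."""
    y_hat = sigmoid(logit)                 # ŷ = P(IsNext = True)
    y = 1.0 if is_next else 0.0
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))
```

A confident correct prediction (large positive logit when IsNext is true) gives a near-zero loss; a confident wrong one is heavily penalized.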
<h4>NSP Training Data Generation</h4>
<ul>
<li><strong>50% IsNext:</strong> B actually follows A in the corpus</li>
<li><strong>50% NotNext:</strong> B sampled randomly from another document</li>
</ul>
<h3>Fine-tuning Math: Question Answering (SQuAD)</h3>

<div class="formula">
Input: [CLS] question [SEP] paragraph [SEP]<br>
<br>
Let T_i = final hidden state for token i in the paragraph<br>
<br>
Start position logits: S_i = W_start · T_i<br>
End position logits: E_i = W_end · T_i<br>
<br>
P(start = i) = softmax(S)_i<br>
P(end = j) = softmax(E)_j<br>
<br>
Answer span = tokens from position i to j<br>
<br>
Training loss:<br>
L = -log P(start = i*) - log P(end = j*)<br>
where i*, j* are the ground-truth positions
</div>
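At inference time, the answer span is typically decoded by maximizing S_i + E_j over valid pairs with i ≤ j. A small sketch (the length cap is an assumption, mirroring common SQuAD decoding):

```python
def best_span(start_logits, end_logits, max_len=30):
    """Pick (i, j) with i <= j maximizing S_i + E_j, capping the span length."""
    best, best_score = (0, 0), float("-inf")
    for i, s in enumerate(start_logits):
        for j in range(i, min(i + max_len, len(end_logits))):
            score = s + end_logits[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best
```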
`,
applications: `
<h3>SQuAD Benchmark Performance</h3>

<div class="info-box">
<div class="box-title">🏆 Stanford Question Answering Dataset (SQuAD)</div>
<div class="box-content">
<strong>SQuAD 1.1:</strong> 100,000+ question-answer pairs on 500+ Wikipedia articles. Every question has an answer span in the passage.<br><br>
<strong>SQuAD 2.0:</strong> Adds 50,000+ unanswerable questions. Models must determine when no answer exists.<br><br>
<strong>Evaluation Metrics:</strong><br>
• <strong>EM (Exact Match):</strong> % of predictions matching the ground truth exactly<br>
• <strong>F1:</strong> Token-level overlap between prediction and ground truth
</div>
</div>
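The token-level F1 metric can be sketched in a few lines (plain whitespace tokenization here; the official evaluation script additionally lowercases and strips articles and punctuation):

```python
from collections import Counter

def squad_f1(prediction: str, ground_truth: str) -> float:
    """Token-level F1 between a predicted and a gold answer string."""
    pred = prediction.lower().split()
    gold = ground_truth.lower().split()
    overlap = sum((Counter(pred) & Counter(gold)).values())  # shared token count
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)
```

For example, predicting "in France" against the gold answer "France" gives precision 0.5, recall 1.0, hence F1 = 2/3.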
<h3>SQuAD 1.1 Results</h3>
<table>
<tr><th>Model</th><th>EM</th><th>F1</th></tr>
<tr><td>Human Performance</td><td>82.3</td><td>91.2</td></tr>
<tr><td>BERT<sub>BASE</sub></td><td>80.8</td><td>88.5</td></tr>
<tr><td>BERT<sub>LARGE</sub></td><td><strong>84.1</strong></td><td><strong>90.9</strong></td></tr>
</table>
<p><em>BERT<sub>LARGE</sub> surpassed human performance on EM!</em></p>

<h3>SQuAD 2.0 Results</h3>
<table>
<tr><th>Model</th><th>EM</th><th>F1</th></tr>
<tr><td>Human Performance</td><td>86.9</td><td>89.5</td></tr>
<tr><td>BERT<sub>BASE</sub></td><td>73.7</td><td>76.3</td></tr>
<tr><td>BERT<sub>LARGE</sub></td><td><strong>78.7</strong></td><td><strong>81.9</strong></td></tr>
</table>
<div class="callout tip">
<div class="callout-title">💡 Example SQuAD Question</div>
<strong>Passage:</strong> "The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France."<br><br>
<strong>Question:</strong> "In what country is Normandy located?"<br><br>
<strong>BERT Answer:</strong> "France" ✓<br>
<strong>Start Token:</strong> position 32<br>
<strong>End Token:</strong> position 32
</div>
<h3>GLUE Benchmark (General Language Understanding Evaluation)</h3>
<p>BERT set new state-of-the-art results across the GLUE benchmark tasks:</p>
<table>
<tr><th>Task</th><th>Metric</th><th>Previous SOTA</th><th>BERT<sub>LARGE</sub></th></tr>
<tr><td>MNLI (NLI)</td><td>Acc</td><td>86.6</td><td><strong>86.7</strong></td></tr>
<tr><td>QQP (Paraphrase)</td><td>F1</td><td>66.1</td><td><strong>72.1</strong></td></tr>
<tr><td>QNLI (QA/NLI)</td><td>Acc</td><td>87.4</td><td><strong>92.7</strong></td></tr>
<tr><td>SST-2 (Sentiment)</td><td>Acc</td><td>93.2</td><td><strong>94.9</strong></td></tr>
<tr><td>CoLA (Acceptability)</td><td>Matthews corr.</td><td>35.0</td><td><strong>60.5</strong></td></tr>
</table>
<h3>Additional Applications</h3>

<div class="info-box">
<div class="box-title">🔍 Google Search</div>
<div class="box-content">
In October 2019, Google began using BERT for 1 in 10 English search queries, calling it the biggest leap in 5 years. BERT helps understand search intent and context.
</div>
</div>

<div class="info-box">
<div class="box-title">🏷️ Named Entity Recognition (NER)</div>
<div class="box-content">
BERT excels at identifying entities (person, location, organization) in text by treating NER as token classification. Each token gets a label (B-PER, I-PER, B-LOC, etc.).
</div>
</div>

<div class="info-box">
<div class="box-title">📊 Text Classification</div>
<div class="box-content">
Sentiment analysis, topic classification, and spam detection all benefit from BERT's contextual understanding. Simply feed the [CLS] representation to a classifier.
</div>
</div>
<h3>Using BERT: Quick Code Example</h3>
<div class="formula">
# Using Hugging Face Transformers<br>
from transformers import BertTokenizer, BertForQuestionAnswering<br>
import torch<br>
<br>
# Load pre-trained model and tokenizer<br>
tokenizer = BertTokenizer.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')<br>
model = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')<br>
<br>
# Example<br>
question = "What is BERT?"<br>
context = "BERT is a bidirectional Transformer for NLP."<br>
<br>
# Tokenize and extract the answer span<br>
inputs = tokenizer(question, context, return_tensors='pt')<br>
outputs = model(**inputs)<br>
<br>
start_idx = torch.argmax(outputs.start_logits)<br>
end_idx = torch.argmax(outputs.end_logits)<br>
answer = tokenizer.convert_tokens_to_string(<br>
tokenizer.convert_ids_to_tokens(inputs['input_ids'][0][start_idx:end_idx+1])<br>
)<br>
print(answer)  # e.g. "a bidirectional transformer for nlp" (uncased model)
</div>
`
},