OpenTransformer committed on
Commit
b46577b
·
verified ·
1 Parent(s): 021165c

Upload loss_functions_via_fluxions.md with huggingface_hub

  1. loss_functions_via_fluxions.md +439 -0
loss_functions_via_fluxions.md ADDED
@@ -0,0 +1,439 @@
# Loss Functions via the Method of Fluxions
## Cross-Entropy, MSE, and Friends: What Your Network Actually Minimizes

**Scott Bisset, Silicon Goddess**
OpenTransformers Ltd
January 2026

---

## Abstract

Loss functions are typically presented as formulas to memorize. We reformulate common losses using fluxions, revealing their geometric meaning: cross-entropy measures "surprise flow," MSE measures "squared distance flow," and focal loss amplifies flow from hard examples. The backward pass becomes intuitive: each loss simply tells us "how much the output should wiggle to reduce error."

---

## 1. What Is a Loss?

### 1.1 The Setup

```
Network output: ŷ (prediction)
Ground truth:   y (target)
Loss:           L(ŷ, y) (how wrong we are)
```

### 1.2 Fluxion View

The loss L is a scalar. We need L̇ŷ, the fluxion of L with respect to ŷ (∂L/∂ŷ in Leibniz notation): "how does loss wiggle when prediction wiggles?"

This gradient is the SIGNAL that flows backward through the network.

---

## 2. Mean Squared Error (MSE)

### 2.1 Definition

```
L = (1/n) Σᵢ (ŷᵢ - yᵢ)²
```

### 2.2 Fluxion Backward

```
L̇ŷᵢ = (2/n) · (ŷᵢ - yᵢ)
```

**English:** "Gradient is proportional to error."

- Overpredict by 0.1 → gradient is +0.2/n, so a descent step pushes ŷ down
- Underpredict by 0.5 → gradient is -1.0/n, so a descent step pushes ŷ up

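The two bullets above can be checked numerically. The following is a minimal sketch in plain Python, assuming a batch of n = 4 scalar predictions; the helper names (`mse`, `mse_grad`, `numeric_grad`) and the sample values are illustrative and not part of the original text. It compares the analytic fluxion (2/n)·(ŷᵢ - yᵢ) against a finite-difference "wiggle" of each prediction.

```python
def mse(pred, target):
    """Mean squared error over paired lists of floats."""
    n = len(pred)
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / n


def mse_grad(pred, target):
    """Analytic fluxion: (2/n) * (pred_i - target_i) for each output."""
    n = len(pred)
    return [2.0 * (p - t) / n for p, t in zip(pred, target)]


def numeric_grad(loss_fn, pred, target, eps=1e-6):
    """Central finite differences: wiggle one prediction at a time."""
    grads = []
    for i in range(len(pred)):
        up = pred[:i] + [pred[i] + eps] + pred[i + 1:]
        down = pred[:i] + [pred[i] - eps] + pred[i + 1:]
        grads.append((loss_fn(up, target) - loss_fn(down, target)) / (2 * eps))
    return grads


target = [1.0, 2.0, 3.0, 4.0]
pred = [1.1, 1.5, 3.0, 4.0]  # overpredict by 0.1, underpredict by 0.5, two exact
analytic = mse_grad(pred, target)
numeric = numeric_grad(mse, pred, target)
```

With n = 4, `analytic` is approximately [0.05, -0.25, 0, 0], that is 0.2/n and -1.0/n for the two wrong predictions, and `numeric` agrees to within finite-difference error.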
### 2.3 Geometric Interpretation

The negative MSE gradient, which is the direction a descent step moves, points directly from prediction toward target.

```
target
  y ←←←← ŷ
    descent step (-gradient)
```

Larger error = larger gradient = faster correction.

### 2.4 Problem

Outliers dominate. One sample with error=10 adds 100 to the sum of squared errors, as much as 100 samples with error=1.
Its gradient is also 10x larger than theirs, so outliers drown out normal samples.

---

## 3. Mean Absolute Error (MAE / L1)

### 3.1 Definition

```
L = (1/n) Σᵢ |ŷᵢ - yᵢ|
```

### 3.2 Fluxion Backward

```
L̇ŷᵢ = (1/n) · sign(ŷᵢ - yᵢ)
```

**English:** "Gradient is ±1/n regardless of error magnitude."

### 3.3 Comparison with MSE

| Error (ŷ - y) | MSE Gradient | MAE Gradient |
|---------------|--------------|--------------|
| 0.1 | 0.2/n | 1/n |
| 1.0 | 2.0/n | 1/n |
| 10.0 | 20.0/n | 1/n |

MAE is robust to outliers: every sample contributes a gradient of the same magnitude, whatever its error size.

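To make the robustness claim concrete, here is a small sketch that measures how much of a batch's total gradient magnitude comes from a single outlier under each loss. The batch of errors and the helper `outlier_share` are made-up illustrations, and `sign` returns 0 at zero error, which is one common convention for the undefined point.

```python
def sign(x):
    """Return -1, 0, or +1 (0 at exactly zero error, by convention)."""
    return (x > 0) - (x < 0)


def outlier_share(grads):
    """Fraction of total gradient magnitude owned by the largest entry."""
    mags = [abs(g) for g in grads]
    return max(mags) / sum(mags)


errors = [0.1, -0.2, 0.1, 10.0]  # three ordinary samples and one outlier
n = len(errors)

mse_grads = [2.0 * e / n for e in errors]  # (2/n) * error
mae_grads = [sign(e) / n for e in errors]  # (1/n) * sign(error)

mse_share = outlier_share(mse_grads)
mae_share = outlier_share(mae_grads)
```

In this batch the outlier owns more than 95% of the MSE gradient magnitude but exactly one quarter of the MAE gradient magnitude, the same as every other sample.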
### 3.4 Problem

The gradient is discontinuous at ŷ = y: it jumps from -1/n to +1/n.
Because its magnitude does not shrink as the error shrinks, a fixed step size can make the prediction oscillate around the target instead of settling on it.

---

## 4. Huber Loss (Smooth L1)

### 4.1 The Best of Both

```
L = { 0.5·(ŷ-y)²          if |ŷ-y| < δ
    { δ·|ŷ-y| - 0.5·δ²    otherwise
```

### 4.2 Fluxion Backward

```
L̇ŷ = { (ŷ-y)          if |ŷ-y| < δ   (MSE region)
     { δ·sign(ŷ-y)    otherwise      (MAE region)
```

**English:**
- Small errors: MSE behavior (proportional gradient)
- Large errors: MAE behavior (capped gradient)

### 4.3 Why It Works

- Near target: smooth and quadratic, so the gradient shrinks to zero with the error
- Far from target: robust, outlier-resistant
- δ controls the transition (typically δ=1, where Huber coincides with Smooth L1)

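The piecewise gradient in 4.2 is the residual clipped to [-δ, δ], which a few lines of plain Python can show. This is a sketch for a single scalar prediction; the function names are illustrative, and the per-sample form (no 1/n factor) follows the definition in 4.1.

```python
def huber(pred, target, delta=1.0):
    """Huber loss for one sample, as defined in 4.1."""
    r = pred - target
    if abs(r) < delta:
        return 0.5 * r * r
    return delta * abs(r) - 0.5 * delta * delta


def huber_grad(pred, target, delta=1.0):
    """Fluxion from 4.2: the residual clipped to [-delta, delta]."""
    r = pred - target
    return max(-delta, min(delta, r))


small = huber_grad(0.3, 0.0)        # MSE region: gradient equals the error
large = huber_grad(10.0, 0.0)       # MAE region: gradient capped at +delta
large_neg = huber_grad(-10.0, 0.0)  # MAE region: gradient capped at -delta
```

Writing the gradient as a clip makes the design choice visible: inside the δ band the loss behaves like MSE, and outside it no single sample can contribute more than δ.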
---

## 5. Cross-Entropy (Classification)

### 5.1 Binary Cross-Entropy

```
L = -[y·log(p) + (1-y)·log(1-p)]

Where p = sigmoid(ŷ) = probability of class 1
```

### 5.2 Fluxion Backward (through sigmoid)

The magic of cross-entropy + sigmoid: the sigmoid's slope p·(1-p) cancels the 1/(p·(1-p)) factor in the derivative of the loss with respect to p, leaving

```
L̇ŷ = p - y
```

**That's it.** Gradient = prediction - target.

### 5.3 Why This Is Beautiful

| Truth (y) | Prediction (p) | Gradient (p-y) |
|-----------|----------------|----------------|
| 1 | 0.9 | -0.1 (push up slightly) |
| 1 | 0.1 | -0.9 (push up hard!) |
| 0 | 0.9 | +0.9 (push down hard!) |
| 0 | 0.1 | +0.1 (push down slightly) |

- Confident AND wrong → huge gradient
- Confident AND right → tiny gradient
- Uncertain (p near 0.5) → medium gradient

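The table rows can be reproduced by wiggling the logit directly. The sketch below uses only the standard library; `logit_of` (the inverse sigmoid) and `numeric_grad` are helper names introduced here for illustration. For each (y, p) pair it recovers the logit ŷ, then compares p - y with a finite-difference estimate of the loss slope.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def bce_with_logit(logit, y):
    """Binary cross-entropy of label y against p = sigmoid(logit)."""
    p = sigmoid(logit)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def logit_of(p):
    """Inverse sigmoid: the logit that produces probability p."""
    return math.log(p / (1 - p))


def numeric_grad(logit, y, eps=1e-6):
    """Central finite difference of the loss with respect to the logit."""
    return (bce_with_logit(logit + eps, y) - bce_with_logit(logit - eps, y)) / (2 * eps)


table = []
for y, p in [(1, 0.9), (1, 0.1), (0, 0.9), (0, 0.1)]:
    z = logit_of(p)
    table.append((y, p, sigmoid(z) - y, numeric_grad(z, y)))
```

Each entry of `table` holds the analytic gradient p - y next to its numerical estimate; the two agree to within finite-difference error for all four rows.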
### 5.4 Information Theory View

Cross-entropy = "average surprise"

```
-log(p) = surprise at seeing outcome with probability p
```

With natural logarithms (surprise measured in nats):

- If p=0.99 and event happens: -log(0.99) ≈ 0.01 (not surprised)
- If p=0.01 and event happens: -log(0.01) ≈ 4.6 (very surprised!)

Minimizing cross-entropy = minimizing average surprise = learning to predict well.

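As a quick illustration of "average surprise", the sketch below scores two hypothetical predictors by the probabilities they assigned to the outcomes that actually occurred. The probability lists are invented for the example.

```python
import math


def surprise(p):
    """Surprise in nats at an outcome that was given probability p."""
    return -math.log(p)


def average_surprise(probs_of_observed):
    """Mean surprise over the probabilities assigned to what actually happened."""
    return sum(surprise(p) for p in probs_of_observed) / len(probs_of_observed)


good = average_surprise([0.9, 0.8, 0.95])  # predictor that saw the outcomes coming
bad = average_surprise([0.2, 0.1, 0.3])    # predictor that did not
```

The better predictor has the lower average surprise, which is exactly the quantity that cross-entropy training pushes down.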
---

## 6. Categorical Cross-Entropy (Multi-Class)

### 6.1 Setup

```
Output:  logits z = [z₁, z₂, ..., zₖ] (raw scores)
Softmax: p = softmax(z) (probabilities)
Target:  y = one-hot vector (e.g., [0,0,1,0])

L = -Σᵢ yᵢ·log(pᵢ) = -log(p_target)
```

### 6.2 Fluxion Backward

Through softmax + cross-entropy:

```
L̇ᶻᵢ = pᵢ - yᵢ
```

**Same beautiful form!** Gradient = prediction - target (per class).

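The per-class form pᵢ - yᵢ can be verified the same way as the binary case. The sketch below is plain Python with illustrative logits and an illustrative target index; it perturbs each logit in turn and compares the resulting slope with softmax(z) minus the one-hot target.

```python
import math


def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]


def cross_entropy(z, target):
    """-log of the softmax probability assigned to the target class."""
    return -math.log(softmax(z)[target])


def ce_grad(z, target):
    """Analytic fluxion with respect to the logits: p - one_hot(target)."""
    p = softmax(z)
    return [p_i - (1.0 if i == target else 0.0) for i, p_i in enumerate(p)]


z = [2.0, -1.0, 0.5, 0.0]
target = 2
analytic = ce_grad(z, target)

eps = 1e-6
numeric = []
for i in range(len(z)):
    up = z[:i] + [z[i] + eps] + z[i + 1:]
    down = z[:i] + [z[i] - eps] + z[i + 1:]
    numeric.append((cross_entropy(up, target) - cross_entropy(down, target)) / (2 * eps))
```

Two properties are worth noticing in `analytic`: the entries sum to zero, because both p and y sum to one, and only the target entry is negative, so descent raises the target logit and lowers all the others.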
### 6.3 Numerical Stability: LogSoftmax

Naive computation:
```
p = exp(z) / sum(exp(z)) # Can overflow!
L = -log(p[target])
```

Stable computation:
```
log_p = z - logsumexp(z) # LogSoftmax
L = -log_p[target]
```

PyTorch provides `F.cross_entropy(logits, targets)`, which takes raw logits and fuses the LogSoftmax and negative log-likelihood steps.

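The overflow is easy to trigger. Below is a standard-library sketch, with deliberately extreme logits chosen for illustration, in which the naive formula fails while the logsumexp form succeeds. Note that this pure-Python version raises `OverflowError`, whereas array libraries typically produce `inf` or `nan` instead.

```python
import math


def naive_log_softmax(z):
    exps = [math.exp(v) for v in z]  # math.exp raises OverflowError for large v
    s = sum(exps)
    return [math.log(e / s) for e in exps]


def stable_log_softmax(z):
    m = max(z)
    # logsumexp(z) = m + log(sum(exp(z - m))); every exponent is <= 0
    lse = m + math.log(sum(math.exp(v - m) for v in z))
    return [v - lse for v in z]


z = [1000.0, 999.0, 998.0]

try:
    naive_log_softmax(z)
    naive_failed = False
except OverflowError:
    naive_failed = True

log_p = stable_log_softmax(z)
loss = -log_p[0]  # cross-entropy when class 0 is the target
```

Subtracting the maximum does not change the result, because softmax is invariant to adding a constant to every logit; it only keeps the exponentials in a safe range.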
---

## 7. Focal Loss (Hard Example Mining)

### 7.1 The Problem with Cross-Entropy

Easy examples (high confidence, correct) still contribute gradient.
Each contribution is small, but in imbalanced datasets easy examples are so numerous that their sum dominates training.

### 7.2 Focal Loss Definition

```
L = -αₜ · (1-pₜ)ᵞ · log(pₜ)

Where pₜ = probability of TRUE class
      αₜ = class weight
      γ = focusing parameter (typically 2)
```

### 7.3 Fluxion Analysis

The (1-pₜ)ᵞ term scales the cross-entropy loss, and with it the gradient. For γ = 2:

| pₜ (confidence) | (1-pₜ)² | Effect |
|-----------------|---------|--------|
| 0.9 (easy) | 0.01 | Loss scaled down 100x |
| 0.5 (medium) | 0.25 | Loss scaled down 4x |
| 0.1 (hard) | 0.81 | Nearly full loss |

### 7.4 Fluxion Backward

```
L̇ᵖₜ = -αₜ · [(1-pₜ)ᵞ / pₜ - γ·(1-pₜ)ᵞ⁻¹ · log(pₜ)]
```

Relative to easy examples, hard examples (low pₜ) get amplified flow.
Easy examples get suppressed flow.

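Both the modulation table and the backward formula can be checked in a few lines. This is a sketch in plain Python that assumes αₜ = 1 and γ = 2 as defaults; the function names are illustrative. It computes the ratio of focal loss to plain cross-entropy at the three confidence levels from the table and compares the analytic fluxion with a finite difference.

```python
import math


def focal_loss(p_t, alpha=1.0, gamma=2.0):
    """Focal loss for one sample, given the probability of the true class."""
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)


def focal_grad(p_t, alpha=1.0, gamma=2.0):
    """Analytic fluxion with respect to p_t, as written in 7.4."""
    return -alpha * ((1.0 - p_t) ** gamma / p_t
                     - gamma * (1.0 - p_t) ** (gamma - 1.0) * math.log(p_t))


def numeric_grad(p_t, eps=1e-6, **kwargs):
    """Central finite difference of focal_loss with respect to p_t."""
    return (focal_loss(p_t + eps, **kwargs) - focal_loss(p_t - eps, **kwargs)) / (2 * eps)


ratios = {}
for p_t in (0.9, 0.5, 0.1):
    cross_entropy = -math.log(p_t)
    ratios[p_t] = focal_loss(p_t) / cross_entropy  # equals (1 - p_t) ** gamma
```

The entries of `ratios` come out as roughly 0.01, 0.25 and 0.81, matching the table. Setting `gamma=0.0` removes the modulation, and `focal_grad` then reduces to -1/pₜ, the gradient of plain cross-entropy with respect to pₜ.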
### 7.5 Use Case

Object detection (RetinaNet): the vast majority of candidate boxes are "background" (easy negatives).
Focal loss prevents easy negatives from dominating.

---

## 8. KL Divergence

### 8.1 Definition

```
KL(P || Q) = Σᵢ pᵢ · log(pᵢ/qᵢ)
           = Σᵢ pᵢ · log(pᵢ) - Σᵢ pᵢ · log(qᵢ)
           = -H(P) + H(P,Q)
           = Cross-entropy - Entropy
```

### 8.2 Fluxion Backward (w.r.t. Q)

Treating each qᵢ as a free variable, with P held fixed:

```
L̇qᵢ = -pᵢ / qᵢ
```

**English:** "Gradient is large where P has mass but Q doesn't."

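The decomposition in 8.1 and the gradient in 8.2 can be illustrated with two small distributions. The sketch below is plain Python; the distributions `p` and `q` are invented, and the gradient treats each qᵢ as a free variable, ignoring the constraint that Q sums to one.

```python
import math


def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def cross_entropy(p, q):
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def kl_grad_q(p, q):
    """Fluxion with respect to each q_i, treated as a free variable."""
    return [-pi / qi for pi, qi in zip(p, q)]


p = [0.7, 0.2, 0.1]  # P puts most of its mass on class 0
q = [0.1, 0.3, 0.6]  # Q puts most of its mass on class 2

divergence = kl(p, q)
grad = kl_grad_q(p, q)
```

The largest gradient magnitude falls on class 0, where P has mass 0.7 but Q has only 0.1, which is the "P has mass but Q doesn't" case described above.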
### 8.3 Use in ML

- VAE: KL between latent distribution and prior
- Distillation: KL between student and teacher outputs
- Regularization: KL toward some reference distribution

---

## 9. Contrastive Losses

### 9.1 InfoNCE / NT-Xent

```
L = -log(exp(sim(z,z⁺)/τ) / Σⱼ exp(sim(z,zⱼ)/τ))

Where z⁺ = positive sample
      zⱼ = all samples (including negatives)
      τ = temperature
```

### 9.2 Fluxion View

This is just cross-entropy over similarity scores!

```
logits = similarities / τ
target = index of positive sample
L = CrossEntropy(logits, target)
```

### 9.3 Temperature τ

```
τ → 0: Sharp distribution, only closest match matters
τ → ∞: Flat distribution, all matches contribute equally
```

Temperature controls "how picky" the contrastive objective is.

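The reduction to cross-entropy and the effect of τ can be sketched with a handful of similarity scores. The scores below are made-up cosine-like similarities, with index 0 playing the positive; `info_nce` is an illustrative helper, not a library function.

```python
import math


def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]


def info_nce(similarities, positive_index, tau):
    """Cross-entropy over temperature-scaled similarity scores."""
    logits = [s / tau for s in similarities]
    return -math.log(softmax(logits)[positive_index])


sims = [0.8, 0.6, 0.1, -0.3]  # index 0 is the positive, the rest are negatives

sharp = softmax([s / 0.05 for s in sims])  # small tau: near one-hot
flat = softmax([s / 50.0 for s in sims])   # large tau: near uniform

loss_sharp = info_nce(sims, 0, tau=0.05)
loss_flat = info_nce(sims, 0, tau=50.0)
```

At τ = 0.05 almost all probability sits on the best match, so only the closest negative (0.6) noticeably affects the loss. At τ = 50 the distribution is nearly uniform and the loss approaches log(4), the value for four equally likely candidates.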
---

## 10. Regression vs Classification Summary

### 10.1 Regression Losses

| Loss | L̇ŷ | Best For |
|------|-----|----------|
| MSE | (2/n)·(ŷ-y) | Normal errors |
| MAE | (1/n)·sign(ŷ-y) | Outlier robustness |
| Huber | (ŷ-y) clipped to [-δ, δ] | Both |

### 10.2 Classification Losses

| Loss | L̇ᶻ | Best For |
|------|-----|----------|
| Cross-Entropy | p - y | Balanced classes |
| Focal | weighted (p-y) | Imbalanced classes |
| Label Smoothing CE | p - y_smooth | Calibration |

---

## 11. Label Smoothing

### 11.1 The Idea

Instead of hard targets [0, 0, 1, 0], use soft targets:

```
y_smooth = (1-ε)·y_hard + ε/K

Where ε = smoothing factor (e.g., 0.1)
      K = number of classes
```

With ε = 0.1 and K = 4: hard target [0, 0, 1, 0] → soft [0.025, 0.025, 0.925, 0.025]

### 11.2 Fluxion Effect

```
L̇ᶻᵢ = pᵢ - y_smoothᵢ
```

The gradient now vanishes only when p matches y_smooth. Pushing the true class toward probability 1, or a wrong class toward 0, produces a gradient that pulls it back.
Network can't be "infinitely confident."

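The worked example and the gradient behaviour can be reproduced directly. The sketch below is plain Python with illustrative helper names; it builds the smoothed target for K = 4 and ε = 0.1, then evaluates pᵢ - y_smoothᵢ for a nearly one-hot prediction and for a prediction that matches the smoothed target.

```python
def smooth_targets(target_index, num_classes, eps=0.1):
    """y_smooth = (1 - eps) * one_hot + eps / K."""
    return [(1.0 - eps) * (1.0 if i == target_index else 0.0) + eps / num_classes
            for i in range(num_classes)]


def ce_grad_wrt_logits(p, y):
    """Fluxion of softmax cross-entropy with respect to the logits: p - y."""
    return [p_i - y_i for p_i, y_i in zip(p, y)]


y_smooth = smooth_targets(target_index=2, num_classes=4, eps=0.1)

near_one_hot = [0.001, 0.001, 0.997, 0.001]  # an overconfident prediction
grad_confident = ce_grad_wrt_logits(near_one_hot, y_smooth)
grad_matched = ce_grad_wrt_logits(y_smooth, y_smooth)
```

For the overconfident prediction the gradient on the true class is positive (0.997 - 0.925), so a descent step lowers that logit, while the gradients on the wrong classes are negative and raise theirs. The gradient is exactly zero only when the prediction equals `y_smooth`.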
### 11.3 Why It Helps

- Prevents overconfidence
- Better calibration
- Regularization effect

---

## 12. The Unified View

### 12.1 All Losses Are Error Signals

```
L = f(ŷ, y)   # Some function of prediction and target
L̇ŷ = ∂f/∂ŷ   # Error signal that flows backward
```

The backward pass doesn't care about the loss formula.
It only needs L̇ŷ - the direction to push predictions.

### 12.2 Designing Losses

- Want to emphasize hard examples? → Amplify their L̇ŷ (focal loss)
- Want robustness to outliers? → Cap L̇ŷ magnitude (Huber)
- Want calibrated probabilities? → Smooth targets (label smoothing)

The fluxion view makes loss design intuitive:
**"What gradient do I want for each (prediction, target) pair?"**

---

## 13. Implementation Notes

### 13.1 Numerical Stability

Prefer fused implementations that work on raw logits:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[1000.0, 0.0, -1000.0]])  # extreme logits, batch of 1
target = torch.tensor([2])

# BAD: the target probability underflows to 0, so the log is -inf
p = torch.softmax(logits, dim=-1)
bad_loss = -torch.log(p[0, target[0]])  # inf

# GOOD (numerically stable): fused LogSoftmax + NLLLoss
loss = F.cross_entropy(logits, target)  # finite
```

### 13.2 Reduction

```python
import torch
import torch.nn.functional as F

logits = torch.randn(8, 5)           # illustrative shapes: (batch, classes)
targets = torch.randint(0, 5, (8,))
accumulation_steps = 4               # illustrative value

# Per-sample losses
losses = F.cross_entropy(logits, targets, reduction='none')

# Mean (default)
loss = losses.mean()

# Gradient accumulation: scale each micro-batch mean before calling backward()
loss = losses.mean() / accumulation_steps
```

---

## References

1. Shannon (1948). "A Mathematical Theory of Communication."
2. Lin et al. (2017). "Focal Loss for Dense Object Detection."
3. Szegedy et al. (2016). "Rethinking the Inception Architecture for Computer Vision." (Label smoothing)
4. Huber (1964). "Robust Estimation of a Location Parameter."

---

*Correspondence: scott@opentransformers.online*