OpenTransformer
/

SciPapers


OpenTransformer committed on Jan 26

Commit

b46577b

verified ·

1 Parent(s): 021165c

Upload loss_functions_via_fluxions.md with huggingface_hub

Files changed (1)

loss_functions_via_fluxions.md +439 -0

loss_functions_via_fluxions.md ADDED

	@@ -0,0 +1,439 @@

+# Loss Functions via the Method of Fluxions
+## Cross-Entropy, MSE, and Friends: What Your Network Actually Minimizes
+**Scott Bisset, Silicon Goddess**
+OpenTransformers Ltd
+January 2026
+---
+## Abstract
+Loss functions are typically presented as formulas to memorize. We reformulate common losses using fluxions, revealing their geometric meaning: cross-entropy measures "surprise flow," MSE measures "squared distance flow," and focal loss amplifies flow from hard examples. The backward pass becomes intuitive: each loss simply tells us "how much the output should wiggle to reduce error."
+---
+## 1. What Is a Loss?
+### 1.1 The Setup
+```
+Network output: ŷ (prediction)
+Ground truth: y (target)
+Loss: L(ŷ, y) (how wrong we are)
+```
+### 1.2 Fluxion View
+The loss L is a scalar. We need L̇ŷ, the derivative ∂L/∂ŷ: "how does the loss wiggle when the prediction wiggles?"
+This gradient is the SIGNAL that flows backward through the network.
+---
+## 2. Mean Squared Error (MSE)
+### 2.1 Definition
+```
+L = (1/n) Σᵢ (ŷᵢ - yᵢ)²
+```
+### 2.2 Fluxion Backward
+```
+L̇ŷᵢ = (2/n) · (ŷᵢ - yᵢ)
+```
+**English:** "Gradient is proportional to error."
+- Overpredict by 0.1 → gradient is +0.2/n, so a descent step pushes the prediction down
+- Underpredict by 0.5 → gradient is -1.0/n, so a descent step pushes the prediction up
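A minimal pure-Python sketch of the formula above, checked against a finite-difference "wiggle" of each prediction. The function names (`mse`, `mse_grad`, `numeric_grad`) and the sample values are illustrative assumptions, not part of any library.

```python
def mse(pred, target):
    """Mean squared error over a list of predictions."""
    n = len(pred)
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / n


def mse_grad(pred, target):
    """Analytic gradient: (2/n) * (pred_i - target_i) for each output."""
    n = len(pred)
    return [2.0 * (p - t) / n for p, t in zip(pred, target)]


def numeric_grad(loss_fn, pred, target, eps=1e-6):
    """Central finite difference: wiggle one prediction at a time."""
    grads = []
    for i in range(len(pred)):
        up = list(pred)
        down = list(pred)
        up[i] += eps
        down[i] -= eps
        grads.append((loss_fn(up, target) - loss_fn(down, target)) / (2 * eps))
    return grads


pred = [1.1, 0.5, 3.0]    # overpredict by 0.1, underpredict by 0.5, exact
target = [1.0, 1.0, 3.0]
analytic = mse_grad(pred, target)
numeric = numeric_grad(mse, pred, target)
```

With n = 3 the analytic gradient is roughly `[+0.067, -0.333, 0.0]`: positive where the prediction is too high, negative where it is too low, and zero where it is exact.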
+### 2.3 Geometric Interpretation
+The negative MSE gradient (the descent direction) points directly from prediction toward target.
+```
+     target
+       ↓
+   y ←←←← ŷ
+   descent step
+```
+Larger error = larger gradient = faster correction.
+### 2.4 Problem
+Outliers dominate. One sample with error=10 contributes 100 to the sum of squares, as much as 100 samples with error=1.
+Its gradient is also 10x larger, so outliers can drown out normal samples.
+---
+## 3. Mean Absolute Error (MAE / L1)
+### 3.1 Definition
+```
+L = (1/n) Σᵢ |ŷᵢ - yᵢ|
+```
+### 3.2 Fluxion Backward
+```
+L̇ŷᵢ = (1/n) · sign(ŷᵢ - yᵢ)
+```
+**English:** "Gradient is ±1/n regardless of error magnitude."
+### 3.3 Comparison with MSE
+| Error | MSE Gradient | MAE Gradient |
+|-------|--------------|--------------|
+| 0.1 | 0.2/n | 1/n |
+| 1.0 | 2.0/n | 1/n |
+| 10.0 | 20.0/n | 1/n |
+MAE is robust to outliers - constant gradient regardless of error size.
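The comparison table can be reproduced with a short sketch. The helper names are hypothetical, and returning 0 for `sign(0)` is an assumed convention for the undefined point at zero error.

```python
def sign(x):
    """Return -1, 0, or +1. The value at 0 is a convention (a subgradient)."""
    return (x > 0) - (x < 0)


def mse_grad_single(error, n):
    """Per-output MSE gradient: grows linearly with the error."""
    return 2.0 * error / n


def mae_grad_single(error, n):
    """Per-output MAE gradient: constant magnitude 1/n."""
    return sign(error) / n


n = 1  # use n = 1 so the printed values match the numerators in the table
rows = [(e, mse_grad_single(e, n), mae_grad_single(e, n)) for e in (0.1, 1.0, 10.0)]
for error, g_mse, g_mae in rows:
    print(f"error={error:5.1f}  MSE grad={g_mse:5.1f}  MAE grad={g_mae:4.1f}")
```

The MSE column grows 100x from the first row to the last, while the MAE column stays fixed.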
+### 3.4 Problem
+The gradient is discontinuous at ŷ = y: it jumps between -1/n and +1/n.
+Because it never shrinks as the error shrinks, a fixed step size can overshoot and oscillate around the target.
+---
+## 4. Huber Loss (Smooth L1)
+### 4.1 The Best of Both
+```
+L = { 0.5·(ŷ-y)²        if |ŷ-y| < δ
+    { δ·|ŷ-y| - 0.5·δ²  otherwise
+```
+### 4.2 Fluxion Backward
+```
+L̇ŷ = { (ŷ-y)           if |ŷ-y| < δ    (MSE region)
+      { δ·sign(ŷ-y)     otherwise       (MAE region)
+```
+**English:**
+- Small errors: MSE behavior (proportional gradient)
+- Large errors: MAE behavior (capped gradient)
+### 4.3 Why It Works
+- Near target: smooth quadratic bowl, so the gradient shrinks to zero with the error
+- Far from target: robust, outlier-resistant
+- δ controls the transition (typically δ=1, where Huber coincides with Smooth L1)
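A small sketch of the two-region definition, assuming a single scalar error `ŷ - y`. The names `huber` and `huber_grad` are illustrative. Note that the piecewise gradient is the same thing as clipping the error to `[-δ, δ]`.

```python
def huber(error, delta=1.0):
    """Huber loss for one scalar error (prediction minus target)."""
    if abs(error) < delta:
        return 0.5 * error ** 2                      # MSE region
    return delta * abs(error) - 0.5 * delta ** 2     # MAE region


def huber_grad(error, delta=1.0):
    """Gradient w.r.t. the prediction: the error clipped to [-delta, delta]."""
    return max(-delta, min(delta, error))


# Small error passes through unchanged, large errors are capped at delta.
grads = [huber_grad(e) for e in (0.1, 1.0, 10.0)]
```

Here `grads` is `[0.1, 1.0, 1.0]`: proportional inside the quadratic region and capped outside it.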
+---
+## 5. Cross-Entropy (Classification)
+### 5.1 Binary Cross-Entropy
+```
+L = -[y·log(p) + (1-y)·log(1-p)]
+Where p = sigmoid(ŷ) = probability of class 1
+```
+### 5.2 Fluxion Backward (through sigmoid)
+The magic of cross-entropy + sigmoid:
+```
+L̇ŷ = p - y
+```
+**That's it.** Gradient = prediction - target.
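Why the result is so simple: by the chain rule, ∂L/∂p = (p - y) / (p·(1 - p)) and ∂p/∂ŷ = p·(1 - p), so the product collapses to p - y. The sketch below checks this numerically. It is pure Python with illustrative function names, and treats ŷ as a single logit `z`.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bce_from_logit(z, y):
    """Binary cross-entropy of target y against p = sigmoid(z)."""
    p = sigmoid(z)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def bce_grad_logit(z, y):
    """Analytic gradient w.r.t. the logit: prediction minus target."""
    return sigmoid(z) - y


def numeric_grad(z, y, eps=1e-6):
    """Central finite difference on the logit."""
    return (bce_from_logit(z + eps, y) - bce_from_logit(z - eps, y)) / (2 * eps)


# Largest disagreement between analytic and numeric gradients on a small grid.
max_gap = max(
    abs(bce_grad_logit(z, y) - numeric_grad(z, y))
    for z in (-2.0, 0.0, 1.5)
    for y in (0, 1)
)
```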
+### 5.3 Why This Is Beautiful
+| Truth (y) | Prediction (p) | Gradient (p-y) |
+|-----------|----------------|----------------|
+| 1 | 0.9 | -0.1 (push up slightly) |
+| 1 | 0.1 | -0.9 (push up hard!) |
+| 0 | 0.9 | +0.9 (push down hard!) |
+| 0 | 0.1 | +0.1 (push down slightly) |
+- Confident AND wrong → huge gradient
+- Confident AND right → tiny gradient
+- Uncertain → medium gradient
+### 5.4 Information Theory View
+Cross-entropy = "average surprise"
+```
+-log(p) = surprise at seeing outcome with probability p
+```
+With the natural log (units of nats):
+- If p=0.99 and the event happens: -log(0.99) ≈ 0.01 (not surprised)
+- If p=0.01 and the event happens: -log(0.01) ≈ 4.6 (very surprised!)
+Minimizing cross-entropy = minimizing average surprise = learning to predict well.
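The "average surprise" reading can be sketched directly. The names `surprise` and `cross_entropy` are illustrative, and natural logarithms are assumed throughout.

```python
import math


def surprise(p):
    """Surprise (in nats) at seeing an outcome that was given probability p."""
    return -math.log(p)


def cross_entropy(true_dist, pred_dist):
    """Average surprise of pred_dist when outcomes follow true_dist."""
    return sum(t * surprise(q) for t, q in zip(true_dist, pred_dist) if t > 0)


low = surprise(0.99)    # expected outcome, little surprise
high = surprise(0.01)   # unexpected outcome, large surprise

# A fair coin predicted correctly versus predicted as heavily biased.
good_prediction = cross_entropy([0.5, 0.5], [0.5, 0.5])
bad_prediction = cross_entropy([0.5, 0.5], [0.9, 0.1])
```

The well-matched prediction reaches the minimum possible average surprise for a fair coin, log 2, while the biased prediction scores higher.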
+---
+## 6. Categorical Cross-Entropy (Multi-Class)
+### 6.1 Setup
+```
+Output: logits z = [z₁, z₂, ..., zₖ] (raw scores)
+Softmax: p = softmax(z)             (probabilities)
+Target: y = one-hot vector          (e.g., [0,0,1,0])
+L = -Σᵢ yᵢ·log(pᵢ) = -log(p_target)
+```
+### 6.2 Fluxion Backward
+Through softmax + cross-entropy:
+```
+L̇ᶻᵢ = pᵢ - yᵢ
+```
+**Same beautiful form!** Gradient = prediction - target (per class).
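A pure-Python sketch that checks the `pᵢ - yᵢ` form against finite differences on the logits. The function names and the example logits are assumptions made for illustration.

```python
import math


def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(z, target_index):
    """Loss for one sample: -log of the probability of the true class."""
    return -math.log(softmax(z)[target_index])


def grad_logits(z, target_index):
    """Analytic gradient: softmax probabilities minus the one-hot target."""
    return [p - (1.0 if i == target_index else 0.0)
            for i, p in enumerate(softmax(z))]


def numeric_grad(z, target_index, eps=1e-6):
    grads = []
    for i in range(len(z)):
        up = list(z)
        down = list(z)
        up[i] += eps
        down[i] -= eps
        grads.append((cross_entropy(up, target_index)
                      - cross_entropy(down, target_index)) / (2 * eps))
    return grads


logits = [2.0, 1.0, 0.1]
analytic = grad_logits(logits, target_index=0)
numeric = numeric_grad(logits, target_index=0)
```

The true-class entry is negative (raise that logit), the others are positive (lower them), and the entries sum to zero because both `p` and `y` sum to one.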
+### 6.3 Numerical Stability: LogSoftmax
+Naive computation:
+```
+p = exp(z) / sum(exp(z))    # Can overflow!
+L = -log(p[target])
+```
+Stable computation:
+```
+log_p = z - logsumexp(z)    # LogSoftmax
+L = -log_p[target]
+```
+PyTorch provides `F.cross_entropy(logits, targets)` which fuses this.
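To see the failure concretely without any framework, here is a pure-Python sketch of both computations. The function names are illustrative. `math.exp` raises `OverflowError` for large arguments, which stands in for the `inf`/`nan` a tensor library would produce.

```python
import math


def logsumexp(z):
    """log(sum(exp(z))) computed after shifting by the max logit."""
    m = max(z)
    return m + math.log(sum(math.exp(v - m) for v in z))


def log_softmax(z):
    lse = logsumexp(z)
    return [v - lse for v in z]


def naive_cross_entropy(z, target_index):
    exps = [math.exp(v) for v in z]  # overflows for large logits
    return -math.log(exps[target_index] / sum(exps))


def stable_cross_entropy(z, target_index):
    return -log_softmax(z)[target_index]


logits = [1000.0, 999.0, 0.0]
try:
    naive_loss = naive_cross_entropy(logits, 0)
except OverflowError:
    naive_loss = None  # the naive path cannot even be evaluated
stable_loss = stable_cross_entropy(logits, 0)
```

The stable path returns about 0.313 for these logits, and on small logits the two functions agree.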
+---
+## 7. Focal Loss (Hard Example Mining)
+### 7.1 The Problem with Cross-Entropy
+Easy examples (high confidence, correct) still contribute a small gradient each.
+In imbalanced datasets there are so many of them that their summed gradient dominates training.
+### 7.2 Focal Loss Definition
+```
+L = -αₜ · (1-pₜ)ᵞ · log(pₜ)
+Where pₜ = probability of TRUE class
+      αₜ = weight for the true class
+      γ = focusing parameter (typically 2)
+```
+### 7.3 Fluxion Analysis
+The (1-pₜ)ᵞ factor scales each example's loss, and therefore its gradient, by how wrong the prediction is. For γ = 2:
+| pₜ (confidence) | (1-pₜ)² | Effect on loss |
+|-----------------|---------|--------|
+| 0.9 (easy) | 0.01 | Reduced 100x |
+| 0.5 (medium) | 0.25 | Reduced 4x |
+| 0.1 (hard) | 0.81 | Nearly full loss |
+### 7.4 Fluxion Backward
+```
+L̇ᵖₜ = -αₜ · [(1-pₜ)ᵞ / pₜ - γ·(1-pₜ)ᵞ⁻¹ · log(pₜ)]
+```
+Relative to easy examples, hard examples (low pₜ) get amplified flow.
+Easy examples get suppressed flow.
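The definition and its derivative with respect to pₜ can be checked with a short sketch. The function names and default values (`alpha_t=1.0`, `gamma=2.0`) are illustrative assumptions.

```python
import math


def focal_loss(p_t, alpha_t=1.0, gamma=2.0):
    """Focal loss for one example, given the probability of its true class."""
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def focal_grad_p(p_t, alpha_t=1.0, gamma=2.0):
    """Analytic derivative of the focal loss with respect to p_t."""
    return -alpha_t * ((1.0 - p_t) ** gamma / p_t
                       - gamma * (1.0 - p_t) ** (gamma - 1.0) * math.log(p_t))


def numeric_grad_p(p_t, alpha_t=1.0, gamma=2.0, eps=1e-6):
    return (focal_loss(p_t + eps, alpha_t, gamma)
            - focal_loss(p_t - eps, alpha_t, gamma)) / (2 * eps)


def modulating_factor(p_t, gamma=2.0):
    """The (1 - p_t)^gamma factor that down-weights easy examples."""
    return (1.0 - p_t) ** gamma


factors = {p: modulating_factor(p) for p in (0.9, 0.5, 0.1)}
hard_flow = abs(focal_grad_p(0.1))
easy_flow = abs(focal_grad_p(0.9))
```

Setting `gamma=0.0` recovers plain cross-entropy, `-log(p_t)`, which is a quick sanity check on the formula.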
+### 7.5 Use Case
+Object detection (RetinaNet) - the vast majority of candidate boxes (anchors) are "background" (easy negatives).
+Focal loss prevents easy negatives from dominating.
+---
+## 8. KL Divergence
+### 8.1 Definition
+```
+KL(P || Q) = Σᵢ pᵢ · log(pᵢ/qᵢ)
+           = Σᵢ pᵢ · log(pᵢ) - Σᵢ pᵢ · log(qᵢ)
+           = -H(P) + H(P,Q)
+           = Cross-entropy - Entropy
+```
+### 8.2 Fluxion Backward (w.r.t. Q)
+```
+L̇qᵢ = -pᵢ / qᵢ
+```
+**English:** "Gradient is large where P has mass but Q doesn't."
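A sketch of the decomposition and the gradient, in pure Python with illustrative names. One assumption to flag: the gradient treats each qᵢ as a free variable and ignores the constraint that Q sums to one, exactly as the formula above does.

```python
import math


def entropy(p):
    return -sum(p_i * math.log(p_i) for p_i in p if p_i > 0)


def cross_entropy(p, q):
    return -sum(p_i * math.log(q_i) for p_i, q_i in zip(p, q) if p_i > 0)


def kl_divergence(p, q):
    return sum(p_i * math.log(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0)


def kl_grad_q(p, q):
    """Analytic gradient w.r.t. each q_i (unconstrained): -p_i / q_i."""
    return [-p_i / q_i for p_i, q_i in zip(p, q)]


def numeric_grad_q(p, q, eps=1e-6):
    grads = []
    for i in range(len(q)):
        up = list(q)
        down = list(q)
        up[i] += eps
        down[i] -= eps
        grads.append((kl_divergence(p, up) - kl_divergence(p, down)) / (2 * eps))
    return grads


p = [0.7, 0.2, 0.1]   # P puts most mass on outcome 0
q = [0.1, 0.3, 0.6]   # Q puts little mass there
analytic = kl_grad_q(p, q)
numeric = numeric_grad_q(p, q)
```

The largest-magnitude entry is at index 0, where P has mass and Q does not.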
+### 8.3 Use in ML
+- VAE: KL between latent distribution and prior
+- Distillation: KL between student and teacher outputs
+- Regularization: KL toward some reference distribution
+---
+## 9. Contrastive Losses
+### 9.1 InfoNCE / NT-Xent
+```
+L = -log(exp(sim(z,z⁺)/τ) / Σⱼ exp(sim(z,zⱼ)/τ))
+Where z⁺ = positive sample
+      zⱼ = all candidates (the positive and the negatives)
+      τ = temperature
+```
+### 9.2 Fluxion View
+This is just cross-entropy over similarity scores!
+```
+logits = similarities / τ
+target = index of positive sample
+L = CrossEntropy(logits, target)
+```
+### 9.3 Temperature τ
+```
+τ → 0: Sharp distribution, only closest match matters
+τ → ∞: Flat distribution, all matches contribute equally
+```
+Temperature controls "how picky" the contrastive objective is.
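A sketch of InfoNCE as cross-entropy over scaled similarities, showing the two temperature limits. The similarity values, the function names, and the choice of index 0 as the positive are all assumptions made for illustration.

```python
import math


def log_softmax(z):
    m = max(z)
    lse = m + math.log(sum(math.exp(v - m) for v in z))
    return [v - lse for v in z]


def info_nce(similarities, positive_index, tau):
    """Cross-entropy over similarity logits, with the positive as the target."""
    logits = [s / tau for s in similarities]
    return -log_softmax(logits)[positive_index]


def match_probabilities(similarities, tau):
    """Softmax over similarities / tau: how much each candidate 'counts'."""
    return [math.exp(lp) for lp in log_softmax([s / tau for s in similarities])]


similarities = [0.9, 0.8, 0.1, -0.3]   # index 0 is the positive sample
sharp = match_probabilities(similarities, tau=0.01)
flat = match_probabilities(similarities, tau=100.0)
```

At `tau=0.01` nearly all probability lands on the closest match. At `tau=100.0` the four candidates are almost equally weighted and the loss approaches log 4.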
+---
+## 10. Regression vs Classification Summary
+### 10.1 Regression Losses
+| Loss | L̇ŷ (per sample, 1/n omitted) | Best For |
+|------|-----|----------|
+| MSE | 2(ŷ-y) | Normal errors |
+| MAE | sign(ŷ-y) | Outlier robustness |
+| Huber | clip(ŷ-y, -δ, δ) | Both |
+### 10.2 Classification Losses
+| Loss | L̇ᶻ | Best For |
+|------|-----|----------|
+| Cross-Entropy | p - y | Balanced classes |
+| Focal | weighted (p-y) | Imbalanced classes |
+| Label Smoothing CE | p - y_smooth | Calibration |
+---
+## 11. Label Smoothing
+### 11.1 The Idea
+Instead of hard targets [0, 0, 1, 0], use soft targets:
+```
+y_smooth = (1-ε)·y_hard + ε/K
+Where ε = smoothing factor (e.g., 0.1)
+      K = number of classes
+```
+With ε = 0.1 and K = 4: hard target [0, 0, 1, 0] → soft [0.025, 0.025, 0.925, 0.025]
+### 11.2 Fluxion Effect
+```
+L̇ᶻᵢ = pᵢ - y_smoothᵢ
+```
+Now the gradient vanishes only when p matches the soft targets, so pushing the true-class probability past 1-ε+ε/K is penalized.
+The network can't be "infinitely confident."
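A sketch of the smoothing formula and its effect on the gradient, in pure Python. The function names and the example logits are hypothetical.

```python
import math


def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def smooth_labels(target_index, num_classes, epsilon=0.1):
    """(1 - eps) * one_hot + eps / K for each class."""
    uniform_share = epsilon / num_classes
    return [(1.0 - epsilon) * (1.0 if i == target_index else 0.0) + uniform_share
            for i in range(num_classes)]


def grad_logits(z, soft_target):
    """Gradient of cross-entropy w.r.t. logits: p minus the soft target."""
    return [p_i - y_i for p_i, y_i in zip(softmax(z), soft_target)]


y_smooth = smooth_labels(target_index=2, num_classes=4, epsilon=0.1)

# An overconfident network: almost all probability on the true class.
overconfident = grad_logits([0.0, 0.0, 20.0, 0.0], y_smooth)

# Logits whose softmax equals the soft target: the gradient vanishes here.
balanced = grad_logits([math.log(v) for v in y_smooth], y_smooth)
```

For the overconfident logits the true-class gradient is positive, so a descent step lowers that logit and pulls the prediction back toward the soft target.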
+### 11.3 Why It Helps
+- Prevents overconfidence
+- Better calibration
+- Regularization effect
+---
+## 12. The Unified View
+### 12.1 All Losses Are Error Signals
+```
+L = f(ŷ, y)           # Some function of prediction and target
+L̇ŷ = ∂f/∂ŷ           # Error signal that flows backward
+```
+The backward pass doesn't care about the loss formula.
+It only needs L̇ŷ - the direction to push predictions.
+### 12.2 Designing Losses
+- Want to emphasize hard examples? → Amplify their L̇ŷ (focal loss)
+- Want robustness to outliers? → Cap L̇ŷ magnitude (Huber)
+- Want calibrated probabilities? → Smooth targets (label smoothing)
+The fluxion view makes loss design intuitive:
+**"What gradient do I want for each (prediction, target) pair?"**
+---
+## 13. Implementation Notes
+### 13.1 Numerical Stability
+Always use fused implementations:
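+```python
+import torch
+import torch.nn.functional as F
+
+logits = torch.tensor([[1000.0, 0.0, -1000.0]])  # example: one sample, extreme logits
+target = torch.tensor([1])                       # example: index of the true class
+# BAD (p[target] underflows to 0, so the loss becomes inf):
+p = torch.softmax(logits, dim=-1)
+loss_naive = -torch.log(p[0, target[0]])
+# GOOD (numerically stable):
+loss = F.cross_entropy(logits, target)  # Fused LogSoftmax + NLLLoss
+```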
+### 13.2 Reduction
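+```python
+import torch
+import torch.nn.functional as F
+
+logits = torch.randn(8, 4)            # example micro-batch: 8 samples, 4 classes
+targets = torch.randint(0, 4, (8,))
+accumulation_steps = 4                # example value
+# Per-sample losses
+losses = F.cross_entropy(logits, targets, reduction='none')
+# Mean (default)
+loss = losses.mean()
+# Gradient accumulation: scale each micro-batch mean so the accumulated
+# gradient equals the mean over the full effective batch
+loss = losses.mean() / accumulation_steps
+```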
+---
+## References
+1. Shannon (1948). "A Mathematical Theory of Communication."
+2. Lin et al. (2017). "Focal Loss for Dense Object Detection."
+3. Szegedy et al. (2016). "Rethinking the Inception Architecture for Computer Vision." (Label smoothing)
+4. Huber (1964). "Robust Estimation of a Location Parameter."
+---
+*Correspondence: scott@opentransformers.online*