# PlainMLP vs ResMLP: Fair Comparison on Distant Identity Task

## Executive Summary

This experiment compares a 20-layer PlainMLP against a 20-layer ResMLP on a synthetic "Distant Identity" task (Y = X), using **identical initialization** for both models to ensure a fair comparison.

**Key Finding**: With fair initialization, the PlainMLP suffers **complete gradient vanishing** (gradients at layer 1 are ~10⁻¹⁹), making it essentially untrainable. The ResMLP reaches a **5.3x lower final loss** and maintains healthy gradient flow through all layers.

---

## Experimental Setup

### Models (Identical Initialization)

| Property | PlainMLP | ResMLP |
|----------|----------|--------|
| Architecture | `x = ReLU(Linear(x))` | `x = x + ReLU(Linear(x))` |
| Layers | 20 | 20 |
| Hidden Dimension | 64 | 64 |
| Parameters | 83,200 | 83,200 |
| Weight Init | Kaiming He × 1/√20 | Kaiming He × 1/√20 |
| Bias Init | Zero | Zero |

**Critical**: Both models use **identical** weight initialization (Kaiming He scaled by 1/√num_layers = 1/√20). The only difference between them is the residual connection.
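
The report does not include the model code; the following is a minimal PyTorch sketch consistent with the table above (the class names, `dim`/`depth` arguments, and `layers` attribute are assumptions, not the report's actual code):

```python
import math
import torch
import torch.nn as nn

class PlainMLP(nn.Module):
    """x = ReLU(Linear(x)) at every layer; no shortcuts."""
    def __init__(self, dim=64, depth=20):
        super().__init__()
        self.layers = nn.ModuleList(nn.Linear(dim, dim) for _ in range(depth))
        for layer in self.layers:
            # Kaiming He init, then scale by 1/sqrt(num_layers); zero bias
            nn.init.kaiming_normal_(layer.weight, nonlinearity="relu")
            with torch.no_grad():
                layer.weight.mul_(1 / math.sqrt(depth))
            nn.init.zeros_(layer.bias)

    def forward(self, x):
        for layer in self.layers:
            x = torch.relu(layer(x))
        return x

class ResMLP(PlainMLP):
    """Same layers, same init; only the forward pass adds the shortcut."""
    def forward(self, x):
        for layer in self.layers:
            x = x + torch.relu(layer(x))
        return x
```

Each `Linear(64, 64)` holds 64 × 64 + 64 = 4,160 parameters, so 20 layers give the 83,200 parameters reported in the table.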

### Training Configuration

- **Task**: Learn the identity mapping Y = X
- **Data**: 1024 vectors of dimension 64, sampled from U(-1, 1)
- **Optimizer**: Adam (lr = 1e-3)
- **Batch Size**: 64
- **Training Steps**: 500
- **Loss**: MSE
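
A minimal sketch of this training setup (the `train` helper and in-memory dataset layout are assumptions; the report's actual script is `experiment_fair.py`):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(42)                      # seed used in the report
X = torch.empty(1024, 64).uniform_(-1, 1)  # 1024 vectors from U(-1, 1)
Y = X.clone()                              # identity target: Y = X

def train(model, steps=500, batch_size=64, lr=1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        idx = torch.randint(0, X.shape[0], (batch_size,))
        loss = F.mse_loss(model(X[idx]), Y[idx])  # MSE loss
        opt.zero_grad()
        loss.backward()
        opt.step()
    return loss.item()                     # last minibatch MSE
```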

---

## Results

### 1. Training Loss Comparison

![Loss Comparison]()

| Metric | PlainMLP | ResMLP |
|--------|----------|--------|
| Initial Loss | 0.333 | 13.826 |
| Final Loss | 0.333 | 0.063 |
| Loss Reduction | **0%** | **99.5%** |
| Improvement | - | **5.3x better** |

**Key Observation**: PlainMLP shows **zero learning**: the loss stays flat at ~0.33 throughout training. ResMLP starts with a higher loss (the untrained residual stream sums 20 branch outputs on top of the input, inflating the output scale) but rapidly converges to 0.063.

### 2. Gradient Flow Analysis

![Gradient Flow]()

| Layer | PlainMLP Gradient | ResMLP Gradient |
|-------|-------------------|-----------------|
| Layer 1 (earliest) | **8.65 × 10⁻¹⁹** | 3.78 × 10⁻³ |
| Layer 10 (middle) | ~10⁻¹⁰ | ~2.5 × 10⁻³ |
| Layer 20 (last) | 6.61 × 10⁻³ | 1.91 × 10⁻³ |

**Critical Finding**: PlainMLP gradients at layer 1 are essentially **zero** (10⁻¹⁹ is numerical noise). This is the **vanishing gradient problem** in its most extreme form: the network cannot learn because gradients never reach the early layers.

ResMLP maintains gradients in the 10⁻³ range across all layers, which is healthy for learning.
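
One way the per-layer numbers above could be measured; a sketch assuming the model exposes its `nn.Linear` blocks as `model.layers` (an assumption about the implementation):

```python
import torch
import torch.nn.functional as F

def layer_grad_norms(model, X, Y):
    """Mean absolute weight gradient per layer after one backward pass."""
    model.zero_grad()
    loss = F.mse_loss(model(X), Y)
    loss.backward()
    return [layer.weight.grad.abs().mean().item() for layer in model.layers]
```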

### 3. Activation Statistics

![Activation Stats]()

| Metric | PlainMLP | ResMLP |
|--------|----------|--------|
| Std Range (across layers) | [0.0000, 0.1795] | [0.1348, 0.1767] |
| Layer 20 Std | ~0 | 0.135 |

**Key Observation**: PlainMLP activations **collapse to zero** in the later layers; the signal is completely lost by the time it reaches the output. ResMLP maintains stable activation statistics throughout.
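
These statistics could be collected with forward hooks; a sketch assuming `model.layers` holds the `nn.Linear` blocks (the hooks below record each block's pre-ReLU output, which may differ from the report's exact probe points):

```python
import torch

def activation_stds(model, X):
    """Std of each layer's output on one forward pass, via forward hooks."""
    stds = []
    hooks = [
        layer.register_forward_hook(lambda m, inp, out: stds.append(out.std().item()))
        for layer in model.layers
    ]
    with torch.no_grad():
        model(X)
    for h in hooks:       # always detach hooks after probing
        h.remove()
    return stds
```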

![Final Comparison]()

---

## Why This Happens

### PlainMLP: Multiplicative Gradient Path

In PlainMLP, gradients must flow through **all 20 layers multiplicatively**:

```
∂L/∂x₁ = ∂L/∂x₂₀ × ∂x₂₀/∂x₁₉ × ... × ∂x₂/∂x₁
```

With small weights (scaled by 1/√20 ≈ 0.224), each multiplication shrinks the gradient. After 20 layers:

- Theoretical gradient scale ≈ (0.224)²⁰ ≈ 10⁻¹³
- Observed: ~10⁻¹⁹, even smaller because ReLU zeroes out roughly half of the paths
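
The arithmetic above can be checked directly (pure Python; 0.224 is just 1/√20 from the initialization):

```python
import math

gain = 1 / math.sqrt(20)   # per-layer scale from the 1/sqrt(num_layers) init
grad_scale = gain ** 20    # shrinkage after 20 multiplicative steps
print(f"{gain:.3f} -> {grad_scale:.1e}")   # 0.224 -> 9.8e-14, i.e. ~10^-13
```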

### ResMLP: Additive Gradient Path

In ResMLP, the identity shortcut provides a **direct gradient path**:

```
∂L/∂x₁ = ∂L/∂x₂₀ × (1 + ∂f₂₀/∂x₁₉) × ... × (1 + ∂f₂/∂x₁)
```

The "1 +" terms ensure the product never vanishes completely: even if the residual-branch gradients are small, the identity path preserves gradient flow.
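
A quick numerical contrast of the two paths (the Jacobian scale 0.05 is an arbitrary illustrative value, not measured from the models):

```python
f_prime = 0.05                    # illustrative residual-branch Jacobian scale
plain_path = f_prime ** 20        # purely multiplicative: vanishes
res_path = (1 + f_prime) ** 20    # identity path keeps the product O(1)
print(plain_path, res_path)       # ~1e-26 vs ~2.65
```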

---

## Conclusions

1. **Residual connections are essential for deep networks**: with identical initialization, PlainMLP is completely untrainable (0% loss reduction) while ResMLP achieves a 99.5% loss reduction.

2. **Vanishing gradients are catastrophic**: PlainMLP gradients at layer 1 are ~10⁻¹⁹, effectively zero. No amount of further training can fix this.

3. **The identity shortcut is the key**: the only architectural difference is `x = f(x)` vs `x = x + f(x)`, yet it makes the difference between a dead network and a functional one.

4. **Fair comparison matters**: the previous experiment gave PlainMLP standard Kaiming initialization while ResMLP had scaled initialization. This fair comparison isolates the true effect of residual connections.

---

## Reproducibility

```bash
cd projects/resmlp_comparison
python experiment_fair.py
```

All results are saved to `results_fair.json` and plots to `plots_fair/`.

---

*Experiment conducted with PyTorch, random seed 42 for reproducibility.*