Commit `214f5ea` (verified) by AmberLJC, parent `b343478`: Upload report_fair.md with huggingface_hub

Files changed (1): `report_fair.md` (+126)
# PlainMLP vs ResMLP: Fair Comparison on Distant Identity Task

## Executive Summary

This experiment compares a 20-layer PlainMLP against a 20-layer ResMLP on a synthetic "Distant Identity" task (Y = X), using **identical initialization** for both models to ensure a fair comparison.

**Key Finding**: With fair initialization, the PlainMLP shows **complete gradient vanishing** (gradients at layer 1 are ~10⁻¹⁹), making it essentially untrainable. The ResMLP achieves a **5.3x lower final loss** and maintains healthy gradient flow through all layers.

---

## Experimental Setup

### Models (IDENTICAL Initialization)

| Property | PlainMLP | ResMLP |
|----------|----------|--------|
| Architecture | `x = ReLU(Linear(x))` | `x = x + ReLU(Linear(x))` |
| Layers | 20 | 20 |
| Hidden Dimension | 64 | 64 |
| Parameters | 83,200 | 83,200 |
| Weight Init | Kaiming He × 1/√20 | Kaiming He × 1/√20 |
| Bias Init | Zero | Zero |

**Critical**: Both models use **identical** weight initialization (Kaiming He scaled by 1/√num_layers). The ONLY difference is the residual connection.

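The setup above can be sketched in PyTorch. This is a minimal sketch, not the report's `experiment_fair.py`: the class and helper names (`PlainMLP`, `ResMLP`, `init_fair`) are illustrative, and only the layer structure and init scheme follow the table.

```python
import math
import torch
import torch.nn as nn

NUM_LAYERS, DIM = 20, 64

class PlainMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList(nn.Linear(DIM, DIM) for _ in range(NUM_LAYERS))
    def forward(self, x):
        for layer in self.layers:
            x = torch.relu(layer(x))       # x = ReLU(Linear(x))
        return x

class ResMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList(nn.Linear(DIM, DIM) for _ in range(NUM_LAYERS))
    def forward(self, x):
        for layer in self.layers:
            x = x + torch.relu(layer(x))   # x = x + ReLU(Linear(x))
        return x

def init_fair(model):
    # Kaiming He init scaled by 1/sqrt(num_layers); biases zeroed.
    for layer in model.layers:
        nn.init.kaiming_normal_(layer.weight, nonlinearity='relu')
        with torch.no_grad():
            layer.weight.mul_(1 / math.sqrt(NUM_LAYERS))
        nn.init.zeros_(layer.bias)

torch.manual_seed(42)
plain = PlainMLP()
init_fair(plain)
res = ResMLP()
res.load_state_dict(plain.state_dict())    # byte-identical weights in both models
```

Copying the state dict works here because the two classes share the same parameter layout; the forward pass is the only difference.
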
### Training Configuration

- **Task**: Learn the identity mapping Y = X
- **Data**: 1024 vectors, dimension 64, sampled from U(-1, 1)
- **Optimizer**: Adam (lr = 1e-3)
- **Batch Size**: 64
- **Training Steps**: 500
- **Loss**: MSE

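The configuration above translates to a short training loop. A hypothetical one-layer stand-in model keeps the sketch self-contained; the experiment itself trains the 20-layer PlainMLP and ResMLP.

```python
import torch
import torch.nn as nn

torch.manual_seed(42)
X = torch.rand(1024, 64) * 2 - 1         # 1024 vectors, dim 64, from U(-1, 1)
Y = X.clone()                            # identity target: Y = X

# Stand-in model for illustration only.
model = nn.Linear(64, 64)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

init_loss = loss_fn(model(X), Y).item()
for step in range(500):                  # 500 training steps
    idx = torch.randint(0, 1024, (64,))  # sample a batch of 64
    opt.zero_grad()
    loss = loss_fn(model(X[idx]), Y[idx])
    loss.backward()
    opt.step()
```
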
---

## Results

### 1. Training Loss Comparison

![Training Loss](training_loss.png)

| Metric | PlainMLP | ResMLP |
|--------|----------|--------|
| Initial Loss | 0.333 | 13.826 |
| Final Loss | 0.333 | 0.063 |
| Loss Reduction | **0%** | **99.5%** |
| Improvement | - | **5.3x lower final loss** |

**Key Observation**: PlainMLP shows **zero learning**: the loss stays flat at ~0.33 throughout training. ResMLP starts with a higher loss (the accumulated residuals inflate the initial output scale) but rapidly converges to 0.063.

### 2. Gradient Flow Analysis

![Gradient Magnitude](gradient_magnitude.png)

| Layer | PlainMLP Gradient | ResMLP Gradient |
|-------|-------------------|-----------------|
| Layer 1 (earliest) | **8.65 × 10⁻¹⁹** | 3.78 × 10⁻³ |
| Layer 10 (middle) | ~10⁻¹⁰ | ~2.5 × 10⁻³ |
| Layer 20 (last) | 6.61 × 10⁻³ | 1.91 × 10⁻³ |

**Critical Finding**: PlainMLP gradients at layer 1 are essentially **zero** (10⁻¹⁹ is numerical noise). This is the **vanishing gradient problem** in its most extreme form: the network cannot learn because gradients never reach the early layers.

ResMLP maintains gradients in the 10⁻³ range across all layers, which is healthy for learning.

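The per-layer gradient measurement can be reproduced with a short sketch (variable names are illustrative): build one stack of He-scaled linear layers, run it once with and once without the skip connection, and compare the weight-gradient norms at the first and last layers.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(42)
layers = nn.ModuleList(nn.Linear(64, 64) for _ in range(20))
for layer in layers:
    nn.init.kaiming_normal_(layer.weight, nonlinearity='relu')
    with torch.no_grad():
        layer.weight.mul_(1 / math.sqrt(20))    # the 1/sqrt(num_layers) scaling
    nn.init.zeros_(layer.bias)

def forward(x, residual):
    # Same weights either way; only the skip connection differs.
    for layer in layers:
        h = torch.relu(layer(x))
        x = x + h if residual else h
    return x

x = torch.rand(64, 64) * 2 - 1
grads = {}
for residual in (False, True):
    for layer in layers:
        layer.weight.grad = None                # reset between runs
    nn.functional.mse_loss(forward(x, residual), x).backward()
    grads[residual] = (layers[0].weight.grad.norm().item(),   # layer 1
                       layers[-1].weight.grad.norm().item())  # layer 20
print(grads)
```

The exact numbers depend on seed and data, but the qualitative gap (vanishingly small layer-1 gradients without the skip, healthy ones with it) matches the table.
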
### 3. Activation Statistics

![Activation Std](activation_std.png)

| Metric | PlainMLP | ResMLP |
|--------|----------|--------|
| Std Range | [0.0000, 0.1795] | [0.1348, 0.1767] |
| Layer 20 Std | ~0 | 0.135 |

**Key Observation**: PlainMLP activations **collapse to zero** in the later layers; the signal is completely lost by the time it reaches the output. ResMLP maintains stable activation statistics throughout.

![Activation Mean](activation_mean.png)

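The forward-pass collapse can be observed directly by recording the standard deviation of each block's output (a sketch with illustrative names; the values will not match the report's plots exactly, since the report's statistics come from its own run):

```python
import math
import torch
import torch.nn as nn

def make_layers():
    torch.manual_seed(0)                       # same weights for both runs
    layers = nn.ModuleList(nn.Linear(64, 64) for _ in range(20))
    for layer in layers:
        nn.init.kaiming_normal_(layer.weight, nonlinearity='relu')
        with torch.no_grad():
            layer.weight.mul_(1 / math.sqrt(20))
        nn.init.zeros_(layer.bias)
    return layers

torch.manual_seed(42)
x0 = torch.rand(256, 64) * 2 - 1
stds = {}
with torch.no_grad():
    for name, residual in (("plain", False), ("res", True)):
        x, per_layer = x0, []
        for layer in make_layers():
            out = torch.relu(layer(x))
            x = x + out if residual else out
            per_layer.append(x.std().item())   # std of each block's output
        stds[name] = per_layer
print(stds["plain"][-1], stds["res"][-1])
```
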
---

## Why This Happens

### PlainMLP: Multiplicative Gradient Path

In PlainMLP, gradients must flow through **all 20 layers multiplicatively**:

```
∂L/∂x₁ = ∂L/∂x₂₀ × ∂x₂₀/∂x₁₉ × ... × ∂x₂/∂x₁
```

With small weights (scaled by 1/√20 ≈ 0.224), each multiplication shrinks the gradient. After 20 layers:

- Gradient scale ≈ (0.224)²⁰ ≈ 10⁻¹³ (theoretical)
- Actual: ~10⁻¹⁹, even worse because ReLU zeros out inactive units

### ResMLP: Additive Gradient Path

In ResMLP, the identity shortcut provides a **direct gradient path**:

```
∂L/∂x₁ = ∂L/∂x₂₀ × (1 + ∂f₂₀/∂x₁₉) × ... × (1 + ∂f₂/∂x₁)
```

The "1 +" terms ensure the gradient never vanishes completely. Even if the residual-branch gradients are small, the identity path preserves gradient flow.

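A scalar caricature makes the gap between the two products concrete. Here 0.224 stands in for the per-layer Jacobian magnitude; real Jacobians are matrices, so this is only an order-of-magnitude sketch:

```python
scale = 0.224                    # per-layer factor, 1/sqrt(20) ≈ 0.224

plain_path = scale ** 20         # multiplicative chain: shrinks geometrically
res_path = 1.0
for _ in range(20):
    res_path *= 1 + scale        # each factor is at least 1, thanks to the identity term
print(f"plain: {plain_path:.1e}, residual: {res_path:.1f}")
```

The multiplicative chain lands around 10⁻¹³, matching the theoretical estimate above, while the residual chain never drops below 1.
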
---

## Conclusions

1. **Residual connections are essential for deep networks**: With identical initialization, PlainMLP is completely untrainable (0% loss reduction) while ResMLP achieves a 99.5% loss reduction.

2. **Vanishing gradients are catastrophic**: PlainMLP gradients at layer 1 are ~10⁻¹⁹, effectively zero. No amount of training can fix this.

3. **The identity shortcut is the key**: The only architectural difference is `x = f(x)` vs `x = x + f(x)`, yet it makes the difference between a dead network and a functional one.

4. **Fair comparison matters**: The previous experiment gave PlainMLP standard Kaiming init while ResMLP had scaled init; this comparison isolates the effect of the residual connection.

---

## Reproducibility

```bash
cd projects/resmlp_comparison
python experiment_fair.py
```

All results are saved to `results_fair.json` and plots to `plots_fair/`.

---

*Experiment conducted with PyTorch, random seed 42 for reproducibility.*