PlainMLP vs ResMLP: Fair Comparison on Distant Identity Task
Executive Summary
This experiment compares a 20-layer PlainMLP against a 20-layer ResMLP on a synthetic "Distant Identity" task (Y = X), using identical initialization for both models to ensure a fair comparison.
Key Finding: With fair initialization, the PlainMLP shows complete gradient vanishing (gradients at layer 1 are ~10⁻¹⁹), making it essentially untrainable. The ResMLP achieves a 5.3x lower final loss and maintains healthy gradient flow through all layers.
Experimental Setup
Models (IDENTICAL Initialization)
| Property | PlainMLP | ResMLP |
|---|---|---|
| Architecture | `x = ReLU(Linear(x))` | `x = x + ReLU(Linear(x))` |
| Layers | 20 | 20 |
| Hidden Dimension | 64 | 64 |
| Parameters | 83,200 | 83,200 |
| Weight Init | Kaiming He × 1/√20 | Kaiming He × 1/√20 |
| Bias Init | Zero | Zero |
Critical: Both models use identical weight initialization (Kaiming He scaled by 1/√num_layers). The ONLY difference is the residual connection.
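A minimal PyTorch sketch of the two architectures under the shared initialization described above. Class and helper names (`make_layers`, `PlainMLP`, `ResMLP`) are illustrative assumptions, not the experiment's actual code:

```python
import math
import torch
import torch.nn as nn

NUM_LAYERS, HIDDEN = 20, 64

def make_layers():
    """Kaiming He init scaled by 1/sqrt(num_layers), zero biases."""
    layers = nn.ModuleList(nn.Linear(HIDDEN, HIDDEN) for _ in range(NUM_LAYERS))
    for lin in layers:
        nn.init.kaiming_normal_(lin.weight, nonlinearity="relu")
        lin.weight.data.mul_(1 / math.sqrt(NUM_LAYERS))  # the 1/sqrt(20) scaling
        nn.init.zeros_(lin.bias)
    return layers

class PlainMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = make_layers()

    def forward(self, x):
        for lin in self.layers:
            x = torch.relu(lin(x))        # x = ReLU(Linear(x))
        return x

class ResMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = make_layers()

    def forward(self, x):
        for lin in self.layers:
            x = x + torch.relu(lin(x))    # x = x + ReLU(Linear(x))
        return x
```

Both variants share `make_layers`, so the weights are drawn identically; the residual addition in `ResMLP.forward` is the only difference. Each of the 20 layers has 64×64 + 64 = 4,160 parameters, giving the 83,200 total in the table.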
Training Configuration
- Task: Learn identity mapping Y = X
- Data: 1024 vectors, dimension 64, sampled from U(-1, 1)
- Optimizer: Adam (lr=1e-3)
- Batch Size: 64
- Training Steps: 500
- Loss: MSE
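The configuration above can be sketched as a short training loop. The `train` helper and the random-batch sampling are assumptions for illustration, not the actual `experiment_fair.py`:

```python
import torch

torch.manual_seed(42)
X = torch.rand(1024, 64) * 2 - 1          # 1024 vectors, dim 64, U(-1, 1)

def train(model, steps=500, batch_size=64, lr=1e-3):
    """MSE training on the identity task Y = X; returns final batch loss."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        idx = torch.randint(0, X.shape[0], (batch_size,))
        batch = X[idx]
        loss = torch.nn.functional.mse_loss(model(batch), batch)  # target is the input
        opt.zero_grad()
        loss.backward()
        opt.step()
    return loss.item()
```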
Results
1. Training Loss Comparison
| Metric | PlainMLP | ResMLP |
|---|---|---|
| Initial Loss | 0.333 | 13.826 |
| Final Loss | 0.333 | 0.063 |
| Loss Reduction | 0% | 99.5% |
| Improvement | - | 5.3x better |
Key Observation: PlainMLP shows zero learning - the loss stays flat at ~0.33 throughout training. ResMLP starts with higher loss (due to accumulated residuals) but rapidly converges to 0.063.
2. Gradient Flow Analysis
| Layer | PlainMLP Gradient | ResMLP Gradient |
|---|---|---|
| Layer 1 (earliest) | 8.65 × 10⁻¹⁹ | 3.78 × 10⁻³ |
| Layer 10 (middle) | ~10⁻¹⁰ | ~2.5 × 10⁻³ |
| Layer 20 (last) | 6.61 × 10⁻³ | 1.91 × 10⁻³ |
Critical Finding: PlainMLP gradients at layer 1 are essentially zero (10⁻¹⁹ is numerical noise). This is the vanishing gradient problem in its most extreme form. The network cannot learn because gradients don't reach early layers.
ResMLP maintains gradients in the 10⁻³ range across all layers - healthy for learning.
3. Activation Statistics
| Metric | PlainMLP | ResMLP |
|---|---|---|
| Std Range | [0.0000, 0.1795] | [0.1348, 0.1767] |
| Layer 20 Std | ~0 | 0.135 |
Key Observation: PlainMLP activations collapse to zero in later layers. The signal is completely lost by the time it reaches the output. ResMLP maintains stable activation statistics throughout.
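Per-layer activation statistics of this kind can be gathered with forward hooks. This sketch assumes the model exposes its nonlinearities as `nn.ReLU` modules; it is not the experiment's actual instrumentation:

```python
import torch

def activation_stds(model, X):
    """Std of each ReLU's output on one forward pass, collected via forward hooks."""
    stds, handles = [], []
    for m in model.modules():
        if isinstance(m, torch.nn.ReLU):
            handles.append(m.register_forward_hook(
                lambda mod, inp, out: stds.append(out.std().item())))
    with torch.no_grad():
        model(X)
    for h in handles:
        h.remove()  # always detach hooks so they don't fire on later passes
    return stds
```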
Why This Happens
PlainMLP: Multiplicative Gradient Path
In PlainMLP, gradients must flow through all 20 layers multiplicatively:
∂L/∂x₁ = ∂L/∂x₂₀ × ∂x₂₀/∂x₁₉ × ... × ∂x₂/∂x₁
With small weights (scaled by 1/√20 ≈ 0.224), each multiplication shrinks the gradient. After 20 layers:
- Gradient scale ≈ (0.224)²⁰ ≈ 10⁻¹³ (theoretical)
- Actual: 10⁻¹⁹ (even worse due to ReLU zeros)
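The back-of-envelope figure is easy to verify directly:

```python
scale = 20 ** -0.5        # 1/sqrt(20) ≈ 0.224, the per-layer weight scaling
total = scale ** 20       # attenuation compounded over 20 layers
print(f"{scale:.3f} {total:.1e}")  # on the order of 10^-13, matching the estimate
```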
ResMLP: Additive Gradient Path
In ResMLP, the identity shortcut provides a direct gradient path:
∂L/∂x₁ = ∂L/∂x₂₀ × (1 + ∂f₂₀/∂x₁₉) × ... × (1 + ∂f₂/∂x₁)
The "1 +" terms ensure gradients never vanish completely. Even if the residual branch gradients are small, the identity path preserves gradient flow.
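A scalar toy model makes the contrast concrete. Here `w = 0.224` stands in for the per-layer Jacobian scale; this is an illustration of the two gradient paths, not the experiment itself:

```python
import torch

DEPTH, W = 20, 0.224

# Plain path: y = w * y each layer, so dL/dx = w**DEPTH (vanishes)
x = torch.ones(1, requires_grad=True)
y = x
for _ in range(DEPTH):
    y = W * y
y.backward()
plain_grad = x.grad.item()

# Residual path: y = y + w * y each layer, so dL/dx = (1 + w)**DEPTH (never below 1)
x2 = torch.ones(1, requires_grad=True)
y = x2
for _ in range(DEPTH):
    y = y + W * y
y.backward()
res_grad = x2.grad.item()
```

The multiplicative path collapses to ~10⁻¹³ while the additive path stays well above 1: the "1 +" factor guarantees the identity route never attenuates the gradient.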
Conclusions
Residual connections are essential for deep networks: With identical initialization, PlainMLP is completely untrainable (0% loss reduction) while ResMLP achieves 99.5% loss reduction.
Vanishing gradients are catastrophic: PlainMLP gradients at layer 1 are 10⁻¹⁹ - effectively zero. No amount of training can fix this.
The identity shortcut is the key: The only architectural difference is `x = f(x)` vs `x = x + f(x)`, yet it makes the difference between a dead network and a functional one.
Fair comparison matters: The previous experiment gave PlainMLP standard Kaiming init while ResMLP had scaled init. This fair comparison shows the true power of residual connections.
Reproducibility
```shell
cd projects/resmlp_comparison
python experiment_fair.py
```
All results are saved to `results_fair.json` and plots to `plots_fair/`.
Experiment conducted with PyTorch, random seed 42 for reproducibility.