# 🧠 Understanding Residual Connections: PlainMLP vs ResMLP

[License: MIT](https://opensource.org/licenses/MIT)
[Python](https://www.python.org/downloads/)
[PyTorch](https://pytorch.org/)

A comprehensive visual deep dive into **why residual connections solve the vanishing gradient problem** and enable training of deep neural networks.

## 🎯 Key Finding

> With identical initialization and architecture, the **only difference being `+ x` (the residual connection)**, PlainMLP completely fails to learn (0% loss reduction) while ResMLP achieves a **99.5% loss reduction**.

| Model | Initial Loss | Final Loss | Loss Reduction |
|-------|--------------|------------|----------------|
| PlainMLP (20 layers) | 0.333 | 0.333 | **0%** ❌ |
| ResMLP (20 layers) | 13.826 | 0.063 | **99.5%** ✅ |

## 📊 Visual Results

### Training Loss Comparison
![Training loss comparison](plots_fair/training_loss.png)

### Gradient Flow Analysis

| Layer | PlainMLP Gradient | ResMLP Gradient |
|-------|-------------------|-----------------|
| Layer 1 (earliest) | 8.65 × 10⁻¹⁹ ❌ | 3.78 × 10⁻³ ✅ |
| Layer 10 (middle) | 1.07 × 10⁻⁹ | 2.52 × 10⁻³ ✅ |
| Layer 20 (last) | 6.61 × 10⁻³ | 1.91 × 10⁻³ ✅ |
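Gradient magnitudes like those in the table can be read off the weight gradients after a single backward pass. Here is a minimal sketch for a plain (non-residual) 20-layer stack; it uses PyTorch's default initialization and illustrative names, so the exact numbers will differ from the table above:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # illustrative seed, not from the experiment
layers = nn.ModuleList([nn.Linear(64, 64) for _ in range(20)])

x = torch.rand(256, 64) * 2 - 1           # inputs sampled from U(-1, 1)
h = x
for layer in layers:                      # plain forward: h = ReLU(Wh + b)
    h = torch.relu(layer(h))
loss = ((h - x) ** 2).mean()              # MSE against the identity target
loss.backward()

# Mean absolute weight gradient per layer, earliest first
mags = [layer.weight.grad.abs().mean().item() for layer in layers]
print(f"layer 1: {mags[0]:.2e}   layer 20: {mags[-1]:.2e}")
```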
## 🔬 Experimental Setup

### Task: Distant Identity (Y = X)
- **Input**: 1024 vectors of dimension 64, sampled from U(-1, 1)
- **Target**: Y = X (identity mapping)
- **Challenge**: Can a 20-layer network learn to simply pass its input through to its output?
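As a sketch, the dataset described above can be generated in a few lines (the seed and variable names are illustrative, not necessarily what `experiment_fair.py` uses):

```python
import torch

torch.manual_seed(0)  # illustrative seed choice

# 1024 vectors of dimension 64, sampled uniformly from [-1, 1)
X = torch.rand(1024, 64) * 2 - 1
Y = X.clone()  # identity target: Y = X

print(X.shape)  # torch.Size([1024, 64])
```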
### Architecture Comparison

| Component | PlainMLP | ResMLP |
|-----------|----------|--------|
| Layer operation | `x = ReLU(Linear(x))` | `x = x + ReLU(Linear(x))` |
| Depth | 20 layers | 20 layers |
| Hidden dimension | 64 | 64 |
| Parameters | 83,200 | 83,200 |
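The two layer operations above can be sketched in PyTorch. This is a minimal sketch; the class names and module layout are ours, not necessarily those of `experiment_fair.py`:

```python
import torch
import torch.nn as nn

class PlainMLP(nn.Module):
    """x = ReLU(Linear(x)) at every layer, no skip connection."""
    def __init__(self, dim=64, depth=20):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(dim, dim) for _ in range(depth)])

    def forward(self, x):
        for layer in self.layers:
            x = torch.relu(layer(x))
        return x

class ResMLP(nn.Module):
    """Identical, except each layer adds its input back: x = x + ReLU(Linear(x))."""
    def __init__(self, dim=64, depth=20):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(dim, dim) for _ in range(depth)])

    def forward(self, x):
        for layer in self.layers:
            x = x + torch.relu(layer(x))
        return x

# Both models have 20 * (64*64 + 64) = 83,200 parameters.
print(sum(p.numel() for p in PlainMLP().parameters()))  # 83200
print(sum(p.numel() for p in ResMLP().parameters()))    # 83200
```

Because parameter counts and shapes match exactly, any gap in training behavior is attributable to the `+ x` alone.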
### Fair Initialization (Critical!)
Both models use **identical initialization**:
- **Weights**: Kaiming (He) initialization × (1/√20) scaling
- **Biases**: zero
- **No LayerNorm, no BatchNorm, no dropout**
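A sketch of that initialization scheme; the helper name is ours, and the choice of `kaiming_normal_` over the uniform variant is an assumption the README does not settle:

```python
import math
import torch
import torch.nn as nn

def init_fair(model, depth=20):
    """Kaiming (He) init scaled by 1/sqrt(depth); biases zeroed."""
    for m in model.modules():
        if isinstance(m, nn.Linear):
            nn.init.kaiming_normal_(m.weight, nonlinearity='relu')
            with torch.no_grad():
                m.weight.mul_(1.0 / math.sqrt(depth))  # the 1/sqrt(20) scaling
            nn.init.zeros_(m.bias)

# Apply identically to both models so `+ x` is the only remaining difference.
net = nn.Sequential(*[nn.Linear(64, 64) for _ in range(20)])
init_fair(net)
```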
## 💡 The Core Insight

The residual connection `x = x + f(x)` does ONE simple but profound thing:

> **It ensures that the gradient of the output with respect to the input is always at least 1.**

### Without residual (`x = f(x)`):
```
∂output/∂input = ∂f/∂x
This can be < 1, and (small)²⁰ ≈ 0
```

### With residual (`x = x + f(x)`):
```
∂output/∂input = 1 + ∂f/∂x
This is always ≥ 1, so (≥1)²⁰ ≥ 1
```
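A scalar toy calculation makes the contrast concrete. The per-layer derivative of 0.5 below is an arbitrary illustrative value, not a number measured in the experiment:

```python
# Scalar model of the chain rule across 20 layers.
depth = 20
local = 0.5  # assume each layer's local derivative df/dx is 0.5

plain = local ** depth        # without residual: product of 20 small factors
res = (1 + local) ** depth    # with residual: each factor is 1 + df/dx

print(f"{plain:.1e}")  # 9.5e-07  (vanished)
print(f"{res:.1e}")    # 3.3e+03  (alive and well)
```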
## 📁 Repository Structure

```
resmlp_comparison/
├── README.md                    # This file
├── experiment_final.py          # Main experiment code
├── experiment_fair.py           # Fair comparison experiment
├── visualize_micro_world.py     # Visualization generation
├── results_fair.json            # Raw numerical results
├── report_final.md              # Detailed analysis report
│
├── plots_fair/                  # Primary result plots
│   ├── training_loss.png
│   ├── gradient_magnitude.png
│   ├── activation_mean.png
│   └── activation_std.png
│
└── plots_micro/                 # Educational visualizations
    ├── 1_signal_flow.png
    ├── 2_gradient_flow.png
    ├── 3_highway_concept.png
    ├── 4_chain_rule.png
    ├── 5_layer_transformation.png
    └── 6_learning_comparison.png
```
## 🚀 Quick Start

### Installation
```bash
pip install torch numpy matplotlib
```

### Run Experiment
```bash
# Run the main fair comparison experiment
python experiment_fair.py

# Generate micro-world visualizations
python visualize_micro_world.py
```
## 📈 Detailed Visualizations

### 1. Signal Flow Through Layers
![Signal flow](plots_micro/1_signal_flow.png)
- PlainMLP's signal collapses to near zero by layers 15-20
- ResMLP's signal stays stable through all layers

### 2. Gradient Flow (Backward Pass)
![Gradient flow](plots_micro/2_gradient_flow.png)
- PlainMLP: gradients decay from 10⁻³ to 10⁻¹⁹ (essentially zero!)
- ResMLP: gradients stay healthy at ~10⁻³ across ALL layers

### 3. The Highway Concept
![Highway concept](plots_micro/3_highway_concept.png)
- The `+ x` creates a direct "gradient highway" for information flow

### 4. Chain Rule Mathematics
![Chain rule](plots_micro/4_chain_rule.png)
- Visual explanation of why gradients vanish mathematically
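The collapse-versus-stability contrast in these figures can be reproduced in a few lines. Below is a sketch under the experiment's stated setup (Kaiming init scaled by 1/√20, zero biases); names and the seed are illustrative:

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
layers = [nn.Linear(64, 64) for _ in range(20)]
for layer in layers:                            # the "fair" initialization
    nn.init.kaiming_normal_(layer.weight, nonlinearity='relu')
    with torch.no_grad():
        layer.weight.mul_(1 / math.sqrt(20))
    nn.init.zeros_(layer.bias)

x = torch.rand(256, 64) * 2 - 1                 # inputs from U(-1, 1)
h_plain = h_res = x
for layer in layers:                            # same weights, one difference:
    h_plain = torch.relu(layer(h_plain))        #   plain step
    h_res = h_res + torch.relu(layer(h_res))    #   residual step (note `+ h_res`)

print(f"plain |h|: {h_plain.abs().mean():.1e}")  # collapses toward zero
print(f"res   |h|: {h_res.abs().mean():.1e}")    # stays on the order of the input
```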
## 🌟 Why This Matters

### Historical Context
Before ResNets (2015), training networks deeper than ~20 layers was extremely difficult due to vanishing gradients.

### The ResNet Revolution
He et al.'s simple insight enabled:
- **ImageNet SOTA** with 152 layers
- **The foundation for modern architectures**: Transformers use residual connections in every attention block
- **GPT, BERT, and Vision Transformers** all rely on this principle
## 📄 Reports

- [`report_final.md`](report_final.md) - Comprehensive analysis with all visualizations
- [`report_fair.md`](report_fair.md) - Fair comparison methodology
- [`report.md`](report.md) - Initial experiment report
## 📝 Citation

If you find this educational resource helpful, please consider citing:

```bibtex
@misc{resmlp_comparison,
  title={Understanding Residual Connections: A Visual Deep Dive},
  author={AmberLJC},
  year={2024},
  url={https://huggingface.co/AmberLJC/resmlp_comparison}
}
```
## 📜 License

MIT License - feel free to use this for educational purposes!

## 🙏 Acknowledgments

Inspired by the seminal work:
- He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. CVPR.