Commit 538e428 (verified) by AmberLJC · Parent: 2c24e4a

Upload README.md with huggingface_hub

Files changed (1): README.md added (+169 −0)
# 🧠 Understanding Residual Connections: PlainMLP vs ResMLP

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
[![PyTorch](https://img.shields.io/badge/PyTorch-2.0+-ee4c2c.svg)](https://pytorch.org/)

A comprehensive visual deep dive into **why residual connections solve the vanishing gradient problem** and enable training of deep neural networks.

## 🎯 Key Finding

> With identical initialization and architecture, the **only difference being `+ x` (the residual connection)**, PlainMLP completely fails to learn (0% loss reduction) while ResMLP achieves **99.5% loss reduction**.

| Model | Initial Loss | Final Loss | Loss Reduction |
|-------|-------------|------------|----------------|
| PlainMLP (20 layers) | 0.333 | 0.333 | **0%** ❌ |
| ResMLP (20 layers) | 13.826 | 0.063 | **99.5%** ✅ |

## 📊 Visual Results

### Training Loss Comparison
![Training Loss](plots_fair/training_loss.png)

### Gradient Flow Analysis
| Layer | PlainMLP Gradient | ResMLP Gradient |
|-------|-------------------|-----------------|
| Layer 1 (earliest) | 8.65 × 10⁻¹⁹ 💀 | 3.78 × 10⁻³ ✅ |
| Layer 10 (middle) | 1.07 × 10⁻⁹ | 2.52 × 10⁻³ ✅ |
| Layer 20 (last) | 6.61 × 10⁻³ | 1.91 × 10⁻³ ✅ |
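Per-layer numbers like these can be read straight off the parameters after a single backward pass. Below is a minimal sketch of one way to collect them; the MSE loss and the mean-absolute-value statistic are assumptions, and the repo's own measurement in `experiment_fair.py` may differ.

```python
import torch
import torch.nn as nn

def layer_gradient_magnitudes(model: nn.Module, x: torch.Tensor, y: torch.Tensor):
    """Mean absolute weight gradient of each Linear layer after one backward pass."""
    model.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    return [
        layer.weight.grad.abs().mean().item()
        for layer in model.modules()
        if isinstance(layer, nn.Linear)
    ]
```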
## 🔬 Experimental Setup

### Task: Distant Identity (Y = X)
- **Input**: 1024 vectors of dimension 64, sampled from U(-1, 1)
- **Target**: Y = X (identity mapping)
- **Challenge**: Can a 20-layer network learn to simply pass input to output? (A data-generation sketch follows below.)
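Concretely, the dataset can be built in a couple of lines. This is a sketch of the setup described above; the seed is arbitrary and the exact sampling call in the repo's script may differ.

```python
import torch

torch.manual_seed(0)              # any fixed seed; the repo's seed may differ
X = torch.rand(1024, 64) * 2 - 1  # 1024 vectors of dimension 64, uniform on (-1, 1)
Y = X.clone()                     # identity target: Y = X
```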
### Architecture Comparison

| Component | PlainMLP | ResMLP |
|-----------|----------|--------|
| Layer operation | `x = ReLU(Linear(x))` | `x = x + ReLU(Linear(x))` |
| Depth | 20 layers | 20 layers |
| Hidden dimension | 64 | 64 |
| Parameters | 83,200 | 83,200 |
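In code, the two columns above differ by a single addition in `forward`. Here is a minimal PyTorch sketch; the class names are illustrative, not necessarily the ones used in `experiment_fair.py`.

```python
import torch
import torch.nn as nn

class PlainBlock(nn.Module):
    """x = ReLU(Linear(x)) -- no skip connection."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, x):
        return torch.relu(self.linear(x))

class ResBlock(nn.Module):
    """x = x + ReLU(Linear(x)) -- identical, plus the skip connection."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, x):
        return x + torch.relu(self.linear(x))
```

Stacking 20 of either block, e.g. `nn.Sequential(*[ResBlock() for _ in range(20)])`, gives the 83,200-parameter models (20 × (64·64 + 64)) compared above.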
### Fair Initialization (Critical!)
Both models use **identical initialization**:
- **Weights**: Kaiming (He) initialization, scaled by 1/√20
- **Biases**: zero
- **No LayerNorm, no BatchNorm, no dropout**
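A sketch of how that shared initialization can be applied to either model; whether the repo uses the normal or uniform Kaiming variant is an assumption here.

```python
import math
import torch.nn as nn

def init_fair(model: nn.Module, depth: int = 20):
    """Kaiming (He) init scaled by 1/sqrt(depth), zero biases, no norm layers anywhere."""
    for module in model.modules():
        if isinstance(module, nn.Linear):
            nn.init.kaiming_normal_(module.weight, nonlinearity='relu')
            module.weight.data.mul_(1.0 / math.sqrt(depth))
            nn.init.zeros_(module.bias)
```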
## 🎓 The Core Insight

The residual connection `x = x + f(x)` does ONE simple but profound thing:

> **It ensures that the gradient of the output with respect to the input always contains an identity term of 1, so it can never decay to zero.**

### Without residual (`x = f(x)`):
```
∂output/∂input = ∂f/∂x
This can be < 1, and (small)²⁰ → 0
```

### With residual (`x = x + f(x)`):
```
∂output/∂input = 1 + ∂f/∂x
With ∂f/∂x ≥ 0 this is ≥ 1, so (≥1)²⁰ ≥ 1 and the gradient survives
```
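A quick back-of-the-envelope check of those two products, using a hypothetical per-layer factor of 0.1 standing in for ∂f/∂x:

```python
factor = 0.1                       # hypothetical per-layer |∂f/∂x|
depth = 20

plain = factor ** depth            # (∂f/∂x)^20      = 1e-20  -> gradient vanishes
residual = (1 + factor) ** depth   # (1 + ∂f/∂x)^20  ≈ 6.7    -> gradient survives

print(f"plain: {plain:.1e}, residual: {residual:.1f}")
```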
## 📁 Repository Structure

```
resmlp_comparison/
├── README.md                  # This file
├── experiment_final.py        # Main experiment code
├── experiment_fair.py         # Fair comparison experiment
├── visualize_micro_world.py   # Visualization generation
├── results_fair.json          # Raw numerical results
├── report_final.md            # Detailed analysis report
│
├── plots_fair/                # Primary result plots
│   ├── training_loss.png
│   ├── gradient_magnitude.png
│   ├── activation_mean.png
│   └── activation_std.png
│
└── plots_micro/               # Educational visualizations
    ├── 1_signal_flow.png
    ├── 2_gradient_flow.png
    ├── 3_highway_concept.png
    ├── 4_chain_rule.png
    ├── 5_layer_transformation.png
    └── 6_learning_comparison.png
```
## 🚀 Quick Start

### Installation
```bash
pip install torch numpy matplotlib
```

### Run Experiment
```bash
# Run the main fair comparison experiment
python experiment_fair.py

# Generate micro-world visualizations
python visualize_micro_world.py
```
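For readers who want the comparison inline rather than through the scripts, the training step amounts to regression on Y = X. Below is a rough sketch only; full-batch training, the Adam optimizer, learning rate, and step count are all assumptions, and `experiment_fair.py` is the authoritative version.

```python
import torch
import torch.nn as nn

def train(model: nn.Module, X: torch.Tensor, Y: torch.Tensor,
          steps: int = 2000, lr: float = 1e-3):
    """Full-batch MSE regression on the Y = X task; returns the loss history."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    losses = []
    for _ in range(steps):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(X), Y)
        loss.backward()
        opt.step()
        losses.append(loss.item())
    return losses
```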
## 📈 Detailed Visualizations

### 1. Signal Flow Through Layers
![Signal Flow](plots_micro/1_signal_flow.png)
- PlainMLP signal collapses to near-zero by layers 15-20
- ResMLP signal stays stable throughout all layers

### 2. Gradient Flow (Backward Pass)
![Gradient Flow](plots_micro/2_gradient_flow.png)
- PlainMLP: gradient decays from 10⁻³ to 10⁻¹⁹ (essentially zero!)
- ResMLP: gradient stays healthy at ~10⁻³ across ALL layers

### 3. The Highway Concept
![Highway](plots_micro/3_highway_concept.png)
- The `+ x` creates a direct "gradient highway" for information flow

### 4. Chain Rule Mathematics
![Chain Rule](plots_micro/4_chain_rule.png)
- Visual explanation of why gradients vanish mathematically
## 🔑 Why This Matters

### Historical Context
Before ResNets (2015), training networks deeper than ~20 layers was extremely difficult due to vanishing gradients.

### The ResNet Revolution
He et al.'s simple insight enabled:
- **ImageNet SOTA** with 152 layers
- **Foundation for modern architectures**: Transformers use residual connections in every attention block
- **GPT, BERT, Vision Transformers** all rely on this principle
## 📚 Reports

- [`report_final.md`](report_final.md) - Comprehensive analysis with all visualizations
- [`report_fair.md`](report_fair.md) - Fair comparison methodology
- [`report.md`](report.md) - Initial experiment report

## 📖 Citation

If you find this educational resource helpful, please consider citing:

```bibtex
@misc{resmlp_comparison,
  title={Understanding Residual Connections: A Visual Deep Dive},
  author={AmberLJC},
  year={2024},
  url={https://huggingface.co/AmberLJC/resmlp_comparison}
}
```

## 📄 License

MIT License - feel free to use for educational purposes!

## 🙏 Acknowledgments

Inspired by the seminal work:
- He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. CVPR.