# 🧠 Understanding Residual Connections: PlainMLP vs ResMLP

[License: MIT](https://opensource.org/licenses/MIT)
[Python](https://www.python.org/downloads/)
[PyTorch](https://pytorch.org/)

A comprehensive visual deep dive into **why residual connections solve the vanishing gradient problem** and enable training of deep neural networks.

## 🎯 Key Finding

> With identical initialization and architecture, the **only difference being `+ x` (the residual connection)**, PlainMLP completely fails to learn (0% loss reduction) while ResMLP achieves a **99.5% loss reduction**.

| Model | Initial Loss | Final Loss | Loss Reduction |
|-------|--------------|------------|----------------|
| PlainMLP (20 layers) | 0.333 | 0.333 | **0%** ❌ |
| ResMLP (20 layers) | 13.826 | 0.063 | **99.5%** ✅ |

## 📊 Visual Results

### Training Loss Comparison
![Training loss comparison](plots_fair/training_loss.png)

### Gradient Flow Analysis

| Layer | PlainMLP Gradient | ResMLP Gradient |
|-------|-------------------|-----------------|
| Layer 1 (earliest) | 8.65 × 10⁻¹⁹ ❌ | 3.78 × 10⁻³ ✅ |
| Layer 10 (middle) | 1.07 × 10⁻⁹ | 2.52 × 10⁻³ ✅ |
| Layer 20 (last) | 6.61 × 10⁻³ | 1.91 × 10⁻³ ✅ |
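Gradient magnitudes like those in the table can be read off the weight gradients after a single backward pass. Here is a minimal sketch for a plain (non-residual) 20-layer stack; it uses PyTorch's default initialization and illustrative names, so the exact numbers will differ from the table above:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # illustrative seed, not from the experiment
layers = nn.ModuleList([nn.Linear(64, 64) for _ in range(20)])

x = torch.rand(256, 64) * 2 - 1           # inputs sampled from U(-1, 1)
h = x
for layer in layers:                      # plain forward: h = ReLU(Wh + b)
    h = torch.relu(layer(h))
loss = ((h - x) ** 2).mean()              # MSE against the identity target
loss.backward()

# Mean absolute weight gradient per layer, earliest first
mags = [layer.weight.grad.abs().mean().item() for layer in layers]
print(f"layer 1: {mags[0]:.2e}   layer 20: {mags[-1]:.2e}")
```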
## 🔬 Experimental Setup

### Task: Distant Identity (Y = X)
- **Input**: 1024 vectors of dimension 64, sampled from U(-1, 1)
- **Target**: Y = X (identity mapping)
- **Challenge**: Can a 20-layer network learn to simply pass its input through to its output?
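As a sketch, the dataset described above can be generated in a few lines (the seed and variable names are illustrative, not necessarily what `experiment_fair.py` uses):

```python
import torch

torch.manual_seed(0)  # illustrative seed choice

# 1024 vectors of dimension 64, sampled uniformly from [-1, 1)
X = torch.rand(1024, 64) * 2 - 1
Y = X.clone()  # identity target: Y = X

print(X.shape)  # torch.Size([1024, 64])
```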
### Architecture Comparison

| Component | PlainMLP | ResMLP |
|-----------|----------|--------|
| Layer operation | `x = ReLU(Linear(x))` | `x = x + ReLU(Linear(x))` |
| Depth | 20 layers | 20 layers |
| Hidden dimension | 64 | 64 |
| Parameters | 83,200 | 83,200 |
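The two layer operations above can be sketched in PyTorch. This is a minimal sketch; the class names and module layout are ours, not necessarily those of `experiment_fair.py`:

```python
import torch
import torch.nn as nn

class PlainMLP(nn.Module):
    """x = ReLU(Linear(x)) at every layer, no skip connection."""
    def __init__(self, dim=64, depth=20):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(dim, dim) for _ in range(depth)])

    def forward(self, x):
        for layer in self.layers:
            x = torch.relu(layer(x))
        return x

class ResMLP(nn.Module):
    """Identical, except each layer adds its input back: x = x + ReLU(Linear(x))."""
    def __init__(self, dim=64, depth=20):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(dim, dim) for _ in range(depth)])

    def forward(self, x):
        for layer in self.layers:
            x = x + torch.relu(layer(x))
        return x

# Both models have 20 * (64*64 + 64) = 83,200 parameters.
print(sum(p.numel() for p in PlainMLP().parameters()))  # 83200
print(sum(p.numel() for p in ResMLP().parameters()))    # 83200
```

Because parameter counts and shapes match exactly, any gap in training behavior is attributable to the `+ x` alone.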
### Fair Initialization (Critical!)
Both models use **identical initialization**:
- **Weights**: Kaiming (He) initialization × (1/√20) scaling
- **Biases**: zero
- **No LayerNorm, no BatchNorm, no dropout**
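A sketch of that initialization scheme; the helper name is ours, and the choice of `kaiming_normal_` over the uniform variant is an assumption the README does not settle:

```python
import math
import torch
import torch.nn as nn

def init_fair(model, depth=20):
    """Kaiming (He) init scaled by 1/sqrt(depth); biases zeroed."""
    for m in model.modules():
        if isinstance(m, nn.Linear):
            nn.init.kaiming_normal_(m.weight, nonlinearity='relu')
            with torch.no_grad():
                m.weight.mul_(1.0 / math.sqrt(depth))  # the 1/sqrt(20) scaling
            nn.init.zeros_(m.bias)

# Apply identically to both models so `+ x` is the only remaining difference.
net = nn.Sequential(*[nn.Linear(64, 64) for _ in range(20)])
init_fair(net)
```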
## 💡 The Core Insight

The residual connection `x = x + f(x)` does ONE simple but profound thing:

> **It ensures that the gradient of the output with respect to the input is always at least 1.**

### Without residual (`x = f(x)`):
```
∂output/∂input = ∂f/∂x
This can be < 1, and (small)²⁰ ≈ 0
```

### With residual (`x = x + f(x)`):
```
∂output/∂input = 1 + ∂f/∂x
This is always ≥ 1, so (≥1)²⁰ ≥ 1
```
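A scalar toy calculation makes the contrast concrete. The per-layer derivative of 0.5 below is an arbitrary illustrative value, not a number measured in the experiment:

```python
# Scalar model of the chain rule across 20 layers.
depth = 20
local = 0.5  # assume each layer's local derivative df/dx is 0.5

plain = local ** depth        # without residual: product of 20 small factors
res = (1 + local) ** depth    # with residual: each factor is 1 + df/dx

print(f"{plain:.1e}")  # 9.5e-07  (vanished)
print(f"{res:.1e}")    # 3.3e+03  (alive and well)
```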
## 📁 Repository Structure

```
resmlp_comparison/
├── README.md                    # This file
├── experiment_final.py          # Main experiment code
├── experiment_fair.py           # Fair comparison experiment
├── visualize_micro_world.py     # Visualization generation
├── results_fair.json            # Raw numerical results
├── report_final.md              # Detailed analysis report
│
├── plots_fair/                  # Primary result plots
│   ├── training_loss.png
│   ├── gradient_magnitude.png
│   ├── activation_mean.png
│   └── activation_std.png
│
└── plots_micro/                 # Educational visualizations
    ├── 1_signal_flow.png
    ├── 2_gradient_flow.png
    ├── 3_highway_concept.png
    ├── 4_chain_rule.png
    ├── 5_layer_transformation.png
    └── 6_learning_comparison.png
```
## 🚀 Quick Start

### Installation
```bash
pip install torch numpy matplotlib
```

### Run Experiment
```bash
# Run the main fair comparison experiment
python experiment_fair.py

# Generate micro-world visualizations
python visualize_micro_world.py
```
## 📈 Detailed Visualizations

### 1. Signal Flow Through Layers
![Signal flow](plots_micro/1_signal_flow.png)
- PlainMLP's signal collapses to near zero by layers 15-20
- ResMLP's signal stays stable through all layers

### 2. Gradient Flow (Backward Pass)
![Gradient flow](plots_micro/2_gradient_flow.png)
- PlainMLP: gradients decay from 10⁻³ to 10⁻¹⁹ (essentially zero!)
- ResMLP: gradients stay healthy at ~10⁻³ across ALL layers

### 3. The Highway Concept
![Highway concept](plots_micro/3_highway_concept.png)
- The `+ x` creates a direct "gradient highway" for information flow

### 4. Chain Rule Mathematics
![Chain rule](plots_micro/4_chain_rule.png)
- Visual explanation of why gradients vanish mathematically
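The collapse-versus-stability contrast in these figures can be reproduced in a few lines. Below is a sketch under the experiment's stated setup (Kaiming init scaled by 1/√20, zero biases); names and the seed are illustrative:

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
layers = [nn.Linear(64, 64) for _ in range(20)]
for layer in layers:                            # the "fair" initialization
    nn.init.kaiming_normal_(layer.weight, nonlinearity='relu')
    with torch.no_grad():
        layer.weight.mul_(1 / math.sqrt(20))
    nn.init.zeros_(layer.bias)

x = torch.rand(256, 64) * 2 - 1                 # inputs from U(-1, 1)
h_plain = h_res = x
for layer in layers:                            # same weights, one difference:
    h_plain = torch.relu(layer(h_plain))        #   plain step
    h_res = h_res + torch.relu(layer(h_res))    #   residual step (note `+ h_res`)

print(f"plain |h|: {h_plain.abs().mean():.1e}")  # collapses toward zero
print(f"res   |h|: {h_res.abs().mean():.1e}")    # stays on the order of the input
```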
## 🌟 Why This Matters

### Historical Context
Before ResNets (2015), training networks deeper than ~20 layers was extremely difficult due to vanishing gradients.

### The ResNet Revolution
He et al.'s simple insight enabled:
- **ImageNet SOTA** with 152 layers
- **The foundation for modern architectures**: Transformers use residual connections in every attention block
- **GPT, BERT, and Vision Transformers** all rely on this principle
## 📄 Reports

- [`report_final.md`](report_final.md) - Comprehensive analysis with all visualizations
- [`report_fair.md`](report_fair.md) - Fair comparison methodology
- [`report.md`](report.md) - Initial experiment report
## 📝 Citation

If you find this educational resource helpful, please consider citing:

```bibtex
@misc{resmlp_comparison,
  title={Understanding Residual Connections: A Visual Deep Dive},
  author={AmberLJC},
  year={2024},
  url={https://huggingface.co/AmberLJC/resmlp_comparison}
}
```
## 📜 License

MIT License - feel free to use this for educational purposes!

## 🙏 Acknowledgments

Inspired by the seminal work:
- He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. CVPR.