# 🧠 Understanding Residual Connections: PlainMLP vs ResMLP
A comprehensive visual deep dive into why residual connections solve the vanishing gradient problem and enable training of deep neural networks.
## 🎯 Key Finding
With identical initialization and architecture, the only difference being the `+ x` residual connection, PlainMLP completely fails to learn (0% loss reduction) while ResMLP achieves a 99.5% loss reduction.
| Model | Initial Loss | Final Loss | Loss Reduction |
|---|---|---|---|
| PlainMLP (20 layers) | 0.333 | 0.333 | 0% ❌ |
| ResMLP (20 layers) | 13.826 | 0.063 | 99.5% ✅ |
## 📊 Visual Results
### Training Loss Comparison
### Gradient Flow Analysis
| Layer | PlainMLP Gradient | ResMLP Gradient |
|---|---|---|
| Layer 1 (earliest) | 8.65 × 10⁻¹⁹ ❌ | 3.78 × 10⁻³ ✅ |
| Layer 10 (middle) | 1.07 × 10⁻⁹ | 2.52 × 10⁻³ ✅ |
| Layer 20 (last) | 6.61 × 10⁻³ | 1.91 × 10⁻³ ✅ |
## 🔬 Experimental Setup
### Task: Distant Identity (Y = X)
- Input: 1024 vectors of dimension 64, sampled from U(-1, 1)
- Target: Y = X (identity mapping)
- Challenge: Can a 20-layer network learn to simply pass input to output?
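The task data above can be sketched in a few lines; this uses NumPy for brevity (the repository's scripts use PyTorch, but the setup is the same), and the seed is an illustrative choice:

```python
import numpy as np

# 1024 input vectors of dimension 64, sampled from U(-1, 1)
rng = np.random.default_rng(seed=0)
X = rng.uniform(-1.0, 1.0, size=(1024, 64)).astype(np.float32)

# Identity target: the network must learn to pass X through unchanged
Y = X.copy()
```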
### Architecture Comparison
| Component | PlainMLP | ResMLP |
|---|---|---|
| Layer operation | `x = ReLU(Linear(x))` | `x = x + ReLU(Linear(x))` |
| Depth | 20 layers | 20 layers |
| Hidden dimension | 64 | 64 |
| Parameters | 83,200 | 83,200 |
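The two layer rules in the table can be sketched in PyTorch as follows; the class and argument names here are illustrative, not the repository's exact code:

```python
import torch
import torch.nn as nn

class PlainMLP(nn.Module):
    def __init__(self, dim=64, depth=20):
        super().__init__()
        self.layers = nn.ModuleList(nn.Linear(dim, dim) for _ in range(depth))

    def forward(self, x):
        for layer in self.layers:
            x = torch.relu(layer(x))      # signal is replaced at every layer
        return x

class ResMLP(nn.Module):
    def __init__(self, dim=64, depth=20):
        super().__init__()
        self.layers = nn.ModuleList(nn.Linear(dim, dim) for _ in range(depth))

    def forward(self, x):
        for layer in self.layers:
            x = x + torch.relu(layer(x))  # identity path added on top
        return x

# Both have 20 x (64*64 + 64) = 83,200 parameters
print(sum(p.numel() for p in PlainMLP().parameters()))  # 83200
```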
### Fair Initialization (Critical!)
Both models use identical initialization:
- Weights: Kaiming (He) initialization × (1/√20) scaling
- Biases: Zero
- No LayerNorm, no BatchNorm, no dropout
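This initialization scheme could be applied like so; `init_fair` is a hypothetical helper for illustration, not the repository's exact function:

```python
import math
import torch
import torch.nn as nn

def init_fair(model: nn.Module, depth: int = 20) -> None:
    """Kaiming (He) init scaled by 1/sqrt(depth), zero biases."""
    for m in model.modules():
        if isinstance(m, nn.Linear):
            nn.init.kaiming_normal_(m.weight, nonlinearity="relu")
            with torch.no_grad():
                m.weight.mul_(1.0 / math.sqrt(depth))  # depth-aware downscale
            nn.init.zeros_(m.bias)
```

Applying the same function to both models is what makes the comparison fair: any difference in outcome must come from the `+ x` connection, not the starting weights.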
## 💡 The Core Insight
The residual connection `x = x + f(x)` does ONE simple but profound thing:

It ensures that the gradient of the output with respect to the input always contains a direct identity term, so it cannot vanish multiplicatively with depth.

Without residual (`x = f(x)`):

∂output/∂input = ∂f/∂x

Each per-layer factor can be < 1, and (small)²⁰ ≈ 0.

With residual (`x = x + f(x)`):

∂output/∂input = 1 + ∂f/∂x

The identity term keeps each factor close to 1, so the product over 20 layers stays well away from zero.
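The arithmetic behind this argument can be checked directly; the per-layer derivative value `0.5` below is illustrative, not measured from the models:

```python
# Product of per-layer gradient factors over 20 layers
df = 0.5          # illustrative |df/dx| at each layer
plain = 1.0       # without residual: factors are df/dx
res = 1.0         # with residual: factors are 1 + df/dx
for _ in range(20):
    plain *= df
    res *= (1.0 + df)

print(plain)  # 0.5**20 ≈ 9.5e-07 -> gradient vanishes
print(res)    # 1.5**20 ≈ 3325   -> gradient survives
```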
## 📁 Repository Structure
resmlp_comparison/
├── README.md                    # This file
├── experiment_final.py          # Main experiment code
├── experiment_fair.py           # Fair comparison experiment
├── visualize_micro_world.py     # Visualization generation
├── results_fair.json            # Raw numerical results
├── report_final.md              # Detailed analysis report
│
├── plots_fair/                  # Primary result plots
│   ├── training_loss.png
│   ├── gradient_magnitude.png
│   ├── activation_mean.png
│   └── activation_std.png
│
└── plots_micro/                 # Educational visualizations
    ├── 1_signal_flow.png
    ├── 2_gradient_flow.png
    ├── 3_highway_concept.png
    ├── 4_chain_rule.png
    ├── 5_layer_transformation.png
    └── 6_learning_comparison.png
## 🚀 Quick Start
### Installation
`pip install torch numpy matplotlib`
### Run Experiment
Run the main fair comparison experiment:

`python experiment_fair.py`

Generate micro-world visualizations:

`python visualize_micro_world.py`
## 🔍 Detailed Visualizations
### 1. Signal Flow Through Layers
- PlainMLP signal collapses to near zero by layers 15-20
- ResMLP signal stays stable throughout all layers
### 2. Gradient Flow (Backward Pass)
- PlainMLP: Gradient decays from 10⁻³ to 10⁻¹⁹ (essentially zero!)
- ResMLP: Gradient stays healthy at ~10β»Β³ across ALL layers
### 3. The Highway Concept
- The `+ x` creates a direct "gradient highway" for information flow
### 4. Chain Rule Mathematics
- Visual explanation of why gradients vanish mathematically
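Per-layer gradient magnitudes like those reported earlier can be collected with a single backward pass. A minimal sketch, assuming a PyTorch model whose linear layers are discoverable via `model.modules()` (the helper name and dimensions are illustrative):

```python
import torch
import torch.nn as nn

def grad_norms_per_linear(model: nn.Module, x: torch.Tensor, y: torch.Tensor):
    """One MSE backward pass; mean |grad| of each Linear layer's weights."""
    model.zero_grad()
    nn.functional.mse_loss(model(x), y).backward()
    return [m.weight.grad.abs().mean().item()
            for m in model.modules() if isinstance(m, nn.Linear)]

# Example on a small plain 20-layer stack with identity targets
plain = nn.Sequential(*[nn.Sequential(nn.Linear(64, 64), nn.ReLU())
                        for _ in range(20)])
x = torch.rand(32, 64) * 2 - 1
print(grad_norms_per_linear(plain, x, x))
```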
## 🌟 Why This Matters
### Historical Context
Before ResNets (2015), training networks deeper than ~20 layers was extremely difficult due to vanishing gradients.
### The ResNet Revolution
He et al.'s simple insight enabled:
- ImageNet SOTA with 152 layers
- Foundation for modern architectures: Transformers use residual connections in every attention block
- GPT, BERT, Vision Transformers all rely on this principle
## 📄 Reports
- `report_final.md` - Comprehensive analysis with all visualizations
- `report_fair.md` - Fair comparison methodology
- `report.md` - Initial experiment report
## 📝 Citation
If you find this educational resource helpful, please consider citing:
@misc{resmlp_comparison,
title={Understanding Residual Connections: A Visual Deep Dive},
author={AmberLJC},
year={2024},
url={https://huggingface.co/AmberLJC/resmlp_comparison}
}
## 📜 License
MIT License - feel free to use for educational purposes!
## 🙏 Acknowledgments
Inspired by the seminal work:
- He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. CVPR.




