# 🧠 Activation Functions: Deep Neural Network Analysis
[License: MIT](https://opensource.org/licenses/MIT)
[Python](https://www.python.org/downloads/)
[PyTorch](https://pytorch.org/)
> **Empirical evidence for the vanishing gradient problem and why modern activations (ReLU, GELU) dominate deep learning.**
This repository provides a comprehensive comparison of 5 activation functions in deep neural networks, demonstrating the **vanishing gradient problem** with Sigmoid and why modern activations enable training of deep networks.
---
## 🎯 Key Findings
| Activation | Final MSE | Gradient Ratio (L1/L10) | Status |
|------------|-----------|-------------------------|--------|
| **ReLU** | **0.008** | 1.93 (stable) | ✅ Excellent |
| **Leaky ReLU** | **0.008** | 0.72 (stable) | ✅ Excellent |
| **GELU** | **0.008** | 0.83 (stable) | ✅ Excellent |
| Linear | 0.213 | 0.84 (stable) | ⚠️ Cannot learn non-linearity |
| Sigmoid | 0.518 | **2.59×10⁷** (vanishing!) | ❌ Failed |
### 🔬 The Vanishing Gradient Problem - Visualized
```
Sigmoid Network (10 layers):
Layer 1  ████████████████████████████████████████ Gradient: 5.04×10⁻¹
Layer 5  ████████████ Gradient: 1.02×10⁻⁴
Layer 10 █ Gradient: 1.94×10⁻⁸ ← 26 MILLION times smaller!

ReLU Network (10 layers):
Layer 1  ████████████████████████████████████████ Gradient: 2.70×10⁻³
Layer 5  ██████████████████████████████████████ Gradient: 2.10×10⁻³
Layer 10 ████████████████████████████████████████ Gradient: 1.36×10⁻³ ← Healthy flow!
```
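The gradient-flow comparison above can be reproduced with a short PyTorch sketch. This mirrors the experiment's 10×64 architecture, but the initialization (PyTorch defaults), the single backward pass, and the helper name `layer_gradient_norms` are my own assumptions, so the exact numbers will differ from the figures:

```python
import torch
import torch.nn as nn

def layer_gradient_norms(activation, depth=10, width=64, seed=42):
    """Build a deep MLP, run one backward pass on the sine task,
    and return the gradient norm of each hidden layer's weights."""
    torch.manual_seed(seed)
    layers, in_dim = [], 1
    for _ in range(depth):
        layers += [nn.Linear(in_dim, width), activation()]
        in_dim = width
    layers.append(nn.Linear(in_dim, 1))
    model = nn.Sequential(*layers)

    x = torch.linspace(-torch.pi, torch.pi, 200).unsqueeze(1)
    loss = nn.MSELoss()(model(x), torch.sin(x))
    loss.backward()
    # Gradient norm of each hidden Linear layer's weight matrix
    return [m.weight.grad.norm().item()
            for m in model if isinstance(m, nn.Linear)][:depth]

sig = layer_gradient_norms(nn.Sigmoid)
relu = layer_gradient_norms(nn.ReLU)

# Sigmoid: layer gradient norms span many orders of magnitude.
# ReLU: norms stay within roughly the same order of magnitude.
print(f"sigmoid spread: {max(sig) / min(sig):.2e}")
print(f"relu spread:    {max(relu) / min(relu):.2e}")
```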
---
## 📊 Visual Results
### Learned Functions

*ReLU, Leaky ReLU, and GELU perfectly approximate the sine wave. Linear learns only a straight line. Sigmoid completely fails to learn.*
### Training Dynamics

### Gradient Flow Analysis

### Comprehensive Summary

---
## 🧪 Experimental Setup
### Architecture
- **Network**: 10 hidden layers × 64 neurons each
- **Task**: 1D non-linear regression (sine wave approximation)
- **Dataset**: `y = sin(x) + ε`, where `x ∈ [-π, π]` and `ε ~ N(0, 0.1)`
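A dataset matching this description can be generated in a few lines. This is a sketch: uniform sampling of `x` and reading 0.1 as the noise standard deviation are both assumptions, as is the seed:

```python
import numpy as np

rng = np.random.default_rng(42)  # seed choice is an assumption

# 200 training points: y = sin(x) + noise, x uniform on [-pi, pi].
# Here 0.1 is taken to be the noise standard deviation.
x = rng.uniform(-np.pi, np.pi, size=(200, 1))
y = np.sin(x) + rng.normal(0.0, 0.1, size=x.shape)
```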
### Training Configuration
```python
optimizer = Adam(model.parameters(), lr=0.001)
loss_fn = MSELoss()
epochs = 500
batch_size = 200  # full batch
seed = 42
```
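Putting the architecture and configuration together, a minimal full-batch training loop might look like the following. This is a sketch rather than the repository's `train.py`; the model construction and data generation are assumptions based on the setup described above:

```python
import torch
import torch.nn as nn

torch.manual_seed(42)

# 10 hidden layers x 64 neurons, here with ReLU
layers, in_dim = [], 1
for _ in range(10):
    layers += [nn.Linear(in_dim, 64), nn.ReLU()]
    in_dim = 64
model = nn.Sequential(*layers, nn.Linear(64, 1))

# Full-batch sine data: y = sin(x) + noise
x = torch.empty(200, 1).uniform_(-torch.pi, torch.pi)
y = torch.sin(x) + 0.1 * torch.randn_like(x)

optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
loss_fn = nn.MSELoss()

losses = []
for epoch in range(500):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())
```

With ReLU the loss should fall well below the initial value within the 500 epochs; swapping in `nn.Sigmoid()` reproduces the failure mode discussed below.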
### Activation Functions Tested
| Function | Formula | Gradient Range |
|----------|---------|----------------|
| Linear | `f(x) = x` | Always 1 |
| Sigmoid | `f(x) = 1/(1+e⁻ˣ)` | (0, 0.25] |
| ReLU | `f(x) = max(0, x)` | {0, 1} |
| Leaky ReLU | `f(x) = max(0.01x, x)` | {0.01, 1} |
| GELU | `f(x) = x·Φ(x)` | Smooth, ≈(-0.13, 1.13) |
---
## 🚀 Quick Start
### Installation
```bash
git clone https://huggingface.co/AmberLJC/activation_functions
cd activation_functions
pip install torch numpy matplotlib
```
### Run the Experiment
```bash
# Basic 5-activation comparison
python train.py
# Extended tutorial with 8 activations and 4 experiments
python tutorial_experiments.py
# Training dynamics analysis
python train_dynamics.py
```
---
## 📁 Repository Structure
```
activation_functions/
├── README.md                            # This file
├── report.md                            # Detailed analysis report
├── activation_tutorial.md               # Educational tutorial
│
├── train.py                             # Main experiment (5 activations)
├── tutorial_experiments.py              # Extended experiments (8 activations)
├── train_dynamics.py                    # Training dynamics analysis
│
├── learned_functions.png                # Predictions vs ground truth
├── loss_curves.png                      # Training loss over epochs
├── gradient_flow.png                    # Gradient magnitude per layer
├── hidden_activations.png               # Activation patterns
├── summary_figure.png                   # 9-panel comprehensive summary
│
├── exp1_gradient_flow.png               # Extended gradient analysis
├── exp2_activation_distributions.png    # Activation distribution analysis
├── exp2_sparsity_dead_neurons.png       # Sparsity and dead neuron analysis
├── exp3_stability.png                   # Training stability analysis
├── exp4_predictions.png                 # Function approximation comparison
├── exp4_representational_heatmap.png    # Representational capacity heatmap
│
├── activation_evolution.png             # Activation evolution during training
├── gradient_evolution.png               # Gradient evolution during training
├── training_dynamics_functions.png      # Training dynamics visualization
├── training_dynamics_summary.png        # Training dynamics summary
│
├── loss_histories.json                  # Raw loss data
├── gradient_magnitudes.json             # Gradient measurements
├── gradient_magnitudes_epochs.json      # Gradient evolution data
├── exp1_gradient_flow.json              # Extended gradient data
└── final_losses.json                    # Final MSE per activation
```
---
## 💡 Key Insights
### Why Sigmoid Fails in Deep Networks
The **vanishing gradient problem** occurs because:
1. **The sigmoid derivative is bounded**: max σ'(x) = 0.25, attained at x = 0
2. **The chain rule multiplies gradients**: across 10 layers, gradient ≈ (0.25)¹⁰ ≈ 10⁻⁶
3. **Early layers stop learning**: the gradient signal vanishes before reaching the input layers
```python
# Theoretical gradient decay for Sigmoid: each layer contributes
# at most a factor of 0.25 via the chain rule
gradient_layer_10 = gradient_output * 0.25 ** 10
#                 ≈ gradient_output * 0.000001
#                 ≈ 0  (effectively zero!)
```
### Why ReLU Works
ReLU maintains **unit gradient** for positive inputs:
```python
# ReLU gradient: f'(x) = 1 if x > 0 else 0
# No multiplicative decay for active units!
gradient_layer_10 = gradient_output * 1 ** 10  # = gradient_output
```
### Practical Recommendations
| Use Case | Recommended |
|----------|-------------|
| Default choice | ReLU or Leaky ReLU |
| Transformers/LLMs | GELU |
| Very deep networks | Leaky ReLU + skip connections |
| Output (classification) | Sigmoid/Softmax |
| Output (regression) | Linear |
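In PyTorch, following these recommendations comes down to swapping the activation module when building the network. A minimal sketch (`make_mlp` is a hypothetical helper, not part of this repository):

```python
import torch.nn as nn

def make_mlp(activation, depth=10, width=64):
    """Build the experiment-style MLP with a pluggable hidden activation."""
    layers, in_dim = [], 1
    for _ in range(depth):
        layers += [nn.Linear(in_dim, width), activation()]
        in_dim = width
    layers.append(nn.Linear(in_dim, 1))  # linear output head for regression
    return nn.Sequential(*layers)

model = make_mlp(nn.GELU)                # e.g. GELU, as recommended for transformers
deep = make_mlp(nn.LeakyReLU, depth=50)  # Leaky ReLU for very deep stacks
```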
---
## 📈 Extended Experiments
The `tutorial_experiments.py` script includes 4 additional experiments:
1. **Gradient Flow Analysis** - Depths 5, 10, 20, 50 layers
2. **Activation Distributions** - Sparsity and dead neuron analysis
3. **Training Stability** - Learning rate and depth sensitivity
4. **Representational Capacity** - Multiple target function approximation
---
## 📚 References
- [Deep Learning Book - Chapter 6.3: Hidden Units](https://www.deeplearningbook.org/)
- [Glorot & Bengio (2010): Understanding the difficulty of training deep feedforward neural networks](http://proceedings.mlr.press/v9/glorot10a.html)
- [He et al. (2015): Delving Deep into Rectifiers](https://arxiv.org/abs/1502.01852)
- [Hendrycks & Gimpel (2016): GELU](https://arxiv.org/abs/1606.08415)
---
## 📖 Citation
```bibtex
@misc{activation_functions_analysis,
  title={Activation Functions: Deep Neural Network Analysis},
  author={Orchestra Research},
  year={2024},
  publisher={HuggingFace},
  url={https://huggingface.co/AmberLJC/activation_functions}
}
```
---
## 📄 License
MIT License - feel free to use for education and research!
---
*Generated by Orchestra Research Assistant*