# 🧠 Activation Functions: Deep Neural Network Analysis

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
[![PyTorch](https://img.shields.io/badge/PyTorch-2.0+-ee4c2c.svg)](https://pytorch.org/)

> **Empirical evidence for the vanishing gradient problem and why modern activations (ReLU, GELU) dominate deep learning.**

This repository provides a comprehensive comparison of five activation functions in deep neural networks, demonstrating the **vanishing gradient problem** with Sigmoid and showing why modern activations make deep networks trainable.

---

## 🎯 Key Findings

| Activation | Final MSE | Gradient Ratio (L1/L10) | Status |
|------------|-----------|-------------------------|--------|
| **ReLU** | **0.008** | 1.93 (stable) | βœ… Excellent |
| **Leaky ReLU** | **0.008** | 0.72 (stable) | βœ… Excellent |
| **GELU** | **0.008** | 0.83 (stable) | βœ… Excellent |
| Linear | 0.213 | 0.84 (stable) | ⚠️ Cannot learn non-linearity |
| Sigmoid | 0.518 | **2.59Γ—10⁷** (vanishing!) | ❌ Failed |
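
The ratio in the last column compares per-layer weight-gradient norms after a backward pass. A minimal sketch of how such a measurement could look (the helper name and details are illustrative, not the repo's measurement code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative helper (not the repo's code): ratio of the first hidden
# layer's weight-gradient norm to the tenth's after one backward pass.
def gradient_ratio(model: nn.Sequential, x: torch.Tensor, y: torch.Tensor) -> float:
    model.zero_grad()
    F.mse_loss(model(x), y).backward()
    hidden = [m for m in model if isinstance(m, nn.Linear)][:-1]  # drop output layer
    first = hidden[0].weight.grad.norm().item()
    last = hidden[-1].weight.grad.norm().item()
    return first / last  # β‰ˆ1 is healthy; ≫1 means the gradient collapses across depth
```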

### πŸ”¬ The Vanishing Gradient Problem - Visualized

```
Sigmoid Network (10 layers):
Layer 1  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ  Gradient: 5.04Γ—10⁻¹
Layer 5  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ                              Gradient: 1.02Γ—10⁻⁴  
Layer 10 ▏                                         Gradient: 1.94Γ—10⁻⁸  ← 26 MILLION times smaller!

ReLU Network (10 layers):
Layer 1  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ  Gradient: 2.70Γ—10⁻³
Layer 5  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ    Gradient: 2.10Γ—10⁻³
Layer 10 β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ  Gradient: 1.36Γ—10⁻³  ← Healthy flow!
```

---

## πŸ“Š Visual Results

### Learned Functions
![Learned Functions](learned_functions.png)

*ReLU, Leaky ReLU, and GELU closely approximate the sine wave; Linear can only fit a straight line; Sigmoid fails to learn the target at all.*

### Training Dynamics
![Loss Curves](loss_curves.png)

### Gradient Flow Analysis
![Gradient Flow](gradient_flow.png)

### Comprehensive Summary
![Summary](summary_figure.png)

---

## πŸ§ͺ Experimental Setup

### Architecture
- **Network**: 10 hidden layers Γ— 64 neurons each
- **Task**: 1D non-linear regression (sine wave approximation)
- **Dataset**: `y = sin(x) + Ξ΅`, where `x ∈ [-Ο€, Ο€]` and `Ξ΅ ~ N(0, 0.1)`
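
For concreteness, a minimal sketch of this dataset in PyTorch (200 uniformly sampled points and a noise standard deviation of 0.1 are assumptions consistent with the configuration below):

```python
import torch

# y = sin(x) + Ξ΅ on x ∈ [-Ο€, Ο€]; the exact sampling scheme is an assumption.
torch.manual_seed(42)
x = torch.empty(200, 1).uniform_(-torch.pi, torch.pi)
y = torch.sin(x) + 0.1 * torch.randn_like(x)
```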

### Training Configuration
```python
optimizer  = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn    = torch.nn.MSELoss()
epochs     = 500
batch_size = 200          # full batch: the entire 200-sample dataset per step
torch.manual_seed(42)     # seed for reproducibility
```
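
Putting the configuration together, here is a minimal full-batch training loop consistent with these settings; the `make_mlp` helper and loop structure are illustrative sketches, not the repo's exact code:

```python
import torch
import torch.nn as nn

# Illustrative 10-hidden-layer Γ— 64-unit MLP (architecture from this README;
# the helper name and construction details are assumptions).
def make_mlp(activation, depth=10, width=64):
    layers = [nn.Linear(1, width), activation()]
    for _ in range(depth - 1):
        layers += [nn.Linear(width, width), activation()]
    layers.append(nn.Linear(width, 1))  # linear output head for regression
    return nn.Sequential(*layers)

torch.manual_seed(42)
x = torch.empty(200, 1).uniform_(-torch.pi, torch.pi)  # dataset from above
y = torch.sin(x) + 0.1 * torch.randn_like(x)

model = make_mlp(nn.ReLU)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for epoch in range(500):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)  # full-batch MSE
    loss.backward()
    optimizer.step()
```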

### Activation Functions Tested
| Function | Formula | Gradient Range |
|----------|---------|----------------|
| Linear | `f(x) = x` | Always 1 |
| Sigmoid | `f(x) = 1/(1+e⁻ˣ)` | (0, 0.25] |
| ReLU | `f(x) = max(0, x)` | {0, 1} |
| Leaky ReLU | `f(x) = max(0.01x, x)` | {0.01, 1} |
| GELU | `f(x) = xΒ·Ξ¦(x)` | Smooth, β‰ˆ(βˆ’0.13, 1.13) |
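
These ranges are easy to spot-check with autograd; a small sketch (the probe points are arbitrary):

```python
import torch
import torch.nn.functional as F

# Probe each activation's derivative at a few points via autograd.
activations = {
    "Linear":     lambda t: t,
    "Sigmoid":    torch.sigmoid,
    "ReLU":       F.relu,
    "Leaky ReLU": lambda t: F.leaky_relu(t, negative_slope=0.01),
    "GELU":       F.gelu,
}

for name, fn in activations.items():
    x = torch.tensor([-2.0, 0.0, 2.0], requires_grad=True)
    fn(x).sum().backward()
    print(f"{name:>10}: f'(-2, 0, 2) = {[round(g, 3) for g in x.grad.tolist()]}")
```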

---

## πŸš€ Quick Start

### Installation
```bash
git clone https://huggingface.co/AmberLJC/activation_functions
cd activation_functions
pip install torch numpy matplotlib
```

### Run the Experiment
```bash
# Basic 5-activation comparison
python train.py

# Extended tutorial with 8 activations and 4 experiments
python tutorial_experiments.py

# Training dynamics analysis
python train_dynamics.py
```

---

## πŸ“ Repository Structure

```
activation_functions/
β”œβ”€β”€ README.md                          # This file
β”œβ”€β”€ report.md                          # Detailed analysis report
β”œβ”€β”€ activation_tutorial.md             # Educational tutorial
β”‚
β”œβ”€β”€ train.py                           # Main experiment (5 activations)
β”œβ”€β”€ tutorial_experiments.py            # Extended experiments (8 activations)
β”œβ”€β”€ train_dynamics.py                  # Training dynamics analysis
β”‚
β”œβ”€β”€ learned_functions.png              # Predictions vs ground truth
β”œβ”€β”€ loss_curves.png                    # Training loss over epochs
β”œβ”€β”€ gradient_flow.png                  # Gradient magnitude per layer
β”œβ”€β”€ hidden_activations.png             # Activation patterns
β”œβ”€β”€ summary_figure.png                 # 9-panel comprehensive summary
β”‚
β”œβ”€β”€ exp1_gradient_flow.png             # Extended gradient analysis
β”œβ”€β”€ exp2_activation_distributions.png  # Activation distribution analysis
β”œβ”€β”€ exp2_sparsity_dead_neurons.png     # Sparsity and dead neuron analysis
β”œβ”€β”€ exp3_stability.png                 # Training stability analysis
β”œβ”€β”€ exp4_predictions.png               # Function approximation comparison
β”œβ”€β”€ exp4_representational_heatmap.png  # Representational capacity heatmap
β”‚
β”œβ”€β”€ activation_evolution.png           # Activation evolution during training
β”œβ”€β”€ gradient_evolution.png             # Gradient evolution during training
β”œβ”€β”€ training_dynamics_functions.png    # Training dynamics visualization
β”œβ”€β”€ training_dynamics_summary.png      # Training dynamics summary
β”‚
β”œβ”€β”€ loss_histories.json                # Raw loss data
β”œβ”€β”€ gradient_magnitudes.json           # Gradient measurements
β”œβ”€β”€ gradient_magnitudes_epochs.json    # Gradient evolution data
β”œβ”€β”€ exp1_gradient_flow.json            # Extended gradient data
└── final_losses.json                  # Final MSE per activation
```

---

## πŸ“– Key Insights

### Why Sigmoid Fails in Deep Networks

The **vanishing gradient problem** occurs because:

1. **Sigmoid derivative is bounded**: max(Οƒ'(x)) = 0.25 at x=0
2. **Chain rule multiplies gradients**: For 10 layers, gradient β‰ˆ (0.25)¹⁰ β‰ˆ 10⁻⁢
3. **Early layers don't learn**: Gradient signal vanishes before reaching input layers

```python
# Theoretical worst-case decay for Sigmoid: each layer multiplies the
# gradient by at most max(Οƒ'(x)) = 0.25.
gradient_scale = 0.25 ** 10
print(f"{gradient_scale:.1e}")  # 9.5e-07 -- effectively zero after 10 layers
```

### Why ReLU Works

ReLU maintains **unit gradient** for positive inputs:

```python
def relu_grad(x):
    # ReLU derivative: exactly 1 on the active (positive) side, 0 otherwise
    return 1.0 if x > 0 else 0.0

# Each active layer contributes a factor of 1, so 10 layers scale the
# gradient by 1 ** 10 == 1: no multiplicative decay.
```

### Practical Recommendations

| Use Case | Recommended |
|----------|-------------|
| Default choice | ReLU or Leaky ReLU |
| Transformers/LLMs | GELU |
| Very deep networks | Leaky ReLU + skip connections |
| Output (classification) | Sigmoid/Softmax |
| Output (regression) | Linear |

---

## πŸ“š Extended Experiments

The `tutorial_experiments.py` script includes 4 additional experiments:

1. **Gradient Flow Analysis** - Depths 5, 10, 20, 50 layers
2. **Activation Distributions** - Sparsity and dead neuron analysis
3. **Training Stability** - Learning rate and depth sensitivity
4. **Representational Capacity** - Multiple target function approximation

---

## πŸ”— References

- [Deep Learning Book - Chapter 6.3: Hidden Units](https://www.deeplearningbook.org/)
- [Glorot & Bengio (2010): Understanding the difficulty of training deep feedforward neural networks](http://proceedings.mlr.press/v9/glorot10a.html)
- [He et al. (2015): Delving Deep into Rectifiers](https://arxiv.org/abs/1502.01852)
- [Hendrycks & Gimpel (2016): GELU](https://arxiv.org/abs/1606.08415)

---

## πŸ“„ Citation

```bibtex
@misc{activation_functions_analysis,
  title={Activation Functions: Deep Neural Network Analysis},
  author={Orchestra Research},
  year={2024},
  publisher={HuggingFace},
  url={https://huggingface.co/AmberLJC/activation_functions}
}
```

---

## πŸ“œ License

MIT License - feel free to use for education and research!

---

*Generated by Orchestra Research Assistant*