# 🧠 Activation Functions: Deep Neural Network Analysis

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
[![PyTorch](https://img.shields.io/badge/PyTorch-2.0+-ee4c2c.svg)](https://pytorch.org/)

> **Empirical evidence for the vanishing gradient problem and why modern activations (ReLU, GELU) dominate deep learning.**

This repository provides a comprehensive comparison of 5 activation functions in deep neural networks, demonstrating the **vanishing gradient problem** with Sigmoid and showing why modern activations make deep networks trainable.

---

## 🎯 Key Findings

| Activation | Final MSE | Gradient Ratio (L1/L10) | Status |
|------------|-----------|-------------------------|--------|
| **ReLU** | **0.008** | 1.93 (stable) | ✅ Excellent |
| **Leaky ReLU** | **0.008** | 0.72 (stable) | ✅ Excellent |
| **GELU** | **0.008** | 0.83 (stable) | ✅ Excellent |
| Linear | 0.213 | 0.84 (stable) | ⚠️ Cannot learn non-linearity |
| Sigmoid | 0.518 | **2.59×10⁷** (vanishing!) | ❌ Failed |

### 🔬 The Vanishing Gradient Problem - Visualized

```
Sigmoid Network (10 layers):
Layer 1  ████████████████████████████████████████ Gradient: 5.04×10⁻¹
Layer 5  ████████████ Gradient: 1.02×10⁻⁴
Layer 10 ▏ Gradient: 1.94×10⁻⁸ ← 26 MILLION times smaller!

ReLU Network (10 layers):
Layer 1  ████████████████████████████████████████ Gradient: 2.70×10⁻³
Layer 5  ██████████████████████████████████████ Gradient: 2.10×10⁻³
Layer 10 ████████████████████████████████████████ Gradient: 1.36×10⁻³ ← Healthy flow!
```

---

## 📊 Visual Results

### Learned Functions
![Learned Functions](learned_functions.png)

*ReLU, Leaky ReLU, and GELU perfectly approximate the sine wave. Linear learns only a straight line. Sigmoid completely fails to learn.*

### Training Dynamics
![Loss Curves](loss_curves.png)

### Gradient Flow Analysis
![Gradient Flow](gradient_flow.png)

### Comprehensive Summary
![Summary](summary_figure.png)

---

## 🧪 Experimental Setup

### Architecture
- **Network**: 10 hidden layers × 64 neurons each
- **Task**: 1D non-linear regression (sine wave approximation)
- **Dataset**: `y = sin(x) + ε`, where `x ∈ [-π, π]` and `ε ~ N(0, 0.1)`

### Training Configuration
```python
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
loss_fn = torch.nn.MSELoss()
epochs = 500
batch_size = 200  # full batch
torch.manual_seed(42)
```
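The dataset above is easy to re-create. The following standard-library sketch is illustrative only: `make_data` is a name invented here, and the repo's actual sampling scheme may differ (e.g. random uniform draws rather than an evenly spaced grid).

```python
import math
import random

def make_data(n=200, noise_std=0.1, seed=42):
    """Sketch of the dataset: y = sin(x) + eps, with x in [-pi, pi]."""
    rng = random.Random(seed)
    # Evenly spaced grid over [-pi, pi] (an assumption of this sketch)
    xs = [-math.pi + 2 * math.pi * i / (n - 1) for i in range(n)]
    # Gaussian noise with standard deviation 0.1, as stated above
    ys = [math.sin(x) + rng.gauss(0.0, noise_std) for x in xs]
    return xs, ys

xs, ys = make_data()
```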

### Activation Functions Tested
| Function | Formula | Gradient Range |
|----------|---------|----------------|
| Linear | `f(x) = x` | Always 1 |
| Sigmoid | `f(x) = 1/(1+e⁻ˣ)` | (0, 0.25] |
| ReLU | `f(x) = max(0, x)` | {0, 1} |
| Leaky ReLU | `f(x) = max(0.01x, x)` | {0.01, 1} |
| GELU | `f(x) = x·Φ(x)` | Smooth, ≈(−0.13, 1.13) |

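For concreteness, the table's formulas can be written out in plain Python. This is a reference sketch using only the standard library; the repo itself presumably uses the corresponding `torch.nn` activations.

```python
import math

# Scalar reference implementations of the five activations
def linear(x):     return x
def sigmoid(x):    return 1.0 / (1.0 + math.exp(-x))
def relu(x):       return max(0.0, x)
def leaky_relu(x): return x if x > 0 else 0.01 * x
def gelu(x):
    # Exact GELU: x * Phi(x), with Phi the standard normal CDF via erf
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # maximum 0.25, reached at x = 0
```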

---

## 🚀 Quick Start

### Installation
```bash
git clone https://huggingface.co/AmberLJC/activation_functions
cd activation_functions
pip install torch numpy matplotlib
```

### Run the Experiment
```bash
# Basic 5-activation comparison
python train.py

# Extended tutorial with 8 activations and 4 experiments
python tutorial_experiments.py

# Training dynamics analysis
python train_dynamics.py
```

---

## 📁 Repository Structure

```
activation_functions/
├── README.md                         # This file
├── report.md                         # Detailed analysis report
├── activation_tutorial.md            # Educational tutorial
│
├── train.py                          # Main experiment (5 activations)
├── tutorial_experiments.py           # Extended experiments (8 activations)
├── train_dynamics.py                 # Training dynamics analysis
│
├── learned_functions.png             # Predictions vs ground truth
├── loss_curves.png                   # Training loss over epochs
├── gradient_flow.png                 # Gradient magnitude per layer
├── hidden_activations.png            # Activation patterns
├── summary_figure.png                # 9-panel comprehensive summary
│
├── exp1_gradient_flow.png            # Extended gradient analysis
├── exp2_activation_distributions.png # Activation distribution analysis
├── exp2_sparsity_dead_neurons.png    # Sparsity and dead-neuron analysis
├── exp3_stability.png                # Training stability analysis
├── exp4_predictions.png              # Function approximation comparison
├── exp4_representational_heatmap.png # Representational capacity heatmap
│
├── activation_evolution.png          # Activation evolution during training
├── gradient_evolution.png            # Gradient evolution during training
├── training_dynamics_functions.png   # Training dynamics visualization
├── training_dynamics_summary.png     # Training dynamics summary
│
├── loss_histories.json               # Raw loss data
├── gradient_magnitudes.json          # Gradient measurements
├── gradient_magnitudes_epochs.json   # Gradient evolution data
├── exp1_gradient_flow.json           # Extended gradient data
└── final_losses.json                 # Final MSE per activation
```

---

## 📖 Key Insights

### Why Sigmoid Fails in Deep Networks

The **vanishing gradient problem** occurs because:

1. **The Sigmoid derivative is bounded**: max σ'(x) = 0.25, reached at x = 0
2. **The chain rule multiplies gradients**: for 10 layers, gradient ≈ (0.25)¹⁰ ≈ 10⁻⁶
3. **Early layers don't learn**: the gradient signal vanishes before reaching the input layers

```python
# Theoretical worst-case gradient decay for Sigmoid (10 layers)
gradient_layer_10 = gradient_output * 0.25 ** 10
#                 ≈ gradient_output * 0.000001
#                 ≈ 0  # effectively zero!
```
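The multiplication in point 2 can be checked by backpropagating by hand through a toy 10-layer Sigmoid chain. Unit weights and no biases are assumptions of this sketch, not the repo's setup.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Forward pass: push one scalar through 10 Sigmoid layers (unit weights)
a = 0.5
outputs = []
for _ in range(10):
    a = sigmoid(a)
    outputs.append(a)

# Backward pass: each layer contributes a factor sigmoid'(z) <= 0.25
grad = 1.0
for a in reversed(outputs):
    grad *= a * (1.0 - a)  # sigmoid'(z) written in terms of the output a

print(f"gradient after 10 layers: {grad:.2e}")  # on the order of 1e-7
```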

### Why ReLU Works

ReLU maintains a **unit gradient** for positive inputs:

```python
# ReLU gradient
relu_grad = lambda x: 1.0 if x > 0 else 0.0

# No multiplicative decay along active paths:
# gradient_layer_10 ≈ gradient_output * 1**10 = gradient_output
```
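This can also be checked by backpropagating by hand through a toy 10-layer ReLU chain (unit weights and a positive input are assumptions of this sketch, not the repo's measurement code):

```python
# Forward and backward through 10 ReLU layers with unit weights.
# A positive input stays positive, so every local gradient is exactly 1.
a = 0.5
grad = 1.0
for _ in range(10):
    a = max(0.0, a)                 # forward: ReLU (0.5 stays 0.5)
    grad *= 1.0 if a > 0 else 0.0   # backward: ReLU'

print(grad)  # 1.0 -- no decay
```

Note the caveat: units that land in the negative region get a gradient of exactly 0, which is the "dead neuron" failure mode analyzed in the extended experiments below.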

### Practical Recommendations

| Use Case | Recommended |
|----------|-------------|
| Default choice | ReLU or Leaky ReLU |
| Transformers/LLMs | GELU |
| Very deep networks | Leaky ReLU + skip connections |
| Output (classification) | Sigmoid/Softmax |
| Output (regression) | Linear |

---

## 📚 Extended Experiments

The `tutorial_experiments.py` script includes 4 additional experiments:

1. **Gradient Flow Analysis** - depths of 5, 10, 20, and 50 layers
2. **Activation Distributions** - sparsity and dead-neuron analysis
3. **Training Stability** - learning-rate and depth sensitivity
4. **Representational Capacity** - approximation of multiple target functions

---

## 🔗 References

- [Deep Learning Book, Chapter 6.3: Hidden Units](https://www.deeplearningbook.org/)
- [Glorot & Bengio (2010): Understanding the difficulty of training deep feedforward neural networks](http://proceedings.mlr.press/v9/glorot10a.html)
- [He et al. (2015): Delving Deep into Rectifiers](https://arxiv.org/abs/1502.01852)
- [Hendrycks & Gimpel (2016): Gaussian Error Linear Units (GELU)](https://arxiv.org/abs/1606.08415)

---

## 📄 Citation

```bibtex
@misc{activation_functions_analysis,
  title={Activation Functions: Deep Neural Network Analysis},
  author={Orchestra Research},
  year={2024},
  publisher={HuggingFace},
  url={https://huggingface.co/AmberLJC/activation_functions}
}
```

---

## 📜 License

MIT License. Feel free to use this material for education and research!

---

*Generated by Orchestra Research Assistant*