Activation Functions: Deep Neural Network Analysis
Empirical evidence for the vanishing gradient problem and why modern activations (ReLU, GELU) dominate deep learning.
This repository provides a comprehensive comparison of 5 activation functions in deep neural networks, demonstrating the vanishing gradient problem with Sigmoid and why modern activations enable training of deep networks.
Key Findings
| Activation | Final MSE | Gradient Ratio (L10/L1) | Status |
|---|---|---|---|
| ReLU | 0.008 | 1.93 (stable) | Excellent |
| Leaky ReLU | 0.008 | 0.72 (stable) | Excellent |
| GELU | 0.008 | 0.83 (stable) | Excellent |
| Linear | 0.213 | 0.84 (stable) | Cannot learn non-linearity |
| Sigmoid | 0.518 | 2.59×10⁷ (vanishing!) | Failed |
The Vanishing Gradient Problem - Visualized
Sigmoid Network (10 layers):
Layer 1  ████████████████████████████████████████ Gradient: 5.04×10⁻¹
Layer 5  ████████████ Gradient: 1.02×10⁻⁴
Layer 10 █ Gradient: 1.94×10⁻⁸ ← 26 MILLION times smaller!
ReLU Network (10 layers):
Layer 1  ████████████████████████████████████████ Gradient: 2.70×10⁻³
Layer 5  ██████████████████████████████████████ Gradient: 2.10×10⁻³
Layer 10 ████████████████████████████████████████ Gradient: 1.36×10⁻³ ← Healthy flow!
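The "26 million" figure can be recomputed from the bars above. The following is a small sketch that uses only the rounded gradient magnitudes printed in the visualization, so the results may differ slightly from ratios computed on the unrounded measurements. The helper name `spread` is introduced here for illustration only.

```python
# Gradient magnitudes copied from the visualization above (rounded values).
sigmoid_grads = {1: 5.04e-1, 5: 1.02e-4, 10: 1.94e-8}
relu_grads = {1: 2.70e-3, 5: 2.10e-3, 10: 1.36e-3}


def spread(grads):
    """Ratio between the largest and smallest per-layer gradient magnitude."""
    return max(grads.values()) / min(grads.values())


sigmoid_spread = spread(sigmoid_grads)
relu_spread = spread(relu_grads)

print(f"Sigmoid: largest/smallest gradient = {sigmoid_spread:.2e}")
print(f"ReLU:    largest/smallest gradient = {relu_spread:.2f}")
```

For Sigmoid the spread comes out at roughly 2.6×10⁷, while for ReLU all three layers stay within the same order of magnitude.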
Visual Results
Learned Functions
ReLU, Leaky ReLU, and GELU closely approximate the sine wave (final MSE 0.008). Linear learns only a straight line, and Sigmoid fails to learn the function (final MSE 0.518).
Training Dynamics
Gradient Flow Analysis
Comprehensive Summary
Experimental Setup
Architecture
- Network: 10 hidden layers × 64 neurons each
- Task: 1D non-linear regression (sine wave approximation)
- Dataset: y = sin(x) + ε, where x ∈ [-π, π] and ε ~ N(0, 0.1)
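As an illustration of the dataset definition, here is a minimal standard-library sketch. It is not taken from `train.py`: the function name `make_sine_dataset`, the evenly spaced x grid, and the reading of 0.1 as a standard deviation are all assumptions. The sample count (200) and seed (42) follow the training configuration listed in this README.

```python
import math
import random


def make_sine_dataset(n_samples=200, noise_std=0.1, seed=42):
    """Noisy sine samples: y = sin(x) + eps, with x in [-pi, pi]."""
    rng = random.Random(seed)  # local generator, so global state is untouched
    # Assumption: x values are evenly spaced over [-pi, pi].
    xs = [-math.pi + 2.0 * math.pi * i / (n_samples - 1) for i in range(n_samples)]
    # Assumption: 0.1 is the standard deviation of the Gaussian noise.
    ys = [math.sin(x) + rng.gauss(0.0, noise_std) for x in xs]
    return xs, ys


xs, ys = make_sine_dataset()
print(len(xs), min(xs), max(xs))
```

If 0.1 is indeed a standard deviation, the noise variance is 0.01, which puts the final MSE of 0.008 reported for ReLU, Leaky ReLU, and GELU at about the noise level.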
Training Configuration
optimizer = Adam(lr=0.001)
loss_fn = MSELoss()
epochs = 500
batch_size = 200  # full batch (all 200 samples)
seed = 42
Activation Functions Tested
| Function | Formula | Gradient Range |
|---|---|---|
| Linear | f(x) = x | Always 1 |
| Sigmoid | f(x) = 1/(1+e⁻ˣ) | (0, 0.25] |
| ReLU | f(x) = max(0, x) | {0, 1} |
| Leaky ReLU | f(x) = max(0.01x, x) | {0.01, 1} |
| GELU | f(x) = x·Φ(x) | Smooth, ≈ [-0.13, 1.13] |
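The formulas in the table can be written out directly. The sketch below uses only the Python standard library and is independent of the repository's PyTorch code; the dictionary name `ACTIVATIONS` and the helper names are illustrative. GELU uses the exact normal CDF Φ via `math.erf`, and the derivative of the ReLU variants at x = 0 is set to the left-hand value by convention.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def std_normal_cdf(x):
    """Phi(x), the standard normal CDF used by GELU."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def std_normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


# name -> (function, derivative)
ACTIVATIONS = {
    "linear": (lambda x: x, lambda x: 1.0),
    "sigmoid": (sigmoid, lambda x: sigmoid(x) * (1.0 - sigmoid(x))),
    "relu": (lambda x: max(0.0, x), lambda x: 1.0 if x > 0 else 0.0),
    "leaky_relu": (lambda x: x if x > 0 else 0.01 * x,
                   lambda x: 1.0 if x > 0 else 0.01),
    # d/dx [x * Phi(x)] = Phi(x) + x * phi(x)
    "gelu": (lambda x: x * std_normal_cdf(x),
             lambda x: std_normal_cdf(x) + x * std_normal_pdf(x)),
}

for name, (f, df) in ACTIVATIONS.items():
    print(f"{name:>10}: f(1.0) = {f(1.0):.4f}, f'(1.0) = {df(1.0):.4f}")
```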
Quick Start
Installation
git clone https://huggingface.co/AmberLJC/activation_functions
cd activation_functions
pip install torch numpy matplotlib
Run the Experiment
# Basic 5-activation comparison
python train.py
# Extended tutorial with 8 activations and 4 experiments
python tutorial_experiments.py
# Training dynamics analysis
python train_dynamics.py
Repository Structure
activation_functions/
├── README.md                             # This file
├── report.md                             # Detailed analysis report
├── activation_tutorial.md                # Educational tutorial
│
├── train.py                              # Main experiment (5 activations)
├── tutorial_experiments.py               # Extended experiments (8 activations)
├── train_dynamics.py                     # Training dynamics analysis
│
├── learned_functions.png                 # Predictions vs ground truth
├── loss_curves.png                       # Training loss over epochs
├── gradient_flow.png                     # Gradient magnitude per layer
├── hidden_activations.png                # Activation patterns
├── summary_figure.png                    # 9-panel comprehensive summary
│
├── exp1_gradient_flow.png                # Extended gradient analysis
├── exp2_activation_distributions.png     # Activation distribution analysis
├── exp2_sparsity_dead_neurons.png        # Sparsity and dead neuron analysis
├── exp3_stability.png                    # Training stability analysis
├── exp4_predictions.png                  # Function approximation comparison
├── exp4_representational_heatmap.png     # Representational capacity heatmap
│
├── activation_evolution.png              # Activation evolution during training
├── gradient_evolution.png                # Gradient evolution during training
├── training_dynamics_functions.png       # Training dynamics visualization
├── training_dynamics_summary.png         # Training dynamics summary
│
├── loss_histories.json                   # Raw loss data
├── gradient_magnitudes.json              # Gradient measurements
├── gradient_magnitudes_epochs.json       # Gradient evolution data
├── exp1_gradient_flow.json               # Extended gradient data
└── final_losses.json                     # Final MSE per activation
Key Insights
Why Sigmoid Fails in Deep Networks
The vanishing gradient problem occurs because:
- Sigmoid derivative is bounded: max(σ'(x)) = 0.25 at x=0
- Chain rule multiplies gradients: across 10 sigmoid layers, the activation derivatives alone contribute a factor of at most (0.25)¹⁰ ≈ 10⁻⁶
- Early layers don't learn: the gradient signal vanishes before reaching the layers closest to the input
# Theoretical upper bound on gradient decay through 10 sigmoid layers
gradient_output = 1.0
gradient_layer_10 = gradient_output * 0.25 ** 10
# ≈ gradient_output * 0.000001, effectively zero
Why ReLU Works
ReLU maintains unit gradient for positive inputs:
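# ReLU gradient: 1 for positive inputs, 0 otherwise
def relu_grad(x):
    return 1.0 if x > 0 else 0.0
# No multiplicative decay along paths where every unit is active
gradient_output = 1.0
gradient_layer_10 = gradient_output * 1.0 ** 10  # equals gradient_output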
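To see the two behaviours side by side, the following toy sketch backpropagates through a chain of scalar layers, h_{k+1} = act(w · h_k). This is a deliberately simplified stand-in, not the repository's 64-unit network: the weight is fixed to `w = 1.0`, the input to `x = 0.5`, and the function name `chain_gradients` is hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)


def relu(z):
    return max(0.0, z)


def relu_grad(z):
    return 1.0 if z > 0 else 0.0


def chain_gradients(act, act_grad, depth=10, w=1.0, x=0.5):
    """|d output / d input-of-layer-k| for a scalar chain h_{k+1} = act(w * h_k)."""
    # Forward pass: remember every pre-activation.
    h = x
    pre_activations = []
    for _ in range(depth):
        z = w * h
        pre_activations.append(z)
        h = act(z)
    # Backward pass: multiply local derivatives from the output towards the input.
    grad = 1.0
    grads = []
    for z in reversed(pre_activations):
        grad *= w * act_grad(z)
        grads.append(abs(grad))
    grads.reverse()  # grads[0] now belongs to the first layer (closest to the input)
    return grads


sigmoid_chain = chain_gradients(sigmoid, sigmoid_grad)
relu_chain = chain_gradients(relu, relu_grad)

print(f"Sigmoid, first layer: {sigmoid_chain[0]:.2e}")
print(f"ReLU,    first layer: {relu_chain[0]:.2e}")
```

In this toy setting the sigmoid gradient at the first layer falls below 10⁻⁶, while the ReLU gradient stays at 1 because every unit in the chain is active. With learned weights and 64-unit layers the numbers differ, but the multiplicative mechanism is the same.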
Practical Recommendations
| Use Case | Recommended |
|---|---|
| Default choice | ReLU or Leaky ReLU |
| Transformers/LLMs | GELU |
| Very deep networks | Leaky ReLU + skip connections |
| Output (classification) | Sigmoid/Softmax |
| Output (regression) | Linear |
Extended Experiments
The tutorial_experiments.py script includes 4 additional experiments:
- Gradient Flow Analysis - Depths 5, 10, 20, 50 layers
- Activation Distributions - Sparsity and dead neuron analysis
- Training Stability - Learning rate and depth sensitivity
- Representational Capacity - Multiple target function approximation
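For intuition about the depth sweep in the first experiment, the sigmoid bound of 0.25 per layer can be evaluated at the same depths. This is a back-of-the-envelope sketch that ignores weight matrices and is not the analysis performed by `tutorial_experiments.py`; the names below are illustrative.

```python
# Upper bound on the product of sigmoid derivatives across `depth` layers.
MAX_SIGMOID_GRAD = 0.25
depths = [5, 10, 20, 50]

bounds = {depth: MAX_SIGMOID_GRAD ** depth for depth in depths}

for depth, bound in bounds.items():
    print(f"depth {depth:>2}: activation-derivative product <= {bound:.2e}")
```

The bound shrinks geometrically with depth, which is why depth alone makes sigmoid networks progressively harder to train unless the architecture or initialization compensates.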
References
- Deep Learning Book - Chapter 6.3: Hidden Units
- Glorot & Bengio (2010): Understanding the difficulty of training deep feedforward neural networks
- He et al. (2015): Delving Deep into Rectifiers
- Hendrycks & Gimpel (2016): GELU
Citation
@misc{activation_functions_analysis,
title={Activation Functions: Deep Neural Network Analysis},
author={Orchestra Research},
year={2024},
publisher={HuggingFace},
url={https://huggingface.co/AmberLJC/activation_functions}
}
License
MIT License - feel free to use for education and research!
Generated by Orchestra Research Assistant



