| # 🧠 Activation Functions: Deep Neural Network Analysis | |
| > **Empirical evidence for the vanishing gradient problem and why modern activations (ReLU, GELU) dominate deep learning.** | |
| This repository provides a comprehensive comparison of 5 activation functions in deep neural networks, demonstrating the **vanishing gradient problem** with Sigmoid and why modern activations enable training of deep networks. | |
| --- | |
| ## 🎯 Key Findings | |
| | Activation | Final MSE | Gradient Ratio (L10/L1) | Status | | |
| |------------|-----------|-------------------------|--------| | |
| **ReLU** | **0.008** | 1.93 (stable) | Excellent | | |
| **Leaky ReLU** | **0.008** | 0.72 (stable) | Excellent | | |
| **GELU** | **0.008** | 0.83 (stable) | Excellent | | |
| Linear | 0.213 | 0.84 (stable) | Cannot learn non-linearity | | |
| Sigmoid | 0.518 | **2.59×10⁷** (vanishing!) | Failed | | |
| ### 🔬 The Vanishing Gradient Problem - Visualized | |
| ``` | |
| Sigmoid Network (10 layers): | |
| Layer 1 ████████████████████████████████████████ Gradient: 5.04×10⁻¹ | |
| Layer 5 ████████████ Gradient: 1.02×10⁻⁴ | |
| Layer 10 █ Gradient: 1.94×10⁻⁸ ← 26 MILLION times smaller! | |
| ReLU Network (10 layers): | |
| Layer 1 ████████████████████████████████████████ Gradient: 2.70×10⁻³ | |
| Layer 5 ██████████████████████████████████████ Gradient: 2.10×10⁻³ | |
| Layer 10 ████████████████████████████████████████ Gradient: 1.36×10⁻³ ← Healthy flow! | |
| ``` | |
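The "26 million" figure follows directly from the per-layer magnitudes in the diagram above. A quick check in plain Python, using only the rounded values shown there (the dictionary and function names are illustrative, not taken from the repository):

```python
# Per-layer gradient magnitudes copied from the diagram above (rounded values).
sigmoid = {1: 5.04e-1, 5: 1.02e-4, 10: 1.94e-8}
relu = {1: 2.70e-3, 5: 2.10e-3, 10: 1.36e-3}


def spread(grads):
    """Ratio of the largest to the smallest per-layer gradient magnitude."""
    return max(grads.values()) / min(grads.values())


sigmoid_spread = spread(sigmoid)
relu_spread = spread(relu)

print(f"Sigmoid: largest / smallest = {sigmoid_spread:.2e}")
print(f"ReLU:    largest / smallest = {relu_spread:.2f}")
```

With these rounded inputs the Sigmoid spread comes out near 2.6×10⁷, while the ReLU spread stays close to 2.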
| --- | |
| ## Visual Results | |
| ### Learned Functions | |
|  | |
| *ReLU, Leaky ReLU, and GELU closely approximate the sine wave (final MSE 0.008). Linear fits only a straight line, and Sigmoid fails to learn the target.* | |
| ### Training Dynamics | |
|  | |
| ### Gradient Flow Analysis | |
|  | |
| ### Comprehensive Summary | |
|  | |
| --- | |
| ## 🧪 Experimental Setup | |
| ### Architecture | |
| - **Network**: 10 hidden layers × 64 neurons each | |
| - **Task**: 1D non-linear regression (sine wave approximation) | |
| - **Dataset**: `y = sin(x) + ε`, where `x ∈ [-π, π]` and `ε ~ N(0, 0.1)` | |
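A dataset matching this description can be generated in a few lines. This is a sketch in plain Python, not the repository's `train.py`: it assumes 200 evenly spaced inputs (the sample count comes from the training configuration) and reads the 0.1 in `N(0, 0.1)` as the standard deviation of the noise. `make_dataset` is a hypothetical helper name.

```python
import math
import random


def make_dataset(n_samples=200, noise_std=0.1, seed=42):
    """Return (xs, ys) with y = sin(x) + noise for x evenly spaced on [-pi, pi]."""
    rng = random.Random(seed)
    step = 2 * math.pi / (n_samples - 1)
    xs = [-math.pi + i * step for i in range(n_samples)]
    # Assumption: 0.1 is the noise standard deviation, not the variance.
    ys = [math.sin(x) + rng.gauss(0.0, noise_std) for x in xs]
    return xs, ys


xs, ys = make_dataset()
print(len(xs), round(xs[0], 4), round(xs[-1], 4))
```

If the 0.1 is meant as a variance instead, pass `noise_std=math.sqrt(0.1)`.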
| ### Training Configuration | |
| ```python | |
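| import torch | |
| torch.manual_seed(42)  # seed = 42 | |
| # `model` is the 10-layer network described under Architecture | |
| optimizer = torch.optim.Adam(model.parameters(), lr=0.001) | |
| loss_fn = torch.nn.MSELoss() | |
| epochs = 500 | |
| batch_size = 200  # full batch: all 200 samples in every step | |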
| ``` | |
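Put together, the architecture and configuration above might look like the following PyTorch sketch. It is an illustration under stated assumptions rather than the code in `train.py`: the helper names (`build_mlp`, `train`, `layer_gradient_magnitudes`) are hypothetical, the layers use PyTorch's default initialization, the noise term treats 0.1 as a standard deviation, and "gradient magnitude" is taken to mean the mean absolute weight gradient of each linear layer.

```python
import math

import torch
from torch import nn


def build_mlp(activation, depth=10, width=64):
    """Stack `depth` hidden Linear + activation pairs, then a linear output layer."""
    layers, in_features = [], 1
    for _ in range(depth):
        layers += [nn.Linear(in_features, width), activation()]
        in_features = width
    layers.append(nn.Linear(in_features, 1))
    return nn.Sequential(*layers)


def train(activation, epochs=500, lr=0.001, seed=42):
    """Full-batch training on noisy sine data; returns (model, last training loss)."""
    torch.manual_seed(seed)
    x = torch.linspace(-math.pi, math.pi, 200).unsqueeze(1)
    y = torch.sin(x) + 0.1 * torch.randn_like(x)  # assumption: 0.1 is the noise std
    model = build_mlp(activation)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
    return model, loss.item()


def layer_gradient_magnitudes(model):
    """Mean absolute weight gradient of each Linear layer, input side first."""
    return [
        m.weight.grad.abs().mean().item()
        for m in model
        if isinstance(m, nn.Linear)
    ]


if __name__ == "__main__":
    for name, act in [("ReLU", nn.ReLU), ("Sigmoid", nn.Sigmoid)]:
        model, last_loss = train(act, epochs=50)
        mags = layer_gradient_magnitudes(model)
        print(f"{name}: loss = {last_loss:.4f}, "
              f"first/last layer gradient = {mags[0]:.2e} / {mags[-1]:.2e}")
```

The demo loop uses 50 epochs to keep the run short; pass `epochs=500` to match the configuration above. The exact numbers depend on initialization and will not necessarily reproduce the reported results.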
| ### Activation Functions Tested | |
| | Function | Formula | Gradient Range | | |
| |----------|---------|----------------| | |
| | Linear | `f(x) = x` | Always 1 | | |
| Sigmoid | `f(x) = 1/(1+e⁻ˣ)` | (0, 0.25] | | |
| | ReLU | `f(x) = max(0, x)` | {0, 1} | | |
| | Leaky ReLU | `f(x) = max(0.01x, x)` | {0.01, 1} | | |
| GELU | `f(x) = x·Φ(x)` | Smooth, ≈ [-0.13, 1.13] | | |
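The formulas in the table translate directly into code. The sketch below uses only the standard library and pairs each function with its derivative, so the gradient column can be checked numerically. The `ACTIVATIONS` dictionary is an illustrative structure, GELU uses the exact form `x·Φ(x)` via `math.erf`, and the derivative of ReLU and Leaky ReLU at exactly `x = 0` is set to the negative-side value by convention.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def std_normal_cdf(x):
    """Phi(x), the standard normal CDF, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def std_normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


# Each entry maps a name to (function, derivative).
ACTIVATIONS = {
    "linear": (lambda x: x, lambda x: 1.0),
    "sigmoid": (sigmoid, lambda x: sigmoid(x) * (1.0 - sigmoid(x))),
    "relu": (lambda x: max(0.0, x), lambda x: 1.0 if x > 0 else 0.0),
    "leaky_relu": (lambda x: max(0.01 * x, x), lambda x: 1.0 if x > 0 else 0.01),
    "gelu": (
        lambda x: x * std_normal_cdf(x),
        # Product rule: d/dx [x * Phi(x)] = Phi(x) + x * phi(x)
        lambda x: std_normal_cdf(x) + x * std_normal_pdf(x),
    ),
}

for name, (f, df) in ACTIVATIONS.items():
    print(f"{name:>10}: f(1) = {f(1.0):.4f}, f'(1) = {df(1.0):.4f}, f'(-1) = {df(-1.0):.4f}")
```

The Sigmoid derivative peaks at 0.25 when `x = 0`, which is the bound used in the vanishing-gradient argument.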
| --- | |
| ## Quick Start | |
| ### Installation | |
| ```bash | |
| git clone https://huggingface.co/AmberLJC/activation_functions | |
| cd activation_functions | |
| pip install torch numpy matplotlib | |
| ``` | |
| ### Run the Experiment | |
| ```bash | |
| # Basic 5-activation comparison | |
| python train.py | |
| # Extended tutorial with 8 activations and 4 experiments | |
| python tutorial_experiments.py | |
| # Training dynamics analysis | |
| python train_dynamics.py | |
| ``` | |
| --- | |
| ## Repository Structure | |
| ``` | |
| activation_functions/ | |
| ├── README.md # This file | |
| ├── report.md # Detailed analysis report | |
| ├── activation_tutorial.md # Educational tutorial | |
| │ | |
| ├── train.py # Main experiment (5 activations) | |
| ├── tutorial_experiments.py # Extended experiments (8 activations) | |
| ├── train_dynamics.py # Training dynamics analysis | |
| │ | |
| ├── learned_functions.png # Predictions vs ground truth | |
| ├── loss_curves.png # Training loss over epochs | |
| ├── gradient_flow.png # Gradient magnitude per layer | |
| ├── hidden_activations.png # Activation patterns | |
| ├── summary_figure.png # 9-panel comprehensive summary | |
| │ | |
| ├── exp1_gradient_flow.png # Extended gradient analysis | |
| ├── exp2_activation_distributions.png # Activation distribution analysis | |
| ├── exp2_sparsity_dead_neurons.png # Sparsity and dead neuron analysis | |
| ├── exp3_stability.png # Training stability analysis | |
| ├── exp4_predictions.png # Function approximation comparison | |
| ├── exp4_representational_heatmap.png # Representational capacity heatmap | |
| │ | |
| ├── activation_evolution.png # Activation evolution during training | |
| ├── gradient_evolution.png # Gradient evolution during training | |
| ├── training_dynamics_functions.png # Training dynamics visualization | |
| ├── training_dynamics_summary.png # Training dynamics summary | |
| │ | |
| ├── loss_histories.json # Raw loss data | |
| ├── gradient_magnitudes.json # Gradient measurements | |
| ├── gradient_magnitudes_epochs.json # Gradient evolution data | |
| ├── exp1_gradient_flow.json # Extended gradient data | |
| └── final_losses.json # Final MSE per activation | |
| ``` | |
| --- | |
| ## Key Insights | |
| ### Why Sigmoid Fails in Deep Networks | |
| The **vanishing gradient problem** occurs because: | |
| 1. **Sigmoid derivative is bounded**: max σ'(x) = 0.25, attained at x = 0 | |
| 2. **Chain rule multiplies gradients**: Across 10 layers, the product of sigmoid derivatives is at most (0.25)¹⁰ ≈ 10⁻⁶ | |
| 3. **Early layers don't learn**: The gradient signal vanishes before it reaches the layers nearest the input | |
| ```python | |
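| # Theoretical upper bound on gradient decay for Sigmoid (weights ignored) | |
| gradient_output = 1.0  # illustrative gradient at the output | |
| gradient_layer_10 = gradient_output * 0.25 ** 10 | |
| print(gradient_layer_10)  # about 9.5e-07, effectively zero | |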
| ``` | |
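The 0.25 bound is the best case: away from `x = 0` the Sigmoid derivative is smaller still. The sketch below compares the bound with the product of derivatives along a single path for several depths. It is a simplified model, not the repository's experiment: it ignores the weight matrices, which also enter the chain rule, and it assumes pre-activations drawn from a standard normal distribution.

```python
import math
import random


def sigmoid_grad(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)


def derivative_product(depth, rng):
    """Product of sigmoid derivatives along one path with random pre-activations."""
    product = 1.0
    for _ in range(depth):
        # Assumption: pre-activations ~ N(0, 1); weights are ignored.
        product *= sigmoid_grad(rng.gauss(0.0, 1.0))
    return product


rng = random.Random(42)
for depth in (5, 10, 20, 50):
    bound = 0.25 ** depth
    sample = derivative_product(depth, rng)
    print(f"depth {depth:>2}: bound = {bound:.2e}, sampled product = {sample:.2e}")
```

Both columns shrink geometrically with depth, and the sampled product never exceeds the bound.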
| ### Why ReLU Works | |
| ReLU maintains **unit gradient** for positive inputs: | |
| ```python | |
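| # ReLU gradient: 1 for positive inputs, 0 otherwise | |
| def relu_grad(x): | |
|     return 1.0 if x > 0 else 0.0 | |
| # No multiplicative decay along a path of active units | |
| gradient_output = 1.0  # illustrative gradient at the output | |
| gradient_layer_10 = gradient_output * relu_grad(1.0) ** 10  # equals gradient_output | |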
| ``` | |
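The unit-gradient argument holds only along paths where every unit is active. A single inactive ReLU zeroes the whole path, which is the mechanism behind "dead" neurons, whereas Leaky ReLU lets a small signal through. A minimal illustration, again ignoring the weight matrices and using made-up pre-activation values:

```python
def relu_grad(z):
    return 1.0 if z > 0 else 0.0


def leaky_relu_grad(z, slope=0.01):
    return 1.0 if z > 0 else slope


def path_gradient(pre_activations, grad_fn):
    """Multiply activation derivatives along one path (weights ignored)."""
    product = 1.0
    for z in pre_activations:
        product *= grad_fn(z)
    return product


all_active = [0.5] * 10              # every unit on the path has a positive input
one_inactive = [0.5] * 9 + [-0.5]    # the last unit has a negative input

print(path_gradient(all_active, relu_grad))          # 1.0
print(path_gradient(one_inactive, relu_grad))        # 0.0
print(path_gradient(one_inactive, leaky_relu_grad))  # 0.01
```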
| ### Practical Recommendations | |
| | Use Case | Recommended | | |
| |----------|-------------| | |
| | Default choice | ReLU or Leaky ReLU | | |
| | Transformers/LLMs | GELU | | |
| | Very deep networks | Leaky ReLU + skip connections | | |
| | Output (classification) | Sigmoid/Softmax | | |
| | Output (regression) | Linear | | |
| --- | |
| ## Extended Experiments | |
| The `tutorial_experiments.py` script includes 4 additional experiments: | |
| 1. **Gradient Flow Analysis** - Depths 5, 10, 20, 50 layers | |
| 2. **Activation Distributions** - Sparsity and dead neuron analysis | |
| 3. **Training Stability** - Learning rate and depth sensitivity | |
| 4. **Representational Capacity** - Multiple target function approximation | |
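As an illustration of what the sparsity and dead-neuron analysis in experiment 2 could measure, the sketch below computes both quantities from a matrix of hidden activations (rows are samples, columns are units). The definitions are assumptions, not necessarily those used in `tutorial_experiments.py`: sparsity is the fraction of exactly-zero activations, and a unit counts as dead if it outputs zero for every sample.

```python
def sparsity(activations):
    """Fraction of activation values that are exactly zero."""
    values = [a for row in activations for a in row]
    return sum(1 for a in values if a == 0.0) / len(values)


def dead_neuron_fraction(activations):
    """Fraction of units (columns) that output zero for every sample (row)."""
    n_units = len(activations[0])
    dead = sum(
        1 for j in range(n_units) if all(row[j] == 0.0 for row in activations)
    )
    return dead / n_units


# Toy post-ReLU activations: 3 samples (rows) x 4 units (columns).
acts = [
    [0.0, 1.2, 0.0, 0.3],
    [0.0, 0.0, 0.0, 0.7],
    [0.0, 0.4, 0.0, 0.0],
]
print(sparsity(acts))              # 8 of the 12 values are zero
print(dead_neuron_fraction(acts))  # units 0 and 2 never fire: 0.5
```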
| --- | |
| ## References | |
| - [Deep Learning Book - Chapter 6.3: Hidden Units](https://www.deeplearningbook.org/) | |
| - [Glorot & Bengio (2010): Understanding the difficulty of training deep feedforward neural networks](http://proceedings.mlr.press/v9/glorot10a.html) | |
| - [He et al. (2015): Delving Deep into Rectifiers](https://arxiv.org/abs/1502.01852) | |
| - [Hendrycks & Gimpel (2016): GELU](https://arxiv.org/abs/1606.08415) | |
| --- | |
| ## Citation | |
| ```bibtex | |
| @misc{activation_functions_analysis, | |
| title={Activation Functions: Deep Neural Network Analysis}, | |
| author={Orchestra Research}, | |
| year={2024}, | |
| publisher={HuggingFace}, | |
| url={https://huggingface.co/AmberLJC/activation_functions} | |
| } | |
| ``` | |
| --- | |
| ## License | |
| MIT License - feel free to use for education and research! | |
| --- | |
| *Generated by Orchestra Research Assistant* | |