# 🧠 Activation Functions: Deep Neural Network Analysis

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
[![PyTorch](https://img.shields.io/badge/PyTorch-2.0+-ee4c2c.svg)](https://pytorch.org/)

> **Empirical evidence for the vanishing gradient problem and why modern activations (ReLU, GELU) dominate deep learning.**

This repository provides a comprehensive comparison of 5 activation functions in deep neural networks, demonstrating the **vanishing gradient problem** with Sigmoid and showing why modern activations make deep networks trainable.

---

## 🎯 Key Findings

| Activation | Final MSE | Gradient Ratio (L1/L10) | Status |
|------------|-----------|-------------------------|--------|
| **ReLU** | **0.008** | 1.93 (stable) | ✅ Excellent |
| **Leaky ReLU** | **0.008** | 0.72 (stable) | ✅ Excellent |
| **GELU** | **0.008** | 0.83 (stable) | ✅ Excellent |
| Linear | 0.213 | 0.84 (stable) | ⚠️ Cannot learn non-linearity |
| Sigmoid | 0.518 | **2.59×10⁷** (vanishing!) | ❌ Failed |

### 🔬 The Vanishing Gradient Problem - Visualized

```
Sigmoid Network (10 layers):
Layer 1  ████████████████████████████████████████  Gradient: 5.04×10⁻¹
Layer 5  ████████████                              Gradient: 1.02×10⁻⁴
Layer 10 ■                                         Gradient: 1.94×10⁻⁸  ← 26 MILLION times smaller!

ReLU Network (10 layers):
Layer 1  ████████████████████████████████████████  Gradient: 2.70×10⁻³
Layer 5  ██████████████████████████████████████    Gradient: 2.10×10⁻³
Layer 10 ████████████████████████████████████████  Gradient: 1.36×10⁻³  ← Healthy flow!
```
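The bar chart above comes from the repository's measurements. As a sanity check, per-layer gradient magnitudes can be probed with a few lines of PyTorch. The sketch below is a minimal, self-contained version (the `make_mlp` helper is illustrative, not the repository's actual API): it builds a 10-layer MLP, backpropagates a single MSE loss, and prints the mean absolute weight gradient at each depth.

```python
import torch
import torch.nn as nn

torch.manual_seed(42)

def make_mlp(act_cls, depth=10, width=64):
    """Hypothetical helper: `depth` hidden layers of `width` units for 1D regression."""
    layers, in_dim = [], 1
    for _ in range(depth):
        layers += [nn.Linear(in_dim, width), act_cls()]
        in_dim = width
    layers.append(nn.Linear(width, 1))
    return nn.Sequential(*layers)

# Same data as the experiment: y = sin(x) + noise on [-pi, pi]
x = torch.linspace(-torch.pi, torch.pi, 200).unsqueeze(1)
y = torch.sin(x) + 0.1 * torch.randn_like(x)

for name, act_cls in [("Sigmoid", nn.Sigmoid), ("ReLU", nn.ReLU)]:
    model = make_mlp(act_cls)
    nn.functional.mse_loss(model(x), y).backward()
    per_layer = [m.weight.grad.abs().mean().item()
                 for m in model if isinstance(m, nn.Linear)]
    print(f"{name:8s}", " ".join(f"{g:.1e}" for g in per_layer))
```

Exact values depend on initialization and will not match the table, but the qualitative pattern should reproduce: Sigmoid gradients collapse by orders of magnitude across depth, while ReLU gradients stay within a small constant factor.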
---

## 📊 Visual Results

### Learned Functions

![Learned Functions](learned_functions.png)

*ReLU, Leaky ReLU, and GELU perfectly approximate the sine wave. Linear learns only a straight line. Sigmoid completely fails to learn.*

### Training Dynamics

![Loss Curves](loss_curves.png)

### Gradient Flow Analysis

![Gradient Flow](gradient_flow.png)

### Comprehensive Summary

![Summary](summary_figure.png)

---

## 🧪 Experimental Setup

### Architecture

- **Network**: 10 hidden layers × 64 neurons each
- **Task**: 1D non-linear regression (sine-wave approximation)
- **Dataset**: `y = sin(x) + ε`, where `x ∈ [-π, π]` and `ε ~ N(0, 0.1)`

### Training Configuration

```python
optimizer = Adam(lr=0.001)
loss_fn = MSELoss()
epochs = 500
batch_size = 200  # full batch
seed = 42
```
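Putting the architecture and configuration together, a minimal end-to-end run might look like the sketch below. This is an illustrative reconstruction, not the repository's `train.py`; the `build_model` helper is hypothetical but mirrors the stated architecture.

```python
import torch
import torch.nn as nn

torch.manual_seed(42)

# Dataset: y = sin(x) + eps, with x in [-pi, pi] and eps ~ N(0, 0.1)
x = torch.linspace(-torch.pi, torch.pi, 200).unsqueeze(1)
y = torch.sin(x) + 0.1 * torch.randn_like(x)

def build_model(activation):
    """Hypothetical builder: 10 hidden layers x 64 neurons each."""
    layers, in_dim = [], 1
    for _ in range(10):
        layers += [nn.Linear(in_dim, 64), activation]
        in_dim = 64
    layers.append(nn.Linear(64, 1))
    return nn.Sequential(*layers)

model = build_model(nn.ReLU())
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Full-batch training for 500 epochs
for epoch in range(500):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()

print(f"Final MSE: {loss.item():.3f}")
```

Swapping `nn.ReLU()` for `nn.Sigmoid()`, `nn.LeakyReLU()`, `nn.GELU()`, or `nn.Identity()` covers the five activations compared above.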
### Activation Functions Tested

| Function | Formula | Gradient Range |
|----------|---------|----------------|
| Linear | `f(x) = x` | Always 1 |
| Sigmoid | `f(x) = 1/(1+e⁻ˣ)` | (0, 0.25] |
| ReLU | `f(x) = max(0, x)` | {0, 1} |
| Leaky ReLU | `f(x) = max(0.01x, x)` | {0.01, 1} |
| GELU | `f(x) = x·Φ(x)` | Smooth, ≈ [-0.13, 1.13] |

---

## 🚀 Quick Start

### Installation

```bash
git clone https://huggingface.co/AmberLJC/activation_functions
cd activation_functions
pip install torch numpy matplotlib
```

### Run the Experiment

```bash
# Basic 5-activation comparison
python train.py

# Extended tutorial with 8 activations and 4 experiments
python tutorial_experiments.py

# Training dynamics analysis
python train_dynamics.py
```

---

## 📁 Repository Structure

```
activation_functions/
├── README.md                          # This file
├── report.md                          # Detailed analysis report
├── activation_tutorial.md             # Educational tutorial
│
├── train.py                           # Main experiment (5 activations)
├── tutorial_experiments.py            # Extended experiments (8 activations)
├── train_dynamics.py                  # Training dynamics analysis
│
├── learned_functions.png              # Predictions vs ground truth
├── loss_curves.png                    # Training loss over epochs
├── gradient_flow.png                  # Gradient magnitude per layer
├── hidden_activations.png             # Activation patterns
├── summary_figure.png                 # 9-panel comprehensive summary
│
├── exp1_gradient_flow.png             # Extended gradient analysis
├── exp2_activation_distributions.png  # Activation distribution analysis
├── exp2_sparsity_dead_neurons.png     # Sparsity and dead-neuron analysis
├── exp3_stability.png                 # Training stability analysis
├── exp4_predictions.png               # Function approximation comparison
├── exp4_representational_heatmap.png  # Representational capacity heatmap
│
├── activation_evolution.png           # Activation evolution during training
├── gradient_evolution.png             # Gradient evolution during training
├── training_dynamics_functions.png    # Training dynamics visualization
├── training_dynamics_summary.png      # Training dynamics summary
│
├── loss_histories.json                # Raw loss data
├── gradient_magnitudes.json           # Gradient measurements
├── gradient_magnitudes_epochs.json    # Gradient evolution data
├── exp1_gradient_flow.json            # Extended gradient data
└── final_losses.json                  # Final MSE per activation
```

---

## 📖 Key Insights

### Why Sigmoid Fails in Deep Networks

The **vanishing gradient problem** occurs because:

1. **The Sigmoid derivative is bounded**: max σ'(x) = 0.25, attained at x = 0
2. **The chain rule multiplies gradients**: across 10 layers, gradient ≈ (0.25)¹⁰ ≈ 10⁻⁶
3. **Layers far from the output don't learn**: the gradient signal vanishes before reaching them

```python
# Theoretical gradient decay for Sigmoid: each of the 10 layers
# scales the gradient by at most max(sigmoid') = 0.25
gradient_layer_10 = gradient_output * 0.25 ** 10
# 0.25 ** 10 ≈ 9.5e-7, so the surviving gradient is effectively zero
```

### Why ReLU Works

ReLU maintains a **unit gradient** for positive inputs:

```python
# ReLU gradient: 1 for positive inputs, 0 otherwise
def relu_grad(x):
    return 1.0 if x > 0 else 0.0

# No multiplicative decay: through 10 active layers,
# gradient_layer_10 ≈ gradient_output * 1 ** 10 == gradient_output
```

### Practical Recommendations

| Use Case | Recommended |
|----------|-------------|
| Default choice | ReLU or Leaky ReLU |
| Transformers/LLMs | GELU |
| Very deep networks | Leaky ReLU + skip connections |
| Output (classification) | Sigmoid/Softmax |
| Output (regression) | Linear |

---

## 📚 Extended Experiments

The `tutorial_experiments.py` script includes 4 additional experiments:

1. **Gradient Flow Analysis** - depths of 5, 10, 20, and 50 layers
2. **Activation Distributions** - sparsity and dead-neuron analysis
3. **Training Stability** - learning-rate and depth sensitivity
4. **Representational Capacity** - approximation of multiple target functions

---

## 🔗 References

- [Deep Learning Book, Chapter 6.3: Hidden Units](https://www.deeplearningbook.org/)
- [Glorot & Bengio (2010): Understanding the difficulty of training deep feedforward neural networks](http://proceedings.mlr.press/v9/glorot10a.html)
- [He et al. (2015): Delving Deep into Rectifiers](https://arxiv.org/abs/1502.01852)
- [Hendrycks & Gimpel (2016): Gaussian Error Linear Units (GELU)](https://arxiv.org/abs/1606.08415)

---

## 📄 Citation

```bibtex
@misc{activation_functions_analysis,
  title={Activation Functions: Deep Neural Network Analysis},
  author={Orchestra Research},
  year={2024},
  publisher={HuggingFace},
  url={https://huggingface.co/AmberLJC/activation_functions}
}
```

---

## 📜 License

MIT License. Feel free to use this work for education and research!

---

*Generated by Orchestra Research Assistant*