# Dimensionality Reduction: Comprehensive Implementation and Analysis

[Python](https://www.python.org/downloads/) · [MIT License](https://opensource.org/licenses/MIT) · [GitHub](https://github.com/GruheshKurra/dimensionality-reduction) · [Hugging Face](https://huggingface.co/karthik-2905/dimensionality-reduction)

A comprehensive implementation and analysis of dimensionality reduction techniques, including PCA, t-SNE, UMAP, and autoencoders. This repository demonstrates the theory, implementation, and evaluation of these methods on standard datasets.
## 🎯 Overview

Dimensionality reduction is crucial in machine learning for:

- **Data Visualization**: Projecting high-dimensional data to 2D/3D for human interpretation
- **Computational Efficiency**: Reducing the feature space for faster processing
- **Noise Reduction**: Eliminating redundant or noisy features
- **Storage Optimization**: Compressing data while preserving essential information

This project provides a complete suite of dimensionality reduction methods with detailed explanations, implementations, and performance comparisons.
## 📊 Methods Implemented

### 1. Principal Component Analysis (PCA)

- **Type**: Linear dimensionality reduction
- **Key Feature**: Finds the directions of maximum variance
- **Best For**: Data with linear structure, feature compression
- **Results** (see the usage sketch below):
  - Iris: 97.5% accuracy retention with 2 components
  - Digits: 52.4% accuracy retention with 2 components
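As a point of reference, here is a minimal sketch of how the two-component PCA setup above can be reproduced with scikit-learn; the exact preprocessing in the notebook may differ.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Load the Iris dataset (150 samples, 4 features)
X, y = load_iris(return_X_y=True)

# Standardize features so each contributes equally to the variance
X_scaled = StandardScaler().fit_transform(X)

# Project onto the two directions of maximum variance
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

print(X_2d.shape)                     # (150, 2)
print(pca.explained_variance_ratio_)  # variance captured by each component
```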
### 2. t-SNE (t-Distributed Stochastic Neighbor Embedding)

- **Type**: Non-linear manifold learning
- **Key Feature**: Preserves local neighborhood structure
- **Best For**: Data visualization, clustering analysis
- **Results** (see the sketch below):
  - Iris: 105.0% accuracy retention
  - Digits: 100.4% accuracy retention
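A minimal t-SNE sketch on the Digits dataset; `perplexity=30` and the fixed seed are illustrative defaults rather than necessarily the notebook's settings. Note that scikit-learn's `TSNE` has no `transform` method, so unseen points cannot be embedded after fitting.

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

# Load the Digits dataset (1797 samples, 64 features)
X, y = load_digits(return_X_y=True)

# t-SNE preserves local neighborhoods; perplexity controls the
# effective number of neighbors considered around each point
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_2d = tsne.fit_transform(X)

print(X_2d.shape)  # (1797, 2)
```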
### 3. UMAP (Uniform Manifold Approximation and Projection)

- **Type**: Non-linear manifold learning
- **Key Feature**: Preserves both local and global structure
- **Best For**: Balanced visualization, scalable to large datasets
- **Results** (see the sketch below):
  - Iris: 102.5% accuracy retention
  - Digits: 99.2% accuracy retention
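A comparable UMAP sketch; the `n_neighbors` and `min_dist` values below are the library defaults, shown only to illustrate the local/global trade-off they control.

```python
import umap
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)

# n_neighbors balances local vs. global structure; min_dist controls
# how tightly points are packed in the embedding
reducer = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1, random_state=42)
X_2d = reducer.fit_transform(X)

# Unlike t-SNE, a fitted UMAP model can embed unseen data:
# X_new_2d = reducer.transform(X_new)
print(X_2d.shape)  # (1797, 2)
```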
### 4. Autoencoder (Neural Network)

- **Type**: Non-linear neural network approach
- **Key Feature**: Learns an optimal encoding through reconstruction
- **Best For**: Complex non-linear relationships, customizable architectures
- **Architecture**: Input → 128 → 64 → Encoding → 64 → 128 → Output (sketched below)
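A PyTorch sketch of the architecture above; the encoding dimension (2), ReLU activations, Adam optimizer, and training loop are illustrative assumptions, not the notebook's exact hyperparameters.

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """Mirrors the architecture above: Input -> 128 -> 64 -> Encoding -> 64 -> 128 -> Output."""

    def __init__(self, input_dim: int, encoding_dim: int = 2):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, encoding_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(encoding_dim, 64), nn.ReLU(),
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, input_dim),
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z)

# Minimal training loop on stand-in data (64 features, as in Digits)
model = Autoencoder(input_dim=64)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

X = torch.randn(256, 64)  # placeholder batch; substitute real data
for epoch in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(X), X)  # reconstruction loss against the input itself
    loss.backward()
    optimizer.step()

X_2d = model.encoder(X).detach()  # the learned low-dimensional codes
```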
## 🏗️ Project Structure

```
dimensionality-reduction/
├── implementation.ipynb           # Complete Jupyter notebook with theory and code
├── dimensionality_reduction.log   # Detailed execution logs
├── models/                        # Saved trained models
│   ├── pca_iris.pkl
│   ├── pca_digits.pkl
│   ├── umap_iris.pkl
│   ├── umap_digits.pkl
│   ├── autoencoder_iris.pth
│   └── autoencoder_digits.pth
├── results/                       # Analysis results
│   └── dimensionality_reduction_summary.json
├── visualizations/                # Generated plots and comparisons
│   ├── pca_explained_variance.png
│   ├── iris_comparison.png
│   └── digits_comparison.png
└── README.md                      # This file
```
## 🚀 Quick Start

### Prerequisites

```bash
pip install numpy pandas scikit-learn matplotlib seaborn plotly umap-learn torch torchvision
```

### Running the Analysis

1. **Clone the repository**:
   ```bash
   git clone https://github.com/GruheshKurra/dimensionality-reduction.git
   cd dimensionality-reduction
   ```
2. **Install the dependencies**:
   ```bash
   pip install -r requirements.txt
   ```
3. **Run the complete analysis**:
   ```bash
   jupyter notebook implementation.ipynb
   ```
   Or execute the main script:
   ```bash
   python main.py
   ```
## 📈 Results Summary

### Dataset Information

- **Iris Dataset**: 150 samples, 4 features, 3 classes
- **Digits Dataset**: 1797 samples, 64 features, 10 classes

### Performance Comparison (Accuracy Retention)

Accuracy retention is the classification accuracy obtained on the reduced representation, expressed as a percentage of the accuracy on the original features; values above 100% mean the classifier performed better after reduction.

| Method | Iris Dataset | Digits Dataset |
|--------|--------------|----------------|
| PCA    | 97.5%        | 52.4%          |
| t-SNE  | 105.0%       | 100.4%         |
| UMAP   | 102.5%       | 99.2%          |
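The retention figures above can be reproduced in spirit with a sketch like the following, which assumes a k-NN classifier and a default train/test split; the notebook's exact classifier and evaluation protocol may differ.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

def knn_accuracy(X_tr, X_te):
    """Fit a k-NN classifier and score it on the held-out set."""
    clf = KNeighborsClassifier().fit(X_tr, y_train)
    return accuracy_score(y_test, clf.predict(X_te))

# Baseline accuracy on the original 4 features
baseline = knn_accuracy(X_train, X_test)

# Accuracy after reducing to 2 principal components
pca = PCA(n_components=2).fit(X_train)
reduced = knn_accuracy(pca.transform(X_train), pca.transform(X_test))

print(f"Accuracy retention: {100 * reduced / baseline:.1f}%")
```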
### Key Insights

- **PCA** works well for low-dimensional data (Iris) but struggles with high-dimensional, complex patterns (Digits)
- **t-SNE** excels at preserving local structure, sometimes even improving classification performance
- **UMAP** provides an excellent balance between local and global structure preservation
- **Autoencoders** offer flexibility but require careful tuning
## 🔍 Detailed Analysis

### PCA Explained Variance

- **Iris**: The first 2 components explain 95.8% of the variance
- **Digits**: The first 2 components explain only 21.6% of the variance
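Figures like these come from scikit-learn's `explained_variance_ratio_`; a short snippet for inspecting the variance spectrum:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)

pca = PCA().fit(X)  # keep all components to inspect the full spectrum
cumulative = np.cumsum(pca.explained_variance_ratio_)

print(f"First 2 components: {cumulative[1]:.1%} of variance")
# Number of components needed to reach 95% of the variance
print(f"Components for 95%: {np.searchsorted(cumulative, 0.95) + 1}")
```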
### Method Characteristics

| Aspect              | PCA    | t-SNE      | UMAP       | Autoencoder |
|---------------------|--------|------------|------------|-------------|
| Linearity           | Linear | Non-linear | Non-linear | Non-linear  |
| Speed               | Fast   | Slow       | Medium     | Medium      |
| Deterministic       | Yes    | No         | Yes*       | Yes*        |
| Transforms new data | ✅     | ❌         | ✅         | ✅          |
| Interpretability    | High   | Low        | Medium     | Low         |

\*With a fixed random seed
## 📚 Educational Content

The `implementation.ipynb` notebook includes:

1. **Theory Explanation**: Mathematical foundations and intuitive explanations
2. **Step-by-Step Implementation**: Detailed code with comprehensive comments
3. **Visual Comparisons**: Side-by-side plots showing method differences
4. **Performance Evaluation**: Classification accuracy retention analysis
5. **Best Practices**: When to use each method and how to select parameters
## 🛠️ Technical Details

### Dependencies

- `numpy`: Numerical computing
- `pandas`: Data manipulation
- `scikit-learn`: Machine learning algorithms
- `matplotlib`, `seaborn`: Data visualization
- `umap-learn`: UMAP implementation
- `torch`: Neural network autoencoder
- `plotly`: Interactive visualizations

### Key Features

- **Comprehensive Logging**: Detailed execution logs for reproducibility
- **Model Persistence**: Save and load trained models (see the sketch below)
- **Evaluation Framework**: Systematic performance comparison
- **Visualization Suite**: Publication-quality plots
- **Structured Results**: JSON summary for further analysis
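A minimal persistence sketch matching the file layout under `models/`, assuming the fitted `pca` and autoencoder `model` objects from the earlier examples; the notebook may use `joblib` or another convention instead.

```python
import pickle
import torch

# Save / load a fitted scikit-learn or UMAP model (matches models/*.pkl)
with open("models/pca_iris.pkl", "wb") as f:
    pickle.dump(pca, f)
with open("models/pca_iris.pkl", "rb") as f:
    pca = pickle.load(f)

# Save / load autoencoder weights (matches models/*.pth)
torch.save(model.state_dict(), "models/autoencoder_iris.pth")
model.load_state_dict(torch.load("models/autoencoder_iris.pth"))
```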
## 🎓 Learning Outcomes

After working through this project, you will understand:

1. **Mathematical Foundations**: How each method works mathematically
2. **Implementation Details**: How to implement these methods from scratch
3. **Performance Trade-offs**: When to use each method
4. **Evaluation Strategies**: How to assess dimensionality reduction quality
5. **Practical Applications**: Real-world use cases and considerations
## 🤝 Contributing

Contributions are welcome! Please feel free to:

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests if applicable
5. Submit a pull request
## 📄 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## 🔗 Links

- **GitHub Repository**: [dimensionality-reduction](https://github.com/GruheshKurra/dimensionality-reduction)
- **Hugging Face Space**: [karthik-2905/dimensionality-reduction](https://huggingface.co/karthik-2905/dimensionality-reduction)
- **Documentation**: [Implementation Notebook](implementation.ipynb)
## 📧 Contact

For questions or feedback, please:

- Open an issue on GitHub
- Contact the maintainer: [Karthik](mailto:karthik@example.com)

---

**Note**: This is an educational project designed to demonstrate dimensionality reduction techniques. The implementations prioritize clarity and understanding over performance optimization.