# Dimensionality Reduction: Comprehensive Implementation and Analysis

[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/) [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT) [![GitHub](https://img.shields.io/badge/github-dimensionality--reduction-black?logo=github)](https://github.com/GruheshKurra/dimensionality-reduction) [![Hugging Face](https://img.shields.io/badge/🤗-Hugging_Face-yellow)](https://huggingface.co/karthik-2905/dimensionality-reduction)

A comprehensive implementation and analysis of dimensionality reduction techniques including PCA, t-SNE, UMAP, and Autoencoders. This repository demonstrates the theory, implementation, and evaluation of these methods on standard datasets.

## 🎯 Overview

Dimensionality reduction is crucial in machine learning for:

- **Data Visualization**: Projecting high-dimensional data to 2D/3D for human interpretation
- **Computational Efficiency**: Reducing feature space for faster processing
- **Noise Reduction**: Eliminating redundant or noisy features
- **Storage Optimization**: Compressing data while preserving essential information

This project provides a complete suite of dimensionality reduction methods with detailed explanations, implementations, and performance comparisons.

## 📊 Methods Implemented

### 1. Principal Component Analysis (PCA)

- **Type**: Linear dimensionality reduction
- **Key Feature**: Finds directions of maximum variance
- **Best For**: Data with linear structure, feature compression
- **Results**:
  - Iris: 97.5% accuracy retention with 2 components
  - Digits: 52.4% accuracy retention with 2 components
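The PCA figures above come from the full pipeline in `implementation.ipynb`; as a minimal sketch (assuming standardized features, which matches the 95.8% Iris variance figure reported below), the fit looks like this:

```python
# Minimal PCA sketch on Iris (assumed preprocessing; see implementation.ipynb for the full pipeline)
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)  # PCA is scale-sensitive, so standardize first

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)  # shape: (150, 2)

# For standardized Iris, the first two components capture roughly 96% of the variance
print(f"Explained variance (2 components): {pca.explained_variance_ratio_.sum():.3f}")
```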
### 2. t-SNE (t-Distributed Stochastic Neighbor Embedding)

- **Type**: Non-linear manifold learning
- **Key Feature**: Preserves local neighborhood structure
- **Best For**: Data visualization, clustering analysis
- **Results**:
  - Iris: 105.0% accuracy retention
  - Digits: 100.4% accuracy retention

### 3. UMAP (Uniform Manifold Approximation and Projection)

- **Type**: Non-linear manifold learning
- **Key Feature**: Preserves both local and global structure
- **Best For**: Balanced visualization, scalable to large datasets
- **Results**:
  - Iris: 102.5% accuracy retention
  - Digits: 99.2% accuracy retention

### 4. Autoencoder (Neural Network)

- **Type**: Non-linear neural network approach
- **Key Feature**: Learns optimal encoding through reconstruction
- **Best For**: Complex non-linear relationships, customizable architectures
- **Architecture**: Input → 128 → 64 → Encoding → 64 → 128 → Output

## 🗂️ Project Structure

```
dimensionality-reduction/
├── implementation.ipynb              # Complete Jupyter notebook with theory and code
├── dimensionality_reduction.log      # Detailed execution logs
├── models/                           # Saved trained models
│   ├── pca_iris.pkl
│   ├── pca_digits.pkl
│   ├── umap_iris.pkl
│   ├── umap_digits.pkl
│   ├── autoencoder_iris.pth
│   └── autoencoder_digits.pth
├── results/                          # Analysis results
│   └── dimensionality_reduction_summary.json
├── visualizations/                   # Generated plots and comparisons
│   ├── pca_explained_variance.png
│   ├── iris_comparison.png
│   └── digits_comparison.png
└── README.md                         # This file
```

## 🚀 Quick Start

### Prerequisites

```bash
pip install numpy pandas scikit-learn matplotlib seaborn plotly umap-learn torch torchvision
```

### Running the Analysis

1. **Clone the repository**:
   ```bash
   git clone https://github.com/GruheshKurra/dimensionality-reduction.git
   cd dimensionality-reduction
   ```
2. **Install dependencies**:
   ```bash
   pip install -r requirements.txt
   ```
3. **Run the complete analysis**:
   ```bash
   jupyter notebook implementation.ipynb
   ```
   Or execute the main script:
   ```bash
   python main.py
   ```

## 📈 Results Summary

### Dataset Information

- **Iris Dataset**: 150 samples, 4 features, 3 classes
- **Digits Dataset**: 1797 samples, 64 features, 10 classes

### Performance Comparison (Accuracy Retention)

| Method | Iris Dataset | Digits Dataset |
|--------|--------------|----------------|
| PCA    | 97.5%        | 52.4%          |
| t-SNE  | 105.0%       | 100.4%         |
| UMAP   | 102.5%       | 99.2%          |

### Key Insights

- **PCA** works well for low-dimensional data (Iris) but struggles with high-dimensional, complex patterns (Digits)
- **t-SNE** excels at preserving local structure, sometimes even improving classification performance
- **UMAP** provides an excellent balance between local and global structure preservation
- **Autoencoders** offer flexibility but require careful tuning

## 🔍 Detailed Analysis

### PCA Explained Variance

- **Iris**: First 2 components explain 95.8% of variance
- **Digits**: First 2 components explain only 21.6% of variance

### Method Characteristics

| Aspect           | PCA    | t-SNE      | UMAP       | Autoencoder |
|------------------|--------|------------|------------|-------------|
| Linearity        | Linear | Non-linear | Non-linear | Non-linear  |
| Speed            | Fast   | Slow       | Medium     | Medium      |
| Deterministic    | Yes    | No         | Yes\*      | Yes\*       |
| New Data         | ✅     | ❌         | ✅         | ✅          |
| Interpretability | High   | Low        | Medium     | Low         |

\*With fixed random seed

## 📖 Educational Content

The `implementation.ipynb` notebook includes:

1. **Theory Explanation**: Mathematical foundations and intuitive explanations
2. **Step-by-step Implementation**: Detailed code with comprehensive comments
3. **Visual Comparisons**: Side-by-side plots showing method differences
4. **Performance Evaluation**: Classification accuracy retention analysis
5. **Best Practices**: When to use each method and parameter selection

## 🛠️ Technical Details

### Dependencies

- `numpy`: Numerical computing
- `pandas`: Data manipulation
- `scikit-learn`: Machine learning algorithms
- `matplotlib`, `seaborn`: Data visualization
- `umap-learn`: UMAP implementation
- `torch`: Neural network autoencoder
- `plotly`: Interactive visualizations

### Key Features

- **Comprehensive Logging**: Detailed execution logs for reproducibility
- **Model Persistence**: Save and load trained models
- **Evaluation Framework**: Systematic performance comparison
- **Visualization Suite**: Publication-quality plots
- **Structured Results**: JSON summary for further analysis

## 🎓 Learning Outcomes

After working through this project, you will understand:

1. **Mathematical Foundations**: How each method works mathematically
2. **Implementation Details**: How to implement these methods from scratch
3. **Performance Trade-offs**: When to use each method
4. **Evaluation Strategies**: How to assess dimensionality reduction quality
5. **Practical Applications**: Real-world use cases and considerations

## 🤝 Contributing

Contributions are welcome! Please feel free to:

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests if applicable
5. Submit a pull request

## 📄 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## 🔗 Links

- **GitHub Repository**: [dimensionality-reduction](https://github.com/GruheshKurra/dimensionality-reduction)
- **Hugging Face Space**: [karthik-2905/dimensionality-reduction](https://huggingface.co/karthik-2905/dimensionality-reduction)
- **Documentation**: [Implementation Notebook](implementation.ipynb)

## 📞 Contact

For questions or feedback, please:

- Open an issue on GitHub
- Contact the maintainer: [Karthik](mailto:karthik@example.com)

---

**Note**: This is an educational project designed to demonstrate dimensionality reduction techniques.
The implementations prioritize clarity and understanding over performance optimization.
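A note on the "accuracy retention" values reported above: they can exceed 100% because retention is a ratio of classifier accuracies (reduced features vs. original features), not a bounded percentage. A minimal sketch of the metric, assuming a logistic-regression probe and a held-out split (the exact classifier and split used in `implementation.ipynb` may differ):

```python
# Sketch of the accuracy-retention metric (assumed definition:
# classifier accuracy on embedded features / accuracy on raw features)
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X = StandardScaler().fit_transform(X)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42, stratify=y)

# Baseline: classifier trained on all 4 original features
baseline = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).score(X_te, y_te)

# Reduced: same classifier trained on a 2D PCA embedding (fit on train only)
pca = PCA(n_components=2).fit(X_tr)
reduced = LogisticRegression(max_iter=1000).fit(
    pca.transform(X_tr), y_tr
).score(pca.transform(X_te), y_te)

retention = reduced / baseline  # > 1.0 when the embedding helps the classifier
print(f"Accuracy retention: {retention:.1%}")
```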