# Dimensionality Reduction: Comprehensive Implementation and Analysis
[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![GitHub](https://img.shields.io/badge/github-dimensionality--reduction-black?logo=github)](https://github.com/GruheshKurra/dimensionality-reduction)
[![Hugging Face](https://img.shields.io/badge/🤗-Hugging_Face-yellow)](https://huggingface.co/karthik-2905/dimensionality-reduction)
A comprehensive implementation and analysis of dimensionality reduction techniques: PCA, t-SNE, UMAP, and autoencoders. This repository demonstrates the theory, implementation, and evaluation of each method on standard datasets.
## 🎯 Overview
Dimensionality reduction is crucial in machine learning for:
- **Data Visualization**: Projecting high-dimensional data to 2D/3D for human interpretation
- **Computational Efficiency**: Reducing feature space for faster processing
- **Noise Reduction**: Eliminating redundant or noisy features
- **Storage Optimization**: Compressing data while preserving essential information
This project provides a complete suite of dimensionality reduction methods with detailed explanations, implementations, and performance comparisons.
## 📊 Methods Implemented
### 1. Principal Component Analysis (PCA)
- **Type**: Linear dimensionality reduction
- **Key Feature**: Finds directions of maximum variance
- **Best For**: Data with linear structure, feature compression
- **Results**:
  - Iris: 97.5% accuracy retention with 2 components
  - Digits: 52.4% accuracy retention with 2 components
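A minimal sketch of this step with scikit-learn; standardizing before PCA is an assumed preprocessing choice, and the 2-component setting matches the results above:

```python
# Minimal PCA example on the Iris dataset (scikit-learn).
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# PCA is variance-based, so features are standardized first.
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

print(X_2d.shape)                     # (150, 2)
print(pca.explained_variance_ratio_)  # variance captured by each component
```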
### 2. t-SNE (t-Distributed Stochastic Neighbor Embedding)
- **Type**: Non-linear manifold learning
- **Key Feature**: Preserves local neighborhood structure
- **Best For**: Data visualization, clustering analysis
- **Results** (values above 100% mean the classifier performed better on the 2-D embedding than on the original features):
  - Iris: 105.0% accuracy retention
  - Digits: 100.4% accuracy retention
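A minimal scikit-learn sketch; `perplexity=30` and the fixed seed are illustrative settings rather than the notebook's exact parameters:

```python
# Minimal t-SNE example on the Digits dataset (scikit-learn).
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)

# perplexity balances local vs. global neighborhood size (5-50 is typical);
# random_state pins the otherwise stochastic embedding.
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_2d = tsne.fit_transform(X)

print(X_2d.shape)  # (1797, 2)
```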
### 3. UMAP (Uniform Manifold Approximation and Projection)
- **Type**: Non-linear manifold learning
- **Key Feature**: Preserves both local and global structure
- **Best For**: Balanced visualization, scalable to large datasets
- **Results**:
  - Iris: 102.5% accuracy retention
  - Digits: 99.2% accuracy retention
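A minimal `umap-learn` sketch; `n_neighbors=15` and `min_dist=0.1` are the library defaults, shown here as illustrative values:

```python
# Minimal UMAP example on the Digits dataset (umap-learn).
from sklearn.datasets import load_digits
import umap

X, y = load_digits(return_X_y=True)

# n_neighbors trades local detail for global structure; min_dist controls
# how tightly points may be packed in the embedding.
reducer = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1, random_state=42)
X_2d = reducer.fit_transform(X)

print(X_2d.shape)  # (1797, 2)
```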
### 4. Autoencoder (Neural Network)
- **Type**: Non-linear neural network approach
- **Key Feature**: Learns optimal encoding through reconstruction
- **Best For**: Complex non-linear relationships, customizable architectures
- **Architecture**: Input → 128 → 64 → Encoding → 64 → 128 → Output
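A PyTorch sketch of the architecture above; the ReLU activations, MSE reconstruction loss, and default `encoding_dim=2` are assumptions and may differ from the notebook's exact setup:

```python
# PyTorch sketch of the Input -> 128 -> 64 -> Encoding -> 64 -> 128 -> Output
# autoencoder described above. Activations and loss are assumed choices.
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim: int, encoding_dim: int = 2):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, encoding_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(encoding_dim, 64), nn.ReLU(),
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, input_dim),
        )

    def forward(self, x):
        z = self.encoder(x)      # low-dimensional code
        return self.decoder(z)   # reconstruction

model = Autoencoder(input_dim=64)  # e.g. the 64-feature Digits dataset
criterion = nn.MSELoss()           # reconstruction loss
```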
## 🗂️ Project Structure
```
dimensionality-reduction/
├── implementation.ipynb            # Complete Jupyter notebook with theory and code
├── dimensionality_reduction.log    # Detailed execution logs
├── models/                         # Saved trained models
│   ├── pca_iris.pkl
│   ├── pca_digits.pkl
│   ├── umap_iris.pkl
│   ├── umap_digits.pkl
│   ├── autoencoder_iris.pth
│   └── autoencoder_digits.pth
├── results/                        # Analysis results
│   └── dimensionality_reduction_summary.json
├── visualizations/                 # Generated plots and comparisons
│   ├── pca_explained_variance.png
│   ├── iris_comparison.png
│   └── digits_comparison.png
└── README.md                       # This file
```
## 🚀 Quick Start
### Prerequisites
```bash
pip install numpy pandas scikit-learn matplotlib seaborn plotly umap-learn torch torchvision
```
### Running the Analysis
1. **Clone the repository**:
```bash
git clone https://github.com/GruheshKurra/dimensionality-reduction.git
cd dimensionality-reduction
```
2. **Install dependencies**:
```bash
pip install -r requirements.txt
```
3. **Run the complete analysis**:
```bash
jupyter notebook implementation.ipynb
```
Or execute the main script:
```bash
python main.py
```
## 📈 Results Summary
### Dataset Information
- **Iris Dataset**: 150 samples, 4 features, 3 classes
- **Digits Dataset**: 1797 samples, 64 features, 10 classes
### Performance Comparison (Accuracy Retention)
| Method | Iris Dataset | Digits Dataset |
|--------|-------------|----------------|
| PCA | 97.5% | 52.4% |
| t-SNE | 105.0% | 100.4% |
| UMAP | 102.5% | 99.2% |
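"Accuracy retention" is the accuracy of a classifier trained on the reduced features, expressed as a percentage of its accuracy on the original features; values above 100% mean the embedding actually improved classification. A minimal sketch of the metric, assuming a k-NN classifier and PCA as the reducer (the notebook's classifier choice may differ):

```python
# Accuracy retention = accuracy on reduced features / accuracy on
# original features * 100. k-NN and PCA are illustrative choices.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

def knn_accuracy(train, test):
    return KNeighborsClassifier().fit(train, y_train).score(test, y_test)

baseline = knn_accuracy(X_train, X_test)

pca = PCA(n_components=2).fit(X_train)
reduced = knn_accuracy(pca.transform(X_train), pca.transform(X_test))

print(f"accuracy retention: {100 * reduced / baseline:.1f}%")
```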
### Key Insights
- **PCA** works well for low-dimensional data (Iris) but struggles with high-dimensional complex patterns (Digits)
- **t-SNE** excels at preserving local structure, sometimes even improving classification performance
- **UMAP** provides excellent balance between local and global structure preservation
- **Autoencoders** offer flexibility but require careful tuning
## 🔍 Detailed Analysis
### PCA Explained Variance
- **Iris**: First 2 components explain 95.8% of variance
- **Digits**: First 2 components explain only 21.6% of variance
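These numbers come from PCA's `explained_variance_ratio_`. A short sketch showing how to inspect the cumulative ratio and pick a component count (the 95% target is illustrative):

```python
# Cumulative explained variance on Digits: how many components are
# needed to keep a target share (here 95%) of the total variance.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)

pca = PCA().fit(X)  # keep all 64 components
cumulative = np.cumsum(pca.explained_variance_ratio_)

print(cumulative[1])                           # variance captured by the first 2
print(int(np.argmax(cumulative >= 0.95)) + 1)  # components needed for 95%
```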
### Method Characteristics
| Aspect | PCA | t-SNE | UMAP | Autoencoder |
|--------|-----|-------|------|-------------|
| Linearity | Linear | Non-linear | Non-linear | Non-linear |
| Speed | Fast | Slow | Medium | Medium |
| Deterministic | Yes | No | Yes* | Yes* |
| New Data | ✅ | ❌ | ✅ | ✅ |
| Interpretability | High | Low | Medium | Low |
\*With a fixed random seed
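The "New Data" row is about embedding unseen samples: a fitted PCA, UMAP, or autoencoder can project new rows, whereas scikit-learn's `TSNE` only provides `fit_transform` on the data it was fitted to. A minimal PCA illustration:

```python
# PCA (like UMAP and a trained autoencoder) can embed samples it has
# never seen; scikit-learn's TSNE has no equivalent transform() method.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
X_fit, X_new = X[:100], X[100:]

pca = PCA(n_components=2).fit(X_fit)
print(pca.transform(X_new).shape)  # (50, 2) -- unseen rows projected
```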
## 📖 Educational Content
The `implementation.ipynb` notebook includes:
1. **Theory Explanation**: Mathematical foundations and intuitive explanations
2. **Step-by-step Implementation**: Detailed code with comprehensive comments
3. **Visual Comparisons**: Side-by-side plots showing method differences
4. **Performance Evaluation**: Classification accuracy retention analysis
5. **Best Practices**: When to use each method and parameter selection
## 🛠️ Technical Details
### Dependencies
- `numpy`: Numerical computing
- `pandas`: Data manipulation
- `scikit-learn`: Machine learning algorithms
- `matplotlib`, `seaborn`: Data visualization
- `umap-learn`: UMAP implementation
- `torch`: Neural network autoencoder
- `plotly`: Interactive visualizations
### Key Features
- **Comprehensive Logging**: Detailed execution logs for reproducibility
- **Model Persistence**: Save and load trained models (see the sketch after this list)
- **Evaluation Framework**: Systematic performance comparison
- **Visualization Suite**: Publication-quality plots
- **Structured Results**: JSON summary for further analysis
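A sketch of the save/load pattern implied by the `models/` directory; using pickle for the scikit-learn and UMAP reducers and `torch.save` for autoencoder weights is an assumption based on the `.pkl` and `.pth` extensions:

```python
# Hypothetical save/load round-trip matching the models/ layout above.
import os
import pickle

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
pca = PCA(n_components=2).fit(X)

os.makedirs("models", exist_ok=True)
with open("models/pca_iris.pkl", "wb") as f:
    pickle.dump(pca, f)

with open("models/pca_iris.pkl", "rb") as f:
    pca_loaded = pickle.load(f)

# The .pth files suggest PyTorch state dicts (model is a torch.nn.Module):
# torch.save(model.state_dict(), "models/autoencoder_iris.pth")
# model.load_state_dict(torch.load("models/autoencoder_iris.pth"))
```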
## 🎓 Learning Outcomes
After working through this project, you will understand:
1. **Mathematical Foundations**: How each method works mathematically
2. **Implementation Details**: How to implement these methods from scratch
3. **Performance Trade-offs**: When to use each method
4. **Evaluation Strategies**: How to assess dimensionality reduction quality
5. **Practical Applications**: Real-world use cases and considerations
## 🤝 Contributing
Contributions are welcome! Please feel free to:
1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests if applicable
5. Submit a pull request
## 📄 License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## 🔗 Links
- **GitHub Repository**: [dimensionality-reduction](https://github.com/GruheshKurra/dimensionality-reduction)
- **Hugging Face Space**: [karthik-2905/dimensionality-reduction](https://huggingface.co/karthik-2905/dimensionality-reduction)
- **Documentation**: [Implementation Notebook](implementation.ipynb)
## 📞 Contact
For questions or feedback, please:
- Open an issue on GitHub
- Contact the maintainer: [Karthik](mailto:karthik@example.com)
---
**Note**: This is an educational project designed to demonstrate dimensionality reduction techniques. The implementations prioritize clarity and understanding over performance optimization.