# Dimensionality Reduction: Comprehensive Implementation and Analysis

[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/) [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT) [![GitHub](https://img.shields.io/badge/github-dimensionality--reduction-black?logo=github)](https://github.com/GruheshKurra/dimensionality-reduction) [![Hugging Face](https://img.shields.io/badge/🤗-Hugging_Face-yellow)](https://huggingface.co/karthik-2905/dimensionality-reduction)

A comprehensive implementation and analysis of dimensionality reduction techniques including PCA, t-SNE, UMAP, and Autoencoders. This repository demonstrates the theory, implementation, and evaluation of these methods on standard datasets.

## 🎯 Overview

Dimensionality reduction is crucial in machine learning for:

- **Data Visualization**: Projecting high-dimensional data to 2D/3D for human interpretation
- **Computational Efficiency**: Reducing feature space for faster processing
- **Noise Reduction**: Eliminating redundant or noisy features
- **Storage Optimization**: Compressing data while preserving essential information

This project provides a complete suite of dimensionality reduction methods with detailed explanations, implementations, and performance comparisons.

## 📊 Methods Implemented

### 1. Principal Component Analysis (PCA)

- **Type**: Linear dimensionality reduction
- **Key Feature**: Finds directions of maximum variance
- **Best For**: Data with linear structure, feature compression
- **Results**:
  - Iris: 97.5% accuracy retention with 2 components
  - Digits: 52.4% accuracy retention with 2 components
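The PCA figures above come from the full pipeline in `implementation.ipynb`; as a minimal sketch (assuming standardized features, which matches the 95.8% Iris variance figure reported below), the fit looks like this:

```python
# Minimal PCA sketch on Iris (assumed preprocessing; see implementation.ipynb for the full pipeline)
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)  # PCA is scale-sensitive, so standardize first

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)  # shape: (150, 2)

# For standardized Iris, the first two components capture roughly 96% of the variance
print(f"Explained variance (2 components): {pca.explained_variance_ratio_.sum():.3f}")
```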
### 2. t-SNE (t-Distributed Stochastic Neighbor Embedding)

- **Type**: Non-linear manifold learning
- **Key Feature**: Preserves local neighborhood structure
- **Best For**: Data visualization, clustering analysis
- **Results**:
  - Iris: 105.0% accuracy retention
  - Digits: 100.4% accuracy retention

### 3. UMAP (Uniform Manifold Approximation and Projection)

- **Type**: Non-linear manifold learning
- **Key Feature**: Preserves both local and global structure
- **Best For**: Balanced visualization, scalable to large datasets
- **Results**:
  - Iris: 102.5% accuracy retention
  - Digits: 99.2% accuracy retention

### 4. Autoencoder (Neural Network)

- **Type**: Non-linear neural network approach
- **Key Feature**: Learns optimal encoding through reconstruction
- **Best For**: Complex non-linear relationships, customizable architectures
- **Architecture**: Input → 128 → 64 → Encoding → 64 → 128 → Output

## 🗂️ Project Structure

```
dimensionality-reduction/
├── implementation.ipynb              # Complete Jupyter notebook with theory and code
├── dimensionality_reduction.log      # Detailed execution logs
├── models/                           # Saved trained models
│   ├── pca_iris.pkl
│   ├── pca_digits.pkl
│   ├── umap_iris.pkl
│   ├── umap_digits.pkl
│   ├── autoencoder_iris.pth
│   └── autoencoder_digits.pth
├── results/                          # Analysis results
│   └── dimensionality_reduction_summary.json
├── visualizations/                   # Generated plots and comparisons
│   ├── pca_explained_variance.png
│   ├── iris_comparison.png
│   └── digits_comparison.png
└── README.md                         # This file
```

## 🚀 Quick Start

### Prerequisites

```bash
pip install numpy pandas scikit-learn matplotlib seaborn plotly umap-learn torch torchvision
```

### Running the Analysis

1. **Clone the repository**:
   ```bash
   git clone https://github.com/GruheshKurra/dimensionality-reduction.git
   cd dimensionality-reduction
   ```
2. **Install dependencies**:
   ```bash
   pip install -r requirements.txt
   ```
3. **Run the complete analysis**:
   ```bash
   jupyter notebook implementation.ipynb
   ```
   Or execute the main script:
   ```bash
   python main.py
   ```

## 📈 Results Summary

### Dataset Information

- **Iris Dataset**: 150 samples, 4 features, 3 classes
- **Digits Dataset**: 1797 samples, 64 features, 10 classes

### Performance Comparison (Accuracy Retention)

| Method | Iris Dataset | Digits Dataset |
|--------|--------------|----------------|
| PCA    | 97.5%        | 52.4%          |
| t-SNE  | 105.0%       | 100.4%         |
| UMAP   | 102.5%       | 99.2%          |

### Key Insights

- **PCA** works well for low-dimensional data (Iris) but struggles with high-dimensional, complex patterns (Digits)
- **t-SNE** excels at preserving local structure, sometimes even improving classification performance
- **UMAP** provides an excellent balance between local and global structure preservation
- **Autoencoders** offer flexibility but require careful tuning

## 🔍 Detailed Analysis

### PCA Explained Variance

- **Iris**: First 2 components explain 95.8% of variance
- **Digits**: First 2 components explain only 21.6% of variance

### Method Characteristics

| Aspect           | PCA    | t-SNE      | UMAP       | Autoencoder |
|------------------|--------|------------|------------|-------------|
| Linearity        | Linear | Non-linear | Non-linear | Non-linear  |
| Speed            | Fast   | Slow       | Medium     | Medium      |
| Deterministic    | Yes    | No         | Yes\*      | Yes\*       |
| New Data         | ✅     | ❌         | ✅         | ✅          |
| Interpretability | High   | Low        | Medium     | Low         |

\*With fixed random seed

## 📖 Educational Content

The `implementation.ipynb` notebook includes:

1. **Theory Explanation**: Mathematical foundations and intuitive explanations
2. **Step-by-step Implementation**: Detailed code with comprehensive comments
3. **Visual Comparisons**: Side-by-side plots showing method differences
4. **Performance Evaluation**: Classification accuracy retention analysis
5. **Best Practices**: When to use each method and parameter selection

## 🛠️ Technical Details

### Dependencies

- `numpy`: Numerical computing
- `pandas`: Data manipulation
- `scikit-learn`: Machine learning algorithms
- `matplotlib`, `seaborn`: Data visualization
- `umap-learn`: UMAP implementation
- `torch`: Neural network autoencoder
- `plotly`: Interactive visualizations

### Key Features

- **Comprehensive Logging**: Detailed execution logs for reproducibility
- **Model Persistence**: Save and load trained models
- **Evaluation Framework**: Systematic performance comparison
- **Visualization Suite**: Publication-quality plots
- **Structured Results**: JSON summary for further analysis

## 🎓 Learning Outcomes

After working through this project, you will understand:

1. **Mathematical Foundations**: How each method works mathematically
2. **Implementation Details**: How to implement these methods from scratch
3. **Performance Trade-offs**: When to use each method
4. **Evaluation Strategies**: How to assess dimensionality reduction quality
5. **Practical Applications**: Real-world use cases and considerations

## 🤝 Contributing

Contributions are welcome! Please feel free to:

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests if applicable
5. Submit a pull request

## 📄 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## 🔗 Links

- **GitHub Repository**: [dimensionality-reduction](https://github.com/GruheshKurra/dimensionality-reduction)
- **Hugging Face Space**: [karthik-2905/dimensionality-reduction](https://huggingface.co/karthik-2905/dimensionality-reduction)
- **Documentation**: [Implementation Notebook](implementation.ipynb)

## 📞 Contact

For questions or feedback, please:

- Open an issue on GitHub
- Contact the maintainer: [Karthik](mailto:karthik@example.com)

---

**Note**: This is an educational project designed to demonstrate dimensionality reduction techniques.
The implementations prioritize clarity and understanding over performance optimization.
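A note on the "accuracy retention" values reported above: they can exceed 100% because retention is a ratio of classifier accuracies (reduced features vs. original features), not a bounded percentage. A minimal sketch of the metric, assuming a logistic-regression probe and a held-out split (the exact classifier and split used in `implementation.ipynb` may differ):

```python
# Sketch of the accuracy-retention metric (assumed definition:
# classifier accuracy on embedded features / accuracy on raw features)
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X = StandardScaler().fit_transform(X)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42, stratify=y)

# Baseline: classifier trained on all 4 original features
baseline = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).score(X_te, y_te)

# Reduced: same classifier trained on a 2D PCA embedding (fit on train only)
pca = PCA(n_components=2).fit(X_tr)
reduced = LogisticRegression(max_iter=1000).fit(
    pca.transform(X_tr), y_tr
).score(pca.transform(X_te), y_te)

retention = reduced / baseline  # > 1.0 when the embedding helps the classifier
print(f"Accuracy retention: {retention:.1%}")
```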