# Dimensionality Reduction: Comprehensive Implementation and Analysis
[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![GitHub](https://img.shields.io/badge/github-dimensionality--reduction-black?logo=github)](https://github.com/GruheshKurra/dimensionality-reduction)
[![Hugging Face](https://img.shields.io/badge/🤗-Hugging_Face-yellow)](https://huggingface.co/karthik-2905/dimensionality-reduction)
A comprehensive implementation and analysis of dimensionality reduction techniques: PCA, t-SNE, UMAP, and autoencoders. This repository demonstrates the theory, implementation, and evaluation of each method on standard datasets.
## 🎯 Overview
Dimensionality reduction is crucial in machine learning for:
- **Data Visualization**: Projecting high-dimensional data to 2D/3D for human interpretation
- **Computational Efficiency**: Reducing feature space for faster processing
- **Noise Reduction**: Eliminating redundant or noisy features
- **Storage Optimization**: Compressing data while preserving essential information
This project provides a complete suite of dimensionality reduction methods with detailed explanations, implementations, and performance comparisons.
## 📊 Methods Implemented
### 1. Principal Component Analysis (PCA)
- **Type**: Linear dimensionality reduction
- **Key Feature**: Finds directions of maximum variance
- **Best For**: Data with linear structure, feature compression
- **Results**:
  - Iris: 97.5% accuracy retention with 2 components
  - Digits: 52.4% accuracy retention with 2 components
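A minimal sketch of this step with scikit-learn; standardizing before PCA is an assumed preprocessing choice, and the 2-component setting matches the results above:

```python
# Minimal PCA example on the Iris dataset (scikit-learn).
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# PCA is variance-based, so features are standardized first.
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

print(X_2d.shape)                     # (150, 2)
print(pca.explained_variance_ratio_)  # variance captured by each component
```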
### 2. t-SNE (t-Distributed Stochastic Neighbor Embedding)
- **Type**: Non-linear manifold learning
- **Key Feature**: Preserves local neighborhood structure
- **Best For**: Data visualization, clustering analysis
- **Results** (values above 100% mean the classifier performed better on the 2-D embedding than on the original features):
  - Iris: 105.0% accuracy retention
  - Digits: 100.4% accuracy retention
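A minimal scikit-learn sketch; `perplexity=30` and the fixed seed are illustrative settings rather than the notebook's exact parameters:

```python
# Minimal t-SNE example on the Digits dataset (scikit-learn).
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)

# perplexity balances local vs. global neighborhood size (5-50 is typical);
# random_state pins the otherwise stochastic embedding.
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_2d = tsne.fit_transform(X)

print(X_2d.shape)  # (1797, 2)
```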
### 3. UMAP (Uniform Manifold Approximation and Projection)
- **Type**: Non-linear manifold learning
- **Key Feature**: Preserves both local and global structure
- **Best For**: Balanced visualization, scalable to large datasets
- **Results**:
  - Iris: 102.5% accuracy retention
  - Digits: 99.2% accuracy retention
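A minimal `umap-learn` sketch; `n_neighbors=15` and `min_dist=0.1` are the library defaults, shown here as illustrative values:

```python
# Minimal UMAP example on the Digits dataset (umap-learn).
from sklearn.datasets import load_digits
import umap

X, y = load_digits(return_X_y=True)

# n_neighbors trades local detail for global structure; min_dist controls
# how tightly points may be packed in the embedding.
reducer = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1, random_state=42)
X_2d = reducer.fit_transform(X)

print(X_2d.shape)  # (1797, 2)
```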
### 4. Autoencoder (Neural Network)
- **Type**: Non-linear neural network approach
- **Key Feature**: Learns optimal encoding through reconstruction
- **Best For**: Complex non-linear relationships, customizable architectures
- **Architecture**: Input → 128 → 64 → Encoding → 64 → 128 → Output
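A PyTorch sketch of the architecture above; the ReLU activations, MSE reconstruction loss, and default `encoding_dim=2` are assumptions and may differ from the notebook's exact setup:

```python
# PyTorch sketch of the Input -> 128 -> 64 -> Encoding -> 64 -> 128 -> Output
# autoencoder described above. Activations and loss are assumed choices.
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim: int, encoding_dim: int = 2):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, encoding_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(encoding_dim, 64), nn.ReLU(),
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, input_dim),
        )

    def forward(self, x):
        z = self.encoder(x)      # low-dimensional code
        return self.decoder(z)   # reconstruction

model = Autoencoder(input_dim=64)  # e.g. the 64-feature Digits dataset
criterion = nn.MSELoss()           # reconstruction loss
```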
## 🗂️ Project Structure
```
dimensionality-reduction/
├── implementation.ipynb            # Complete Jupyter notebook with theory and code
├── dimensionality_reduction.log    # Detailed execution logs
├── models/                         # Saved trained models
│   ├── pca_iris.pkl
│   ├── pca_digits.pkl
│   ├── umap_iris.pkl
│   ├── umap_digits.pkl
│   ├── autoencoder_iris.pth
│   └── autoencoder_digits.pth
├── results/                        # Analysis results
│   └── dimensionality_reduction_summary.json
├── visualizations/                 # Generated plots and comparisons
│   ├── pca_explained_variance.png
│   ├── iris_comparison.png
│   └── digits_comparison.png
└── README.md                       # This file
```
## 🚀 Quick Start
### Prerequisites
```bash
pip install numpy pandas scikit-learn matplotlib seaborn plotly umap-learn torch torchvision
```
### Running the Analysis
1. **Clone the repository**:
```bash
git clone https://github.com/GruheshKurra/dimensionality-reduction.git
cd dimensionality-reduction
```
2. **Install dependencies**:
```bash
pip install -r requirements.txt
```
3. **Run the complete analysis**:
```bash
jupyter notebook implementation.ipynb
```
Or execute the main script:
```bash
python main.py
```
## 📈 Results Summary
### Dataset Information
- **Iris Dataset**: 150 samples, 4 features, 3 classes
- **Digits Dataset**: 1797 samples, 64 features, 10 classes
### Performance Comparison (Accuracy Retention)
| Method | Iris Dataset | Digits Dataset |
|--------|-------------|----------------|
| PCA | 97.5% | 52.4% |
| t-SNE | 105.0% | 100.4% |
| UMAP | 102.5% | 99.2% |
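"Accuracy retention" is the accuracy of a classifier trained on the reduced features, expressed as a percentage of its accuracy on the original features; values above 100% mean the embedding actually improved classification. A minimal sketch of the metric, assuming a k-NN classifier and PCA as the reducer (the notebook's classifier choice may differ):

```python
# Accuracy retention = accuracy on reduced features / accuracy on
# original features * 100. k-NN and PCA are illustrative choices.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

def knn_accuracy(train, test):
    return KNeighborsClassifier().fit(train, y_train).score(test, y_test)

baseline = knn_accuracy(X_train, X_test)

pca = PCA(n_components=2).fit(X_train)
reduced = knn_accuracy(pca.transform(X_train), pca.transform(X_test))

print(f"accuracy retention: {100 * reduced / baseline:.1f}%")
```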
### Key Insights
- **PCA** works well for low-dimensional data (Iris) but struggles with high-dimensional complex patterns (Digits)
- **t-SNE** excels at preserving local structure, sometimes even improving classification performance
- **UMAP** provides excellent balance between local and global structure preservation
- **Autoencoders** offer flexibility but require careful tuning
## 🔍 Detailed Analysis
### PCA Explained Variance
- **Iris**: First 2 components explain 95.8% of variance
- **Digits**: First 2 components explain only 21.6% of variance
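These numbers come from PCA's `explained_variance_ratio_`. A short sketch showing how to inspect the cumulative ratio and pick a component count (the 95% target is illustrative):

```python
# Cumulative explained variance on Digits: how many components are
# needed to keep a target share (here 95%) of the total variance.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)

pca = PCA().fit(X)  # keep all 64 components
cumulative = np.cumsum(pca.explained_variance_ratio_)

print(cumulative[1])                           # variance captured by the first 2
print(int(np.argmax(cumulative >= 0.95)) + 1)  # components needed for 95%
```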
### Method Characteristics
| Aspect | PCA | t-SNE | UMAP | Autoencoder |
|--------|-----|-------|------|-------------|
| Linearity | Linear | Non-linear | Non-linear | Non-linear |
| Speed | Fast | Slow | Medium | Medium |
| Deterministic | Yes | No | Yes* | Yes* |
| New Data | ✅ | ❌ | ✅ | ✅ |
| Interpretability | High | Low | Medium | Low |
\*With a fixed random seed
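The "New Data" row is about embedding unseen samples: a fitted PCA, UMAP, or autoencoder can project new rows, whereas scikit-learn's `TSNE` only provides `fit_transform` on the data it was fitted to. A minimal PCA illustration:

```python
# PCA (like UMAP and a trained autoencoder) can embed samples it has
# never seen; scikit-learn's TSNE has no equivalent transform() method.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
X_fit, X_new = X[:100], X[100:]

pca = PCA(n_components=2).fit(X_fit)
print(pca.transform(X_new).shape)  # (50, 2) -- unseen rows projected
```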
## 📖 Educational Content
The `implementation.ipynb` notebook includes:
1. **Theory Explanation**: Mathematical foundations and intuitive explanations
2. **Step-by-step Implementation**: Detailed code with comprehensive comments
3. **Visual Comparisons**: Side-by-side plots showing method differences
4. **Performance Evaluation**: Classification accuracy retention analysis
5. **Best Practices**: When to use each method and parameter selection
## 🛠️ Technical Details
### Dependencies
- `numpy`: Numerical computing
- `pandas`: Data manipulation
- `scikit-learn`: Machine learning algorithms
- `matplotlib`, `seaborn`: Data visualization
- `umap-learn`: UMAP implementation
- `torch`: Neural network autoencoder
- `plotly`: Interactive visualizations
### Key Features
- **Comprehensive Logging**: Detailed execution logs for reproducibility
- **Model Persistence**: Save and load trained models (see the sketch after this list)
- **Evaluation Framework**: Systematic performance comparison
- **Visualization Suite**: Publication-quality plots
- **Structured Results**: JSON summary for further analysis
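A sketch of the save/load pattern implied by the `models/` directory; using pickle for the scikit-learn and UMAP reducers and `torch.save` for autoencoder weights is an assumption based on the `.pkl` and `.pth` extensions:

```python
# Hypothetical save/load round-trip matching the models/ layout above.
import os
import pickle

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
pca = PCA(n_components=2).fit(X)

os.makedirs("models", exist_ok=True)
with open("models/pca_iris.pkl", "wb") as f:
    pickle.dump(pca, f)

with open("models/pca_iris.pkl", "rb") as f:
    pca_loaded = pickle.load(f)

# The .pth files suggest PyTorch state dicts (model is a torch.nn.Module):
# torch.save(model.state_dict(), "models/autoencoder_iris.pth")
# model.load_state_dict(torch.load("models/autoencoder_iris.pth"))
```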
## 🎓 Learning Outcomes
After working through this project, you will understand:
1. **Mathematical Foundations**: How each method works mathematically
2. **Implementation Details**: How to implement these methods from scratch
3. **Performance Trade-offs**: When to use each method
4. **Evaluation Strategies**: How to assess dimensionality reduction quality
5. **Practical Applications**: Real-world use cases and considerations
## 🤝 Contributing
Contributions are welcome! Please feel free to:
1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests if applicable
5. Submit a pull request
## 📄 License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## 🔗 Links
- **GitHub Repository**: [dimensionality-reduction](https://github.com/GruheshKurra/dimensionality-reduction)
- **Hugging Face Space**: [karthik-2905/dimensionality-reduction](https://huggingface.co/karthik-2905/dimensionality-reduction)
- **Documentation**: [Implementation Notebook](implementation.ipynb)
## 📞 Contact
For questions or feedback, please:
- Open an issue on GitHub
- Contact the maintainer: [Karthik](mailto:karthik@example.com)
---
**Note**: This is an educational project designed to demonstrate dimensionality reduction techniques. The implementations prioritize clarity and understanding over performance optimization.