{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# ML Practice Series: Module 15 - Dimensionality Reduction (PCA)\n", "\n", "Welcome to Module 15! We're exploring **PCA (Principal Component Analysis)**, a technique for reducing the number of variables in your data while preserving as much information as possible.\n", "\n", "### Resources:\n", "Refer to the **[Mathematics for Data Science](https://aashishgarg13.github.io/DataScience/math-ds-complete/)** section on your hub for the Linear Algebra (Eigenvalues/Eigenvectors) behind PCA.\n", "\n", "### Objectives:\n", "1. **Information Compression**: Reducing the number of features without losing the patterns that distinguish the classes.\n", "2. **Visualization**: Plotting high-dimensional data in 2D or 3D.\n", "3. **Explained Variance**: Understanding how many components we actually need.\n", "\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Setup\n", "We will use the **Digits** dataset (8x8 images of handwritten digits). Each image, once flattened, has 64 features." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "from sklearn.datasets import load_digits\n", "from sklearn.preprocessing import StandardScaler\n", "from sklearn.decomposition import PCA\n", "\n", "# Load dataset\n", "digits = load_digits()\n", "X = digits.data\n", "y = digits.target\n", "\n", "print(\"Original Shape:\", X.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. 
Visualization via PCA\n", "\n", "### Task 1: 2D Projection\n", "Reduce the 64 features down to 2 and visualize the digits on a scatter plot.\n", "\n", "*Web Reference: Check [Data Visualization](https://aashishgarg13.github.io/DataScience/Visualization/) for how to present these results.*" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# YOUR CODE HERE\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Click to see Solution\n", "\n", "```python\n", "# Standardize so every pixel contributes on a comparable scale\n", "scaler = StandardScaler()\n", "X_scaled = scaler.fit_transform(X)\n", "\n", "# Project the 64 standardized features onto the first two principal components\n", "pca = PCA(n_components=2)\n", "X_pca = pca.fit_transform(X_scaled)\n", "\n", "# Color each projected point by its digit label\n", "plt.figure(figsize=(10, 8))\n", "plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='tab10', alpha=0.7)\n", "plt.colorbar(label='Digit Label')\n", "plt.title('Digits Dataset: 64D flattened to 2D via PCA')\n", "plt.xlabel('PC1')\n", "plt.ylabel('PC2')\n", "plt.show()\n", "```\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Selecting Components\n", "\n", "### Task 2: Scree Plot\n", "Calculate the cumulative explained variance for all components and identify how many are needed to keep 95% of the information." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# YOUR CODE HERE\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
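\n", "Click to see Solution\n", "\n", "```python\n", "pca_full = PCA().fit(X_scaled)\n", "cum_var = np.cumsum(pca_full.explained_variance_ratio_)\n", "\n", "# Smallest number of components whose cumulative variance reaches 95%\n", "n_95 = int(np.argmax(cum_var >= 0.95)) + 1\n", "print('Components needed for 95% variance:', n_95)\n", "# Use 1-based component counts so the x-axis matches its label\n", "plt.plot(np.arange(1, len(cum_var) + 1), cum_var)\n", "plt.xlabel('Number of Components')\n", "plt.ylabel('Cumulative Explained Variance')\n", "plt.axhline(y=0.95, color='r', linestyle='--')\n", "plt.title('Cumulative Explained Variance: Finding the 95% Threshold')\n", "plt.show()\n", "```\n", "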
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "--- \n", "### Excellent Compression! \n", "You've learned how to simplify complex data without losing the big picture.\n", "Next: **Advanced Clustering (DBSCAN & Hierarchical)**." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.7" } }, "nbformat": 4, "nbformat_minor": 4 }