{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# ML Practice Series: Module 15 - Dimensionality Reduction (PCA)\n",
"\n",
"Welcome to Module 15! We're exploring **PCA (Principal Component Analysis)**, a technique for reducing the number of variables in your data while preserving as much information as possible.\n",
"\n",
"### Resources:\n",
"Refer to the **[Mathematics for Data Science](https://aashishgarg13.github.io/DataScience/math-ds-complete/)** section on your hub for the Linear Algebra (Eigenvalues/Eigenvectors) behind PCA.\n",
"\n",
"### Objectives:\n",
"1. **Information Compression**: Reducing the number of features while preserving the underlying patterns.\n",
"2. **Visualization**: Plotting high-dimensional data in 2D or 3D.\n",
"3. **Explained Variance**: Understanding how many components we actually need.\n",
"\n",
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Setup\n",
"We will use the **Digits** dataset (8x8 images of handwritten digits), which, when flattened, has 64 features per sample."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"from sklearn.datasets import load_digits\n",
"from sklearn.preprocessing import StandardScaler\n",
"from sklearn.decomposition import PCA\n",
"\n",
"# Load dataset\n",
"digits = load_digits()\n",
"X = digits.data\n",
"y = digits.target\n",
"\n",
"print(\"Original Shape:\", X.shape)"
]
},
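{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before reducing anything, it helps to see what one sample looks like. A minimal sketch (it assumes the setup cell above has been run; `digits.images` is the unflattened 8x8 version of the data that `load_digits` also provides):\n",
"\n",
"```python\n",
"# Render the first sample as an 8x8 grayscale image with its label\n",
"plt.figure(figsize=(2, 2))\n",
"plt.imshow(digits.images[0], cmap='gray_r')\n",
"plt.title(f\"Label: {y[0]}\")\n",
"plt.axis('off')\n",
"plt.show()\n",
"```"
]
},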
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Visualization via PCA\n",
"\n",
"### Task 1: 2D Projection\n",
"Reduce the 64 features down to 2 and visualize the digits on a scatter plot.\n",
"\n",
"*Web Reference: Check [Data Visualization](https://aashishgarg13.github.io/DataScience/Visualization/) for how to present these results.*"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# YOUR CODE HERE\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<details>\n",
"<summary><b>Click to see Solution</b></summary>\n",
"\n",
"```python\n",
"scaler = StandardScaler()\n",
"X_scaled = scaler.fit_transform(X)\n",
"\n",
"pca = PCA(n_components=2)\n",
"X_pca = pca.fit_transform(X_scaled)\n",
"\n",
"plt.figure(figsize=(10, 8))\n",
"plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='tab10', alpha=0.7)\n",
"plt.colorbar(label='Digit Label')\n",
"plt.title('Digits Dataset: 64D flattened to 2D via PCA')\n",
"plt.xlabel('PC1')\n",
"plt.ylabel('PC2')\n",
"plt.show()\n",
"```\n",
"</details>"
]
},
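{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick sanity check before moving on: the 2D picture keeps only part of the total variance. A minimal sketch (assuming the `pca` object fitted in the solution above):\n",
"\n",
"```python\n",
"# Fraction of total variance retained by the first two components\n",
"print(\"Variance kept by PC1 + PC2:\", pca.explained_variance_ratio_.sum())\n",
"```\n",
"\n",
"Expect a fraction well under half; this is exactly why Section 3 asks how many components are actually needed."
]
},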
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Selecting Components\n",
"\n",
"### Task 2: Scree Plot\n",
"Calculate the cumulative explained variance across all components and identify how many are needed to retain 95% of the variance."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# YOUR CODE HERE\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<details>\n",
"<summary><b>Click to see Solution</b></summary>\n",
"\n",
"```python\n",
"pca_full = PCA().fit(X_scaled)\n",
"plt.plot(np.cumsum(pca_full.explained_variance_ratio_))\n",
"plt.xlabel('Number of Components')\n",
"plt.ylabel('Cumulative Explained Variance')\n",
"plt.axhline(y=0.95, color='r', linestyle='--')\n",
"plt.title('Scree Plot: Finding the Elbow')\n",
"plt.show()\n",
"```\n",
"</details>"
]
},
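{
"cell_type": "markdown",
"metadata": {},
"source": [
"Rather than reading the count off the plot, you can compute it directly. A minimal sketch (assuming the `pca_full` fit from the solution above); note that scikit-learn also accepts a float `n_components` between 0 and 1 and picks the component count for that variance target itself:\n",
"\n",
"```python\n",
"# Smallest number of components whose cumulative variance reaches 95%\n",
"cumvar = np.cumsum(pca_full.explained_variance_ratio_)\n",
"n_95 = np.argmax(cumvar >= 0.95) + 1\n",
"print(f\"Components needed for 95% variance: {n_95}\")\n",
"\n",
"# Equivalent shortcut: let PCA choose the count from a variance target\n",
"pca_95 = PCA(n_components=0.95).fit(X_scaled)\n",
"print(\"PCA chose:\", pca_95.n_components_)\n",
"```"
]
},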
{
"cell_type": "markdown",
"metadata": {},
"source": [
"--- \n",
"### Excellent Compression! \n",
"You've learned how to simplify complex data without losing the big picture.\n",
"Next: **Advanced Clustering (DBSCAN & Hierarchical)**."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.7"
}
},
"nbformat": 4,
"nbformat_minor": 4
} |