File size: 6,218 Bytes
854c114
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
{
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "# ML Practice Series: Module 15 - Dimensionality Reduction (PCA)\n",
                "\n",
                "Welcome to Module 15! We're exploring **PCA (Principal Component Analysis)**, a technique for reducing the number of variables in your data while preserving as much information as possible.\n",
                "\n",
                "### Resources:\n",
                "Refer to the **[Mathematics for Data Science](https://aashishgarg13.github.io/DataScience/math-ds-complete/)** section on your hub for the Linear Algebra (Eigenvalues/Eigenvectors) behind PCA.\n",
                "\n",
                "### Objectives:\n",
                "1. **Information Compression**: Reducing features without losing pattern labels.\n",
                "2. **Visualization**: Plotting high-dimensional data in 2D or 3D.\n",
                "3. **Explained Variance**: Understanding how many components we actually need.\n",
                "\n",
                "---"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## 1. Setup\n",
                "We will use the **Digits** dataset (8x8 images of handwritten digits) which flattened has 64 features."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "import pandas as pd\n",
                "import numpy as np\n",
                "import matplotlib.pyplot as plt\n",
                "import seaborn as sns\n",
                "from sklearn.datasets import load_digits\n",
                "from sklearn.preprocessing import StandardScaler\n",
                "from sklearn.decomposition import PCA\n",
                "\n",
                "# Load dataset\n",
                "digits = load_digits()\n",
                "X = digits.data\n",
                "y = digits.target\n",
                "\n",
                "print(\"Original Shape:\", X.shape)"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## 2. Visualization via PCA\n",
                "\n",
                "### Task 1: 2D Projection\n",
                "Reduce the 64 features down to 2 and visualize the digits on a scatter plot.\n",
                "\n",
                "*Web Reference: Check [Data Visualization](https://aashishgarg13.github.io/DataScience/Visualization/) for how to present these results.*"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "# YOUR CODE HERE\n"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "<details>\n",
                "<summary><b>Click to see Solution</b></summary>\n",
                "\n",
                "```python\n",
                "scaler = StandardScaler()\n",
                "X_scaled = scaler.fit_transform(X)\n",
                "\n",
                "pca = PCA(n_components=2)\n",
                "X_pca = pca.fit_transform(X_scaled)\n",
                "\n",
                "plt.figure(figsize=(10, 8))\n",
                "plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='tab10', alpha=0.7)\n",
                "plt.colorbar(label='Digit Label')\n",
                "plt.title('Digits Dataset: 64D flattened to 2D via PCA')\n",
                "plt.xlabel('PC1')\n",
                "plt.ylabel('PC2')\n",
                "plt.show()\n",
                "```\n",
                "</details>"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## 3. Selecting Components\n",
                "\n",
                "### Task 2: Scree Plot\n",
                "Calculate the cumulative explained variance for all components and identify how many are needed to keep 95% of the information."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "# YOUR CODE HERE\n"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "<details>\n",
                "<summary><b>Click to see Solution</b></summary>\n",
                "\n",
                "```python\n",
                "pca_full = PCA().fit(X_scaled)\n",
                "plt.plot(np.cumsum(pca_full.explained_variance_ratio_))\n",
                "plt.xlabel('Number of Components')\n",
                "plt.ylabel('Cumulative Explained Variance')\n",
                "plt.axhline(y=0.95, color='r', linestyle='--')\n",
                "plt.title('Scree Plot: Finding the Elbow')\n",
                "plt.show()\n",
                "```\n",
                "</details>"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "--- \n",
                "### Excellent Compression! \n",
                "You've learned how to simplify complex data without losing the big picture.\n",
                "Next: **Advanced Clustering (DBSCAN & Hierarchical)**."
            ]
        }
    ],
    "metadata": {
        "kernelspec": {
            "display_name": "Python 3",
            "language": "python",
            "name": "python3"
        },
        "language_info": {
            "codemirror_mode": {
                "name": "ipython",
                "version": 3
            },
            "file_extension": ".py",
            "mimetype": "text/x-python",
            "name": "python",
            "nbconvert_exporter": "python",
            "pygments_lexer": "ipython3",
            "version": "3.12.7"
        }
    },
    "nbformat": 4,
    "nbformat_minor": 4
}