{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# ML Practice Series: Module 15 - Dimensionality Reduction (PCA)\n",
"\n",
"Welcome to Module 15! We're exploring **PCA (Principal Component Analysis)**, a technique for reducing the number of variables in your data while preserving as much information as possible.\n",
"\n",
"### Resources:\n",
"Refer to the **[Mathematics for Data Science](https://aashishgarg13.github.io/DataScience/math-ds-complete/)** section on your hub for the Linear Algebra (Eigenvalues/Eigenvectors) behind PCA.\n",
"\n",
"### Objectives:\n",
"1. **Information Compression**: Reducing the number of features while preserving the underlying patterns.\n",
"2. **Visualization**: Plotting high-dimensional data in 2D or 3D.\n",
"3. **Explained Variance**: Understanding how many components we actually need.\n",
"\n",
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Setup\n",
"We will use the **Digits** dataset (8x8 images of handwritten digits), which, when flattened, has 64 features per sample."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"from sklearn.datasets import load_digits\n",
"from sklearn.preprocessing import StandardScaler\n",
"from sklearn.decomposition import PCA\n",
"\n",
"# Load dataset\n",
"digits = load_digits()\n",
"X = digits.data\n",
"y = digits.target\n",
"\n",
"print(\"Original Shape:\", X.shape)"
]
},
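{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before reducing anything, it helps to see what one sample looks like. A minimal sketch (it assumes the setup cell above has been run; `digits.images` is the unflattened 8x8 version of the data that `load_digits` also provides):\n",
"\n",
"```python\n",
"# Render the first sample as an 8x8 grayscale image with its label\n",
"plt.figure(figsize=(2, 2))\n",
"plt.imshow(digits.images[0], cmap='gray_r')\n",
"plt.title(f\"Label: {y[0]}\")\n",
"plt.axis('off')\n",
"plt.show()\n",
"```"
]
},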
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Visualization via PCA\n",
"\n",
"### Task 1: 2D Projection\n",
"Reduce the 64 features down to 2 and visualize the digits on a scatter plot.\n",
"\n",
"*Web Reference: Check [Data Visualization](https://aashishgarg13.github.io/DataScience/Visualization/) for how to present these results.*"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# YOUR CODE HERE\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<details>\n",
"<summary><b>Click to see Solution</b></summary>\n",
"\n",
"```python\n",
"scaler = StandardScaler()\n",
"X_scaled = scaler.fit_transform(X)\n",
"\n",
"pca = PCA(n_components=2)\n",
"X_pca = pca.fit_transform(X_scaled)\n",
"\n",
"plt.figure(figsize=(10, 8))\n",
"plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='tab10', alpha=0.7)\n",
"plt.colorbar(label='Digit Label')\n",
"plt.title('Digits Dataset: 64D flattened to 2D via PCA')\n",
"plt.xlabel('PC1')\n",
"plt.ylabel('PC2')\n",
"plt.show()\n",
"```\n",
"</details>"
]
},
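{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick sanity check before moving on: the 2D picture keeps only part of the total variance. A minimal sketch (assuming the `pca` object fitted in the solution above):\n",
"\n",
"```python\n",
"# Fraction of total variance retained by the first two components\n",
"print(\"Variance kept by PC1 + PC2:\", pca.explained_variance_ratio_.sum())\n",
"```\n",
"\n",
"Expect a fraction well under half; this is exactly why Section 3 asks how many components are actually needed."
]
},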
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Selecting Components\n",
"\n",
"### Task 2: Scree Plot\n",
"Calculate the cumulative explained variance across all components and identify how many are needed to retain 95% of the variance."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# YOUR CODE HERE\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<details>\n",
"<summary><b>Click to see Solution</b></summary>\n",
"\n",
"```python\n",
"pca_full = PCA().fit(X_scaled)\n",
"plt.plot(np.cumsum(pca_full.explained_variance_ratio_))\n",
"plt.xlabel('Number of Components')\n",
"plt.ylabel('Cumulative Explained Variance')\n",
"plt.axhline(y=0.95, color='r', linestyle='--')\n",
"plt.title('Scree Plot: Finding the Elbow')\n",
"plt.show()\n",
"```\n",
"</details>"
]
},
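{
"cell_type": "markdown",
"metadata": {},
"source": [
"Rather than reading the count off the plot, you can compute it directly. A minimal sketch (assuming the `pca_full` fit from the solution above); note that scikit-learn also accepts a float `n_components` between 0 and 1 and picks the component count for that variance target itself:\n",
"\n",
"```python\n",
"# Smallest number of components whose cumulative variance reaches 95%\n",
"cumvar = np.cumsum(pca_full.explained_variance_ratio_)\n",
"n_95 = np.argmax(cumvar >= 0.95) + 1\n",
"print(f\"Components needed for 95% variance: {n_95}\")\n",
"\n",
"# Equivalent shortcut: let PCA choose the count from a variance target\n",
"pca_95 = PCA(n_components=0.95).fit(X_scaled)\n",
"print(\"PCA chose:\", pca_95.n_components_)\n",
"```"
]
},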
{
"cell_type": "markdown",
"metadata": {},
"source": [
"--- \n",
"### Excellent Compression! \n",
"You've learned how to simplify complex data without losing the big picture.\n",
"Next: **Advanced Clustering (DBSCAN & Hierarchical)**."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.7"
}
},
"nbformat": 4,
"nbformat_minor": 4
} |