{
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "# ML Practice Series: Module 01 - EDA & Feature Engineering\n",
                "\n",
                "Welcome to the first module of your Machine Learning practice! \n",
                "\n",
                "In this notebook, we will focus on the most critical part of the ML pipeline: **Understanding and Preparing your data.**\n",
                "\n",
                "### Resources:\n",
                "This practice guide is integrated with your [DataScience Learning Hub](https://aashishgarg13.github.io/DataScience/). Specifically, you can refer to the **Feature Engineering Guide** section on the website for interactive visual explanations of these concepts.\n",
                "\n",
                "### Objectives:\n",
                "1. **EDA**: Visualize distributions, correlations, and outliers.\n",
                "2. **Data Cleaning**: Handle missing values and data inconsistencies.\n",
                "3. **Feature Engineering**: Create new features and transform existing ones (Encoding, Scaling).\n",
                "\n",
                "---"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## 1. Environment Setup\n",
                "First, let's load the necessary libraries and the dataset. We'll use the **Titanic Dataset** for this exercise."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 1,
            "metadata": {},
            "outputs": [
                {
                    "name": "stdout",
                    "output_type": "stream",
                    "text": [
                        "Dataset Shape: (891, 15)\n"
                    ]
                },
                {
                    "data": {
                        "text/html": [
                            "<div>\n",
                            "<style scoped>\n",
                            "    .dataframe tbody tr th:only-of-type {\n",
                            "        vertical-align: middle;\n",
                            "    }\n",
                            "\n",
                            "    .dataframe tbody tr th {\n",
                            "        vertical-align: top;\n",
                            "    }\n",
                            "\n",
                            "    .dataframe thead th {\n",
                            "        text-align: right;\n",
                            "    }\n",
                            "</style>\n",
                            "<table border=\"1\" class=\"dataframe\">\n",
                            "  <thead>\n",
                            "    <tr style=\"text-align: right;\">\n",
                            "      <th></th>\n",
                            "      <th>survived</th>\n",
                            "      <th>pclass</th>\n",
                            "      <th>sex</th>\n",
                            "      <th>age</th>\n",
                            "      <th>sibsp</th>\n",
                            "      <th>parch</th>\n",
                            "      <th>fare</th>\n",
                            "      <th>embarked</th>\n",
                            "      <th>class</th>\n",
                            "      <th>who</th>\n",
                            "      <th>adult_male</th>\n",
                            "      <th>deck</th>\n",
                            "      <th>embark_town</th>\n",
                            "      <th>alive</th>\n",
                            "      <th>alone</th>\n",
                            "    </tr>\n",
                            "  </thead>\n",
                            "  <tbody>\n",
                            "    <tr>\n",
                            "      <th>0</th>\n",
                            "      <td>0</td>\n",
                            "      <td>3</td>\n",
                            "      <td>male</td>\n",
                            "      <td>22.0</td>\n",
                            "      <td>1</td>\n",
                            "      <td>0</td>\n",
                            "      <td>7.2500</td>\n",
                            "      <td>S</td>\n",
                            "      <td>Third</td>\n",
                            "      <td>man</td>\n",
                            "      <td>True</td>\n",
                            "      <td>NaN</td>\n",
                            "      <td>Southampton</td>\n",
                            "      <td>no</td>\n",
                            "      <td>False</td>\n",
                            "    </tr>\n",
                            "    <tr>\n",
                            "      <th>1</th>\n",
                            "      <td>1</td>\n",
                            "      <td>1</td>\n",
                            "      <td>female</td>\n",
                            "      <td>38.0</td>\n",
                            "      <td>1</td>\n",
                            "      <td>0</td>\n",
                            "      <td>71.2833</td>\n",
                            "      <td>C</td>\n",
                            "      <td>First</td>\n",
                            "      <td>woman</td>\n",
                            "      <td>False</td>\n",
                            "      <td>C</td>\n",
                            "      <td>Cherbourg</td>\n",
                            "      <td>yes</td>\n",
                            "      <td>False</td>\n",
                            "    </tr>\n",
                            "    <tr>\n",
                            "      <th>2</th>\n",
                            "      <td>1</td>\n",
                            "      <td>3</td>\n",
                            "      <td>female</td>\n",
                            "      <td>26.0</td>\n",
                            "      <td>0</td>\n",
                            "      <td>0</td>\n",
                            "      <td>7.9250</td>\n",
                            "      <td>S</td>\n",
                            "      <td>Third</td>\n",
                            "      <td>woman</td>\n",
                            "      <td>False</td>\n",
                            "      <td>NaN</td>\n",
                            "      <td>Southampton</td>\n",
                            "      <td>yes</td>\n",
                            "      <td>True</td>\n",
                            "    </tr>\n",
                            "    <tr>\n",
                            "      <th>3</th>\n",
                            "      <td>1</td>\n",
                            "      <td>1</td>\n",
                            "      <td>female</td>\n",
                            "      <td>35.0</td>\n",
                            "      <td>1</td>\n",
                            "      <td>0</td>\n",
                            "      <td>53.1000</td>\n",
                            "      <td>S</td>\n",
                            "      <td>First</td>\n",
                            "      <td>woman</td>\n",
                            "      <td>False</td>\n",
                            "      <td>C</td>\n",
                            "      <td>Southampton</td>\n",
                            "      <td>yes</td>\n",
                            "      <td>False</td>\n",
                            "    </tr>\n",
                            "    <tr>\n",
                            "      <th>4</th>\n",
                            "      <td>0</td>\n",
                            "      <td>3</td>\n",
                            "      <td>male</td>\n",
                            "      <td>35.0</td>\n",
                            "      <td>0</td>\n",
                            "      <td>0</td>\n",
                            "      <td>8.0500</td>\n",
                            "      <td>S</td>\n",
                            "      <td>Third</td>\n",
                            "      <td>man</td>\n",
                            "      <td>True</td>\n",
                            "      <td>NaN</td>\n",
                            "      <td>Southampton</td>\n",
                            "      <td>no</td>\n",
                            "      <td>True</td>\n",
                            "    </tr>\n",
                            "  </tbody>\n",
                            "</table>\n",
                            "</div>"
                        ],
                        "text/plain": [
                            "   survived  pclass     sex   age  sibsp  parch     fare embarked  class  \\\n",
                            "0         0       3    male  22.0      1      0   7.2500        S  Third   \n",
                            "1         1       1  female  38.0      1      0  71.2833        C  First   \n",
                            "2         1       3  female  26.0      0      0   7.9250        S  Third   \n",
                            "3         1       1  female  35.0      1      0  53.1000        S  First   \n",
                            "4         0       3    male  35.0      0      0   8.0500        S  Third   \n",
                            "\n",
                            "     who  adult_male deck  embark_town alive  alone  \n",
                            "0    man        True  NaN  Southampton    no  False  \n",
                            "1  woman       False    C    Cherbourg   yes  False  \n",
                            "2  woman       False  NaN  Southampton   yes   True  \n",
                            "3  woman       False    C  Southampton   yes  False  \n",
                            "4    man        True  NaN  Southampton    no   True  "
                        ]
                    },
                    "execution_count": 1,
                    "metadata": {},
                    "output_type": "execute_result"
                }
            ],
            "source": [
                "import pandas as pd\n",
                "import numpy as np\n",
                "import matplotlib.pyplot as plt\n",
                "import seaborn as sns\n",
                "\n",
                "# Load dataset\n",
                "df = sns.load_dataset('titanic')\n",
                "print(\"Dataset Shape:\", df.shape)\n",
                "df.head()"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## 2. Part 1: Exploratory Data Analysis (EDA)\n",
                "\n",
                "### Task 1: Basic Statistics and Info\n",
                "Check the data types, non-null counts, and summary statistics."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "# YOUR CODE HERE\n"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "<details>\n",
                "<summary><b>Click to see Solution</b></summary>\n",
                "\n",
                "```python\n",
                "print(df.info())\n",
                "print(df.describe())\n",
                "```\n",
                "</details>"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "### Task 2: Missing Value Analysis\n",
                "Find the percentage of missing values in each column."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "# YOUR CODE HERE\n"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "<details>\n",
                "<summary><b>Click to see Solution</b></summary>\n",
                "\n",
                "```python\n",
                "missing_pct = (df.isnull().sum() / len(df)) * 100\n",
                "print(missing_pct)\n",
                "```\n",
                "</details>"
            ]
        },
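        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "A minimal sketch of the same calculation on a small hand-made frame (the `toy` DataFrame and its values are illustrative assumptions, not Titanic rows). Because `isna()` yields booleans, the column mean is already the missing fraction, so `isna().mean() * 100` should match the `sum() / len()` form in the solution above:\n",
                "\n",
                "```python\n",
                "import numpy as np\n",
                "import pandas as pd\n",
                "\n",
                "# Toy data (hypothetical values) with gaps in two columns\n",
                "toy = pd.DataFrame({\n",
                "    'age': [22.0, np.nan, 26.0, np.nan],\n",
                "    'deck': [np.nan, 'C', np.nan, np.nan],\n",
                "    'fare': [7.25, 71.28, 7.92, 53.10],\n",
                "})\n",
                "\n",
                "# Mean of a boolean mask = fraction of True values\n",
                "missing_pct = toy.isna().mean() * 100\n",
                "\n",
                "# Show only the columns that have gaps, worst first\n",
                "print(missing_pct[missing_pct > 0].sort_values(ascending=False))\n",
                "```\n",
                "\n",
                "Sorting and filtering the result tends to make wide datasets easier to scan than the raw per-column printout."
            ]
        },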
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "### Task 3: Visualizing Distributions\n",
                "Plot the distribution of `age` and the count of `survived`."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "# YOUR CODE HERE\n"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "<details>\n",
                "<summary><b>Click to see Solution</b></summary>\n",
                "\n",
                "```python\n",
                "plt.figure(figsize=(12, 5))\n",
                "plt.subplot(1, 2, 1)\n",
                "sns.histplot(df['age'].dropna(), kde=True)\n",
                "plt.title('Age Distribution')\n",
                "\n",
                "plt.subplot(1, 2, 2)\n",
                "sns.countplot(x='survived', data=df)\n",
                "plt.title('Survival Count')\n",
                "plt.show()\n",
                "```\n",
                "</details>"
            ]
        },
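        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "Objective 1 also mentions correlations and outliers. As a rough sketch on made-up numbers (the `toy` frame is an assumption, and the 1.5 * IQR rule is one common convention rather than the only option), `DataFrame.corr` gives pairwise Pearson correlations and the interquartile range flags unusually large values:\n",
                "\n",
                "```python\n",
                "import pandas as pd\n",
                "\n",
                "# Hypothetical passenger classes and fares, with one very large fare\n",
                "toy = pd.DataFrame({\n",
                "    'pclass': [3, 1, 3, 1, 3, 2],\n",
                "    'fare': [7.25, 71.28, 7.92, 53.10, 8.05, 512.33],\n",
                "})\n",
                "\n",
                "# Pairwise Pearson correlation between the numeric columns\n",
                "corr = toy.corr()\n",
                "print(corr)\n",
                "\n",
                "# IQR rule: flag values more than 1.5 * IQR outside the quartiles\n",
                "q1, q3 = toy['fare'].quantile([0.25, 0.75])\n",
                "iqr = q3 - q1\n",
                "outliers = toy[(toy['fare'] < q1 - 1.5 * iqr) | (toy['fare'] > q3 + 1.5 * iqr)]\n",
                "print(outliers)\n",
                "```\n",
                "\n",
                "On the real data, the same two steps could be applied to `df[['pclass', 'age', 'fare']]` and `df['fare']`."
            ]
        },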
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## 3. Part 2: Data Cleaning\n",
                "\n",
                "### Task 4: Handling Missing Values\n",
                "1. Fill missing `age` values with the median.\n",
                "2. Fill missing `embarked` values with the mode.\n",
                "3. Drop the `deck` column as it has too many missing values.\n",
                "\n",
                "*Hint: Visit the [Feature Engineering Guide - Missing Data](https://aashishgarg13.github.io/DataScience/feature-engineering/#missing-data) to see visual differences between Mean, Median, and KNN imputation.*"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "# YOUR CODE HERE\n"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "<details>\n",
                "<summary><b>Click to see Solution</b></summary>\n",
                "\n",
                "```python\n",
                "df['age'] = df['age'].fillna(df['age'].median())\n",
                "df['embarked'] = df['embarked'].fillna(df['embarked'].mode()[0])\n",
                "df.drop('deck', axis=1, inplace=True)\n",
                "print(\"Missing values after cleaning:\\n\", df.isnull().sum())\n",
                "```\n",
                "</details>"
            ]
        },
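        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "To see why the task fills `age` with the median rather than the mean, here is a small sketch on invented ages (the `ages` Series is an assumption for illustration). A single extreme value pulls the mean fill upward, while the median fill stays near the bulk of the data:\n",
                "\n",
                "```python\n",
                "import numpy as np\n",
                "import pandas as pd\n",
                "\n",
                "# Hypothetical ages with one extreme value and two gaps\n",
                "ages = pd.Series([22.0, 24.0, 26.0, 28.0, 80.0, np.nan, np.nan])\n",
                "\n",
                "mean_filled = ages.fillna(ages.mean())      # gaps become 36.0\n",
                "median_filled = ages.fillna(ages.median())  # gaps become 26.0\n",
                "\n",
                "print(pd.DataFrame({'mean_fill': mean_filled, 'median_fill': median_filled}))\n",
                "```\n",
                "\n",
                "Both `mean()` and `median()` skip missing values by default, so the statistics are computed from the five observed ages only."
            ]
        },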
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## 4. Part 3: Feature Engineering\n",
                "\n",
                "### Task 5: Creating New Features\n",
                "Create a new column `family_size` by adding `sibsp` and `parch` (plus 1 for the passenger themselves)."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "# YOUR CODE HERE\n"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "<details>\n",
                "<summary><b>Click to see Solution</b></summary>\n",
                "\n",
                "```python\n",
                "df['family_size'] = df['sibsp'] + df['parch'] + 1\n",
                "df[['sibsp', 'parch', 'family_size']].head()\n",
                "```\n",
                "</details>"
            ]
        },
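        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "Objective 3 also lists scaling. Features such as `fare` and `family_size` sit on very different ranges, so a hedged sketch of two common rescalings may help. The `toy` frame below is an assumption, and plain pandas arithmetic stands in for a library scaler:\n",
                "\n",
                "```python\n",
                "import pandas as pd\n",
                "\n",
                "# Hypothetical values on two very different ranges\n",
                "toy = pd.DataFrame({\n",
                "    'fare': [7.25, 71.28, 7.92, 53.10],\n",
                "    'family_size': [2, 2, 1, 2],\n",
                "})\n",
                "\n",
                "# Standardization: zero mean and unit variance per column (population std, ddof=0)\n",
                "standardized = (toy - toy.mean()) / toy.std(ddof=0)\n",
                "\n",
                "# Min-max scaling: map each column onto the range [0, 1]\n",
                "min_max = (toy - toy.min()) / (toy.max() - toy.min())\n",
                "\n",
                "print(standardized.round(2))\n",
                "print(min_max.round(2))\n",
                "```\n",
                "\n",
                "In a real workflow the scaling statistics would normally be computed on the training split only and then reused on the test split."
            ]
        },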
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "### Task 6: Encoding Categorical Variables\n",
                "Convert `sex` and `embarked` into numerical values using One-Hot Encoding.\n",
                "\n",
                "*Hint: Learn about Label vs One-Hot Encoding in the [Encoding Section](https://aashishgarg13.github.io/DataScience/feature-engineering/#encoding) of your learning hub.*"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "# YOUR CODE HERE\n"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "<details>\n",
                "<summary><b>Click to see Solution</b></summary>\n",
                "\n",
                "```python\n",
                "df = pd.get_dummies(df, columns=['sex', 'embarked'], drop_first=True)\n",
                "df.head()\n",
                "```\n",
                "</details>"
            ]
        },
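        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "The hint for this task contrasts label and one-hot encoding. A rough sketch on a made-up column (the `ports` frame is an assumption, and `astype('category').cat.codes` is used here as one simple way to obtain integer labels) shows how the two outputs differ:\n",
                "\n",
                "```python\n",
                "import pandas as pd\n",
                "\n",
                "# Hypothetical embarkation codes\n",
                "ports = pd.DataFrame({'embarked': ['S', 'C', 'Q', 'S']})\n",
                "\n",
                "# Label encoding: one integer per category (sorted order: C=0, Q=1, S=2)\n",
                "labels = ports['embarked'].astype('category').cat.codes\n",
                "\n",
                "# One-hot encoding: one 0/1 column per category\n",
                "one_hot = pd.get_dummies(ports['embarked'], prefix='embarked', dtype=int)\n",
                "\n",
                "print(labels.tolist())\n",
                "print(one_hot)\n",
                "```\n",
                "\n",
                "Label codes imply an order (C < Q < S) that the ports do not really have, which is one reason one-hot columns are often preferred for nominal features."
            ]
        },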
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "--- \n",
                "### Great Job! \n",
                "You have completed the EDA and Feature Engineering module. \n",
                "In the next module, we will apply **Linear Regression** to predict a continuous variable."
            ]
        }
    ],
    "metadata": {
        "kernelspec": {
            "display_name": "base",
            "language": "python",
            "name": "python3"
        },
        "language_info": {
            "codemirror_mode": {
                "name": "ipython",
                "version": 3
            },
            "file_extension": ".py",
            "mimetype": "text/x-python",
            "name": "python",
            "nbconvert_exporter": "python",
            "pygments_lexer": "ipython3",
            "version": "3.12.7"
        }
    },
    "nbformat": 4,
    "nbformat_minor": 4
}