{
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "# ML Practice Series: Module 07 - Capstone Project (Real-World Pipeline)\n",
                "\n",
                "In this project, we will apply everything we've learned—from Statistics and EDA to Model Evaluation—using a real-world dataset often found on **Kaggle**: The **Medical Cost Personal Dataset**.\n",
                "\n",
                "### Project Goal:\n",
                "Predict the individual medical costs billed by health insurance based on various user attributes (Age, Sex, BMI, Children, Smoker, Region).\n",
                "\n",
                "### Integrated Resources:\n",
                "- **Web Ref**: [Feature Engineering Guide](https://aashishgarg13.github.io/DataScience/feature-engineering/) (for handling 'Smoker' and 'Region' encoding).\n",
                "- **Web Ref**: [Statistics Course](https://aashishgarg13.github.io/DataScience/complete-statistics/) (for checking the distribution of charges).\n",
                "- **Web Ref**: [ML Guide](https://aashishgarg13.github.io/DataScience/ml_complete-all-topics/) (for choosing the right regression algorithm).\n",
                "\n",
                "---"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## 1. Data Acquisition\n",
                "We will pull the raw data directly from a public repository, similar to how you would download a CSV from Kaggle."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "import pandas as pd\n",
                "import numpy as np\n",
                "import matplotlib.pyplot as plt\n",
                "import seaborn as sns\n",
                "from sklearn.model_selection import train_test_split\n",
                "from sklearn.preprocessing import LabelEncoder, StandardScaler\n",
                "from sklearn.ensemble import RandomForestRegressor\n",
                "from sklearn.metrics import mean_absolute_error, r2_score\n",
                "\n",
                "# Load the dataset\n",
                "url = \"https://raw.githubusercontent.com/stedy/Machine-Learning-with-R-datasets/master/insurance.csv\"\n",
                "df = pd.read_csv(url)\n",
                "\n",
                "print(\"Dataset size:\", df.shape)\n",
                "df.head()"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## 2. Phase 1: Exploratory Data Analysis (EDA)\n",
                "\n",
                "### Task 1: Correlation Analysis\n",
                "Since we want to predict `charges`, create a heatmap to see which features (after converting categories) correlate most with medical costs."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "# YOUR CODE HERE\n"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "<details>\n",
                "<summary><b>Click to see Solution</b></summary>\n",
                "\n",
                "```python\n",
                "# Temporary encoding just to see correlations\n",
                "df_temp = df.copy()\n",
                "for col in ['sex', 'smoker', 'region']: \n",
                "    df_temp[col] = LabelEncoder().fit_transform(df_temp[col])\n",
                "\n",
                "plt.figure(figsize=(10, 8))\n",
                "sns.heatmap(df_temp.corr(), annot=True, cmap='coolwarm')\n",
                "plt.title('Feature Correlation Heatmap')\n",
                "plt.show()\n",
                "```\n",
                "</details>"
            ]
        },
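        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "A heatmap shows every pairwise correlation, but for this task only the `charges` column matters. The sketch below is one possible way to pull out that column and rank features by absolute correlation. It runs on a small made-up frame (the `toy` rows are hypothetical values chosen for illustration, not taken from the dataset), so the printed numbers say nothing about the real data.\n",
                "\n",
                "```python\n",
                "import pandas as pd\n",
                "\n",
                "# Hypothetical toy rows; in the notebook you would use the full `df`\n",
                "toy = pd.DataFrame({\n",
                "    'age': [19, 33, 45, 52, 60, 28],\n",
                "    'bmi': [27.9, 22.7, 30.1, 33.0, 25.8, 36.4],\n",
                "    'smoker': ['yes', 'no', 'no', 'yes', 'no', 'yes'],\n",
                "    'charges': [16885.0, 4450.0, 8240.0, 47000.0, 12630.0, 36000.0],\n",
                "})\n",
                "\n",
                "# Encode the binary column as 0/1 so it can enter the correlation matrix\n",
                "toy['smoker'] = (toy['smoker'] == 'yes').astype(int)\n",
                "\n",
                "# Pearson correlation of each feature with the target\n",
                "corr_with_target = toy.corr()['charges'].drop('charges')\n",
                "\n",
                "# Order by absolute strength while keeping the sign of each correlation\n",
                "order = corr_with_target.abs().sort_values(ascending=False).index\n",
                "ranked = corr_with_target.reindex(order)\n",
                "print(ranked)\n",
                "```\n",
                "\n",
                "Pearson correlation only captures linear relationships, so a feature with a low score here can still matter to a tree-based model."
            ]
        },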
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "### Task 2: The 'Smoker' Effect\n",
                "Visualization is key on Kaggle. Create a boxplot or violin plot showing `charges` separated by `smoker` status."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "# YOUR CODE HERE\n"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "<details>\n",
                "<summary><b>Click to see Solution</b></summary>\n",
                "\n",
                "```python\n",
                "sns.boxplot(x='smoker', y='charges', data=df)\n",
                "plt.title('Effect of Smoking on Insurance Charges')\n",
                "plt.show()\n",
                "```\n",
                "</details>"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## 3. Phase 2: Feature Engineering\n",
                "\n",
                "### Task 3: categorical Transformation\n",
                "1. Binary encode `sex` and `smoker`.\n",
                "2. One-hot encode the `region` column."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "# YOUR CODE HERE\n"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "<details>\n",
                "<summary><b>Click to see Solution</b></summary>\n",
                "\n",
                "```python\n",
                "df = pd.get_dummies(df, columns=['sex', 'smoker', 'region'], drop_first=True)\n",
                "print(\"New Columns:\", df.columns.tolist())\n",
                "```\n",
                "</details>"
            ]
        },
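        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "The two steps in Task 3 can also be written out one at a time, which makes the resulting 0/1 values easier to inspect. This is a minimal sketch on a made-up frame (the `toy` rows are hypothetical, and the choice of which category maps to 1 is an assumption for illustration).\n",
                "\n",
                "```python\n",
                "import pandas as pd\n",
                "\n",
                "# Hypothetical toy rows that mimic the dataset's categorical columns\n",
                "toy = pd.DataFrame({\n",
                "    'sex': ['female', 'male', 'male', 'female'],\n",
                "    'smoker': ['yes', 'no', 'no', 'yes'],\n",
                "    'region': ['southwest', 'southeast', 'northwest', 'northeast'],\n",
                "})\n",
                "\n",
                "# Step 1: binary encode the two-level columns with an explicit mapping\n",
                "toy['sex'] = toy['sex'].map({'female': 0, 'male': 1})\n",
                "toy['smoker'] = toy['smoker'].map({'no': 0, 'yes': 1})\n",
                "\n",
                "# Step 2: one-hot encode region; dtype=int yields 0/1 instead of booleans\n",
                "toy = pd.get_dummies(toy, columns=['region'], dtype=int)\n",
                "\n",
                "print(toy.columns.tolist())\n",
                "```\n",
                "\n",
                "This version keeps all four `region_*` columns. That is usually fine for tree-based models; for linear models, passing `drop_first=True` avoids perfectly collinear indicators."
            ]
        },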
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## 4. Phase 3: Modeling & Optimization\n",
                "\n",
                "### Task 4: Training & Evaluation\n",
                "Divide the data. Train a `RandomForestRegressor` and evaluate using $R^2$ and Mean Absolute Error (MAE).\n",
                "\n",
                "*Hint: Use the [Ensemble Methods Section](https://aashishgarg13.github.io/DataScience/ml_complete-all-topics/) on your site to learn why Random Forest is great for this data.*"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "# YOUR CODE HERE\n"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "<details>\n",
                "<summary><b>Click to see Solution</b></summary>\n",
                "\n",
                "```python\n",
                "X = df.drop('charges', axis=1)\n",
                "y = df['charges']\n",
                "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)\n",
                "\n",
                "model = RandomForestRegressor(n_estimators=100, random_state=42)\n",
                "model.fit(X_train, y_train)\n",
                "\n",
                "y_pred = model.predict(X_test)\n",
                "\n",
                "print(f\"R2 Score: {r2_score(y_test, y_pred):.4f}\")\n",
                "print(f\"MAE: ${mean_absolute_error(y_test, y_pred):.2f}\")\n",
                "```\n",
                "</details>"
            ]
        },
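        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "To make the two metrics in Task 4 concrete, the sketch below computes MAE and R2 by hand with NumPy. The arrays `y_true` and `y_hat` are made-up values for illustration, not model output; `mean_absolute_error` and `r2_score` should give the same numbers for the same inputs.\n",
                "\n",
                "```python\n",
                "import numpy as np\n",
                "\n",
                "# Hypothetical true and predicted charges (same units as the target)\n",
                "y_true = np.array([1000.0, 2000.0, 3000.0, 4000.0])\n",
                "y_hat = np.array([1100.0, 1900.0, 3200.0, 3800.0])\n",
                "\n",
                "# MAE: the average absolute error, in the units of the target\n",
                "mae = np.mean(np.abs(y_true - y_hat))\n",
                "\n",
                "# R2: 1 minus (residual sum of squares / total sum of squares)\n",
                "ss_res = np.sum((y_true - y_hat) ** 2)\n",
                "ss_tot = np.sum((y_true - y_true.mean()) ** 2)\n",
                "r2 = 1 - ss_res / ss_tot\n",
                "\n",
                "print(f'MAE: {mae:.2f}')  # MAE: 150.00\n",
                "print(f'R2: {r2:.4f}')    # R2: 0.9800\n",
                "```\n",
                "\n",
                "MAE reads directly as a typical error in currency units, while R2 is unitless and compares the model against always predicting the mean."
            ]
        },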
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## 5. Phase 4: Interpretation\n",
                "\n",
                "### Task 5: Feature Importances\n",
                "Which factor drives insurance prices the most? Visualize the model's feature importances."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "# YOUR CODE HERE\n"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "<details>\n",
                "<summary><b>Click to see Solution</b></summary>\n",
                "\n",
                "```python\n",
                "importances = pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)\n",
                "sns.barplot(x=importances, y=importances.index)\n",
                "plt.title('Key Drivers of Medical Costs')\n",
                "plt.show()\n",
                "```\n",
                "</details>"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "--- \n",
                "### Project Complete! \n",
                "You've just completed a full Machine Learning cycle on real-world insurance data. \n",
                "By combining the theory from your **[DataScience Learning Hub](https://aashishgarg13.github.io/DataScience/)** with this hands-on project, you are now ready for real Kaggle competitions!"
            ]
        }
    ],
    "metadata": {
        "kernelspec": {
            "display_name": "Python 3",
            "language": "python",
            "name": "python3"
        },
        "language_info": {
            "codemirror_mode": {
                "name": "ipython",
                "version": 3
            },
            "file_extension": ".py",
            "mimetype": "text/x-python",
            "name": "python",
            "nbconvert_exporter": "python",
            "pygments_lexer": "ipython3",
            "version": "3.8.0"
        }
    },
    "nbformat": 4,
    "nbformat_minor": 4
}