{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# ML Practice Series: Module 07 - Capstone Project (Real-World Pipeline)\n", "\n", "In this project, we will apply everything we've learned—from Statistics and EDA to Model Evaluation—using a real-world dataset often found on **Kaggle**: The **Medical Cost Personal Dataset**.\n", "\n", "### Project Goal:\n", "Predict the individual medical costs billed by health insurance based on various user attributes (Age, Sex, BMI, Children, Smoker, Region).\n", "\n", "### Integrated Resources:\n", "- **Web Ref**: [Feature Engineering Guide](https://aashishgarg13.github.io/DataScience/feature-engineering/) (for handling 'Smoker' and 'Region' encoding).\n", "- **Web Ref**: [Statistics Course](https://aashishgarg13.github.io/DataScience/complete-statistics/) (for checking the distribution of charges).\n", "- **Web Ref**: [ML Guide](https://aashishgarg13.github.io/DataScience/ml_complete-all-topics/) (for choosing the right regression algorithm).\n", "\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Data Acquisition\n", "We will pull the raw data directly from a public repository, similar to how you would download a CSV from Kaggle." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.preprocessing import LabelEncoder, StandardScaler\n", "from sklearn.ensemble import RandomForestRegressor\n", "from sklearn.metrics import mean_absolute_error, r2_score\n", "\n", "# Load the dataset\n", "url = \"https://raw.githubusercontent.com/stedy/Machine-Learning-with-R-datasets/master/insurance.csv\"\n", "df = pd.read_csv(url)\n", "\n", "print(\"Dataset size:\", df.shape)\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. 
Phase 1: Exploratory Data Analysis (EDA)\n", "\n", "### Task 1: Correlation Analysis\n", "Since we want to predict `charges`, create a heatmap to see which features (after converting categories) correlate most with medical costs." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# YOUR CODE HERE\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Click to see Solution\n", "\n", "```python\n", "# Temporary label encoding, used only to compute correlations\n", "# Note: region codes are arbitrary integers, so read its correlation with caution\n", "df_temp = df.copy()\n", "for col in ['sex', 'smoker', 'region']:\n", "    df_temp[col] = LabelEncoder().fit_transform(df_temp[col])\n", "\n", "plt.figure(figsize=(10, 8))\n", "sns.heatmap(df_temp.corr(), annot=True, cmap='coolwarm')\n", "plt.title('Feature Correlation Heatmap')\n", "plt.show()\n", "```\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task 2: The 'Smoker' Effect\n", "Visualization is key on Kaggle. Create a boxplot or violin plot showing `charges` separated by `smoker` status." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# YOUR CODE HERE\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Click to see Solution\n", "\n", "```python\n", "sns.boxplot(x='smoker', y='charges', data=df)\n", "plt.title('Effect of Smoking on Insurance Charges')\n", "plt.show()\n", "```\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Phase 2: Feature Engineering\n", "\n", "### Task 3: categorical Transformation\n", "1. Binary encode `sex` and `smoker`.\n", "2. One-hot encode the `region` column." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# YOUR CODE HERE\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Click to see Solution\n", "\n", "```python\n", "df = pd.get_dummies(df, columns=['sex', 'smoker', 'region'], drop_first=True)\n", "print(\"New Columns:\", df.columns.tolist())\n", "```\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Phase 3: Modeling & Optimization\n", "\n", "### Task 4: Training & Evaluation\n", "Divide the data. Train a `RandomForestRegressor` and evaluate using $R^2$ and Mean Absolute Error (MAE).\n", "\n", "*Hint: Use the [Ensemble Methods Section](https://aashishgarg13.github.io/DataScience/ml_complete-all-topics/) on your site to learn why Random Forest is great for this data.*" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# YOUR CODE HERE\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Click to see Solution\n", "\n", "```python\n", "X = df.drop('charges', axis=1)\n", "y = df['charges']\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)\n", "\n", "model = RandomForestRegressor(n_estimators=100, random_state=42)\n", "model.fit(X_train, y_train)\n", "\n", "y_pred = model.predict(X_test)\n", "\n", "print(f\"R2 Score: {r2_score(y_test, y_pred):.4f}\")\n", "print(f\"MAE: ${mean_absolute_error(y_test, y_pred):.2f}\")\n", "```\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. Phase 4: Interpretation\n", "\n", "### Task 5: Feature Importances\n", "Which factor drives insurance prices the most? Visualize the model's feature importances." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# YOUR CODE HERE\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Click to see Solution\n", "\n", "```python\n", "importances = pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)\n", "sns.barplot(x=importances, y=importances.index)\n", "plt.title('Key Drivers of Medical Costs')\n", "plt.show()\n", "```\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "--- \n", "### Project Complete! \n", "You've just completed a full Machine Learning cycle on real-world insurance data. \n", "By combining the theory from your **[DataScience Learning Hub](https://aashishgarg13.github.io/DataScience/)** with this hands-on project, you are now ready for real Kaggle competitions!" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.0" } }, "nbformat": 4, "nbformat_minor": 4 }