{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# ML Practice Series: Module 07 - Capstone Project (Real-World Pipeline)\n",
"\n",
"In this project, we apply everything we've learned, from statistics and EDA to model evaluation, to a real-world dataset widely used on **Kaggle**: the **Medical Cost Personal Dataset**.\n",
"\n",
"### Project Goal:\n",
"Predict the medical costs billed to each individual by their health insurer (`charges`) from six attributes: `age`, `sex`, `bmi`, `children`, `smoker`, and `region`.\n",
"\n",
"### Integrated Resources:\n",
"- **Web Ref**: [Feature Engineering Guide](https://aashishgarg13.github.io/DataScience/feature-engineering/) (for encoding the `smoker` and `region` columns).\n",
"- **Web Ref**: [Statistics Course](https://aashishgarg13.github.io/DataScience/complete-statistics/) (for checking the distribution of `charges`).\n",
"- **Web Ref**: [ML Guide](https://aashishgarg13.github.io/DataScience/ml_complete-all-topics/) (for choosing the right regression algorithm).\n",
"\n",
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Data Acquisition\n",
"We will pull the raw data directly from a public repository, similar to how you would download a CSV from Kaggle."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.preprocessing import LabelEncoder, StandardScaler\n",
"from sklearn.ensemble import RandomForestRegressor\n",
"from sklearn.metrics import mean_absolute_error, r2_score\n",
"\n",
"# Load the dataset\n",
"url = \"https://raw.githubusercontent.com/stedy/Machine-Learning-with-R-datasets/master/insurance.csv\"\n",
"df = pd.read_csv(url)\n",
"\n",
"print(\"Dataset size:\", df.shape)\n",
"df.head()"
]
},
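{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before starting the EDA tasks, it can help to check how skewed the target is, which is what the Statistics Course link above is listed for. The sketch below is illustrative only: it uses a synthetic, right-skewed stand-in for `charges` (the lognormal parameters are assumptions, not values estimated from the dataset) so that it runs without a download. With the real data you would use `df['charges']` instead.\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Synthetic stand-in for df['charges']; the parameters are illustrative assumptions.\n",
"rng = np.random.default_rng(0)\n",
"charges = pd.Series(rng.lognormal(mean=9.0, sigma=1.0, size=1000), name='charges')\n",
"\n",
"# Skewness near 0 suggests a roughly symmetric distribution;\n",
"# a large positive value indicates a long right tail.\n",
"raw_skew = charges.skew()\n",
"log_skew = np.log1p(charges).skew()\n",
"\n",
"print(f'Skewness of raw values: {raw_skew:.2f}')\n",
"print(f'Skewness after log1p:   {log_skew:.2f}')\n",
"```\n",
"\n",
"If the real `charges` column shows a similarly long right tail, a log transform of the target is one option worth comparing against the untransformed model."
]
},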
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Phase 1: Exploratory Data Analysis (EDA)\n",
"\n",
"### Task 1: Correlation Analysis\n",
"Since we want to predict `charges`, encode the categorical columns as numbers and create a heatmap to see which features correlate most strongly with medical costs."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# YOUR CODE HERE\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<details>\n",
"<summary>Click to see Solution</summary>\n",
"\n",
"```python\n",
"# Temporary label encoding, used only to inspect correlations.\n",
"# The integer codes for 'region' are arbitrary, so read that row with care.\n",
"df_temp = df.copy()\n",
"for col in ['sex', 'smoker', 'region']:\n",
"    df_temp[col] = LabelEncoder().fit_transform(df_temp[col])\n",
"\n",
"plt.figure(figsize=(10, 8))\n",
"sns.heatmap(df_temp.corr(), annot=True, cmap='coolwarm')\n",
"plt.title('Feature Correlation Heatmap')\n",
"plt.show()\n",
"```\n",
"</details>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Task 2: The 'Smoker' Effect\n",
"Visualization is key on Kaggle. Create a boxplot or violin plot showing `charges` separated by `smoker` status."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# YOUR CODE HERE\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<details>\n",
"<summary>Click to see Solution</summary>\n",
"\n",
"```python\n",
"sns.boxplot(x='smoker', y='charges', data=df)\n",
"plt.title('Effect of Smoking on Insurance Charges')\n",
"plt.show()\n",
"```\n",
"</details>"
]
},
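{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a numeric companion to the plot, a grouped summary makes the gap between the two groups explicit. This is a minimal sketch on a small hand-made frame (the rows are invented for illustration and are not taken from the dataset); with the real data, the same `groupby` call can be applied to `df`.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Toy rows invented for illustration; not real dataset values.\n",
"toy = pd.DataFrame({\n",
"    'smoker': ['yes', 'no', 'no', 'yes', 'no', 'yes'],\n",
"    'charges': [30000.0, 4000.0, 6000.0, 40000.0, 5000.0, 35000.0],\n",
"})\n",
"\n",
"# Count, median, and mean of charges for each smoker group.\n",
"summary = toy.groupby('smoker')['charges'].agg(['count', 'median', 'mean'])\n",
"print(summary)\n",
"\n",
"# Ratio of median charges, smokers relative to non-smokers.\n",
"ratio = summary.loc['yes', 'median'] / summary.loc['no', 'median']\n",
"print(f'Median ratio (yes / no): {ratio:.1f}')\n",
"```\n",
"\n",
"The median is used for the ratio because it is less sensitive than the mean to a few very large bills."
]
},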
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Phase 2: Feature Engineering\n",
"\n",
"### Task 3: Categorical Transformation\n",
"1. Binary encode `sex` and `smoker`.\n",
"2. One-hot encode the `region` column."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# YOUR CODE HERE\n"
]
},
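{
"cell_type": "markdown",
"metadata": {},
"source": [
"<details>\n",
"<summary>Click to see Solution</summary>\n",
"\n",
"```python\n",
"# drop_first=True leaves one indicator column for each binary feature (sex, smoker)\n",
"# and three indicator columns for the four regions.\n",
"df = pd.get_dummies(df, columns=['sex', 'smoker', 'region'], drop_first=True)\n",
"print(\"New Columns:\", df.columns.tolist())\n",
"```\n",
"</details>"
]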
},
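{
"cell_type": "markdown",
"metadata": {},
"source": [
"The two steps in Task 3 can also be written out explicitly, which makes the 0/1 meaning of each binary column visible. The sketch below is one possible approach, not the only one. It runs on a small hand-made frame (rows invented for illustration), and `encode_features` is a hypothetical helper name introduced here.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"def encode_features(frame):\n",
"    # Work on a copy so the caller's frame is left unchanged.\n",
"    out = frame.copy()\n",
"    # Step 1: explicit binary mappings (1 = male, 1 = smoker).\n",
"    out['sex'] = out['sex'].map({'female': 0, 'male': 1})\n",
"    out['smoker'] = out['smoker'].map({'no': 0, 'yes': 1})\n",
"    # Step 2: one 0/1 indicator column per region value.\n",
"    return pd.get_dummies(out, columns=['region'], dtype=int)\n",
"\n",
"# Toy rows invented for illustration; not real dataset values.\n",
"toy = pd.DataFrame({\n",
"    'sex': ['female', 'male', 'male'],\n",
"    'smoker': ['yes', 'no', 'no'],\n",
"    'region': ['southwest', 'southeast', 'northwest'],\n",
"})\n",
"encoded = encode_features(toy)\n",
"print(encoded.columns.tolist())\n",
"```\n",
"\n",
"Unlike `drop_first=True`, this keeps every region column. That is harmless for tree-based models, but it introduces collinearity for linear models with an intercept."
]
},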
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Phase 3: Modeling & Optimization\n",
"\n",
"### Task 4: Training & Evaluation\n",
"Split the data into training and test sets. Train a `RandomForestRegressor` on the training set and evaluate it on the test set using $R^2$ and Mean Absolute Error (MAE).\n",
"\n",
"*Hint: See the [Ensemble Methods Section](https://aashishgarg13.github.io/DataScience/ml_complete-all-topics/) to learn why Random Forest is a good fit for this data.*"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# YOUR CODE HERE\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<details>\n",
"<summary>Click to see Solution</summary>\n",
"\n",
"```python\n",
"# Separate the features from the target\n",
"X = df.drop('charges', axis=1)\n",
"y = df['charges']\n",
"\n",
"# Hold out 20% of the rows for testing\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)\n",
"\n",
"model = RandomForestRegressor(n_estimators=100, random_state=42)\n",
"model.fit(X_train, y_train)\n",
"\n",
"# Evaluate on rows the model has not seen\n",
"y_pred = model.predict(X_test)\n",
"\n",
"print(f\"R2 Score: {r2_score(y_test, y_pred):.4f}\")\n",
"print(f\"MAE: ${mean_absolute_error(y_test, y_pred):.2f}\")\n",
"```\n",
"</details>"
]
},
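{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the two metrics concrete, the sketch below computes them by hand with NumPy on a few made-up numbers (illustrative values only, not model output). MAE is the average absolute error, and $R^2 = 1 - SS_{res} / SS_{tot}$ compares the model's squared error with that of always predicting the mean.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"# Made-up values for illustration; not output from the notebook's model.\n",
"y_true = np.array([10.0, 20.0, 30.0, 40.0])\n",
"y_pred = np.array([12.0, 18.0, 33.0, 37.0])\n",
"\n",
"# Mean Absolute Error: average size of the errors, in the target's units.\n",
"mae = np.mean(np.abs(y_true - y_pred))\n",
"\n",
"# R^2: 1 minus (residual sum of squares / total sum of squares).\n",
"ss_res = np.sum((y_true - y_pred) ** 2)\n",
"ss_tot = np.sum((y_true - y_true.mean()) ** 2)\n",
"r2 = 1.0 - ss_res / ss_tot\n",
"\n",
"print(f'MAE: {mae:.2f}')\n",
"print(f'R2:  {r2:.4f}')\n",
"```\n",
"\n",
"On the same arrays, `mean_absolute_error` and `r2_score` from scikit-learn should agree with these hand-computed values."
]
},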
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. Phase 4: Interpretation\n",
"\n",
"### Task 5: Feature Importances\n",
"Which factor drives insurance prices the most? Visualize the model's feature importances."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# YOUR CODE HERE\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<details>\n",
"<summary>Click to see Solution</summary>\n",
"\n",
"```python\n",
"importances = pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)\n",
"sns.barplot(x=importances, y=importances.index)\n",
"plt.title('Key Drivers of Medical Costs')\n",
"plt.show()\n",
"```\n",
"</details>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"### Project Complete!\n",
"You've just completed a full machine learning cycle on real-world insurance data: acquisition, EDA, feature engineering, modeling, and interpretation.\n",
"By combining the theory from your **[DataScience Learning Hub](https://aashishgarg13.github.io/DataScience/)** with this hands-on project, you are better prepared to take on real Kaggle competitions!"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.0"
}
},
"nbformat": 4,
"nbformat_minor": 4
}