Medical Cost – Regression, Clustering & Classification

Link to my video:

https://www.youtube.com/watch?v=s9wRsR2CN6A

Link to my dataset from Kaggle:

https://www.kaggle.com/datasets/mohankrishnathalla/medical-insurance-cost-prediction

This project was completed as part of Assignment #2 – Classification, Regression, Clustering, Evaluation.
The goal is to predict patients’ annual medical cost based on their demographic, clinical, utilization and policy characteristics, and then re-frame the problem as a multi-class classification task (low / medium / high cost).

Short summary:

We analyze a health-insurance dataset to predict each patient’s annual medical cost. After performing EDA and feature engineering (including standardized financial/utilization variables and a KMeans-based risk_cluster feature), we compare a baseline Linear Regression model with Random Forest and Gradient Boosting regressors. Random Forest has the lowest MAE, but Gradient Boosting achieves the best RMSE and R² and is therefore selected as our final model. We then convert the problem into a three-class classification task (low, medium and high cost) and train Logistic Regression, Random Forest and Gradient Boosting classifiers. Gradient Boosting again provides the best Accuracy and macro-F1, and is selected as the final winning model.


Part 1: Dataset & Problem Definition

  • Synthetic health insurance / medical claims dataset (~100K rows, 54 columns) from Kaggle.
  • Target (regression): annual_medical_cost (continuous).
  • Example feature groups:
    • Demographics: age, sex, region, income
    • Insurance & payments: plan_id, network_tier_*, monthly_premium, annual_premium, total_claims_paid
    • Risk & utilization: risk_score, claims_count, proc_surgery_count, had_major_procedure, chronic_count, is_high_risk, comorbidity flags (e.g. diabetes, liver_disease)

The data is split into 80% train and 20% test, and this split is used consistently for all models.
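The split described above could be reproduced with scikit-learn's train_test_split; the toy DataFrame below merely stands in for the real dataset, and the column names are illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the insurance table (synthetic values).
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "age": rng.integers(18, 90, size=1000),
    "risk_score": rng.normal(0, 1, size=1000),
    "annual_medical_cost": rng.gamma(2.0, 1500.0, size=1000),
})

X = df.drop(columns=["annual_medical_cost"])
y = df["annual_medical_cost"]

# Fixed random_state so the same 80/20 split is reused by every model.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```

Fixing random_state is what makes the same split reusable across all models.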


Part 2: Exploratory Data Analysis (EDA)

Main questions:

  • How are annual medical costs distributed?
  • How do costs vary by risk and utilization?
  • Which variables are most correlated with cost?

2.1 Numerical features

We summarize key numeric variables:

  • age, bmi, risk_score
  • annual_medical_cost, total_claims_paid, claims_count

and visualize them with histograms.
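The skew patterns noted in the findings can also be checked numerically with pandas' skew(); this is a sketch on synthetic data whose distributions only mimic the real columns:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.normal(45, 12, size=5000),                 # roughly bell-shaped
    "annual_medical_cost": rng.gamma(1.5, 2000, 5000),    # right-skewed
})

# Positive skewness confirms the long right tail seen in the histograms.
skews = df.skew()
# df.hist(bins=50) would draw the histograms (requires matplotlib).
```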

Findings:

  • age and bmi are roughly bell-shaped around their means.
  • annual_medical_cost, total_claims_paid and claims_count are strongly right-skewed
    → most patients have moderate cost, but a small group is extremely expensive.

Figure 1: Histograms of the six numeric features.


2.2 Categorical features

We examine distributions of:

  • sex, smoker, region

using relative frequencies and bar plots.
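Relative frequencies of this kind come from value_counts(normalize=True); the sketch below uses invented proportions, not the real data:

```python
import pandas as pd

# Invented proportions standing in for the real smoker column.
smoker = pd.Series(["never"] * 70 + ["former"] * 18 + ["current"] * 12)

freq = smoker.value_counts(normalize=True)
# freq.plot(kind="bar") would draw the bar chart (requires matplotlib).
```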

Findings:

  • sex is roughly balanced.
  • Most individuals are never-smokers.
  • The South region has the largest share of patients.

Figure 2: Bar charts for sex, smoker and region.

2.3 Correlation with annual cost

We compute correlations of all numeric features with annual_medical_cost and show the top 10 in a horizontal bar plot.
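One way to produce such a ranking is to sort the target's correlation column; the snippet is a sketch on synthetic data where annual_premium is constructed to correlate with cost:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
cost = rng.gamma(2.0, 1500, 2000)
df = pd.DataFrame({
    "annual_medical_cost": cost,
    "annual_premium": cost * 0.3 + rng.normal(0, 200, 2000),  # strongly related
    "age": rng.normal(45, 12, 2000),                          # unrelated here
})

# Correlation of every numeric feature with the target, sorted descending.
corr = (df.corr()["annual_medical_cost"]
          .drop("annual_medical_cost")
          .sort_values(ascending=False))
top10 = corr.head(10)
# top10.plot(kind="barh") would draw the horizontal bar plot.
```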

Findings:

  • Strongest positive correlations for:
    • monthly_premium, annual_premium
    • total_claims_paid
    • avg_claim_amount (if present)
  • Clinical and risk variables, such as risk_score and chronic_count, show moderate positive correlation.

Figure 3: Top correlations with annual_medical_cost.

2.4 Cost vs. categories

We use boxplots for:

  • annual_medical_cost by smoker status.
  • annual_medical_cost by is_high_risk_label (derived from is_high_risk).
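The group differences that the boxplots reveal can be quantified with a groupby aggregation; the example fabricates two smoker groups whose cost distributions mimic the pattern described in the findings:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
df = pd.DataFrame({
    "smoker": ["never"] * 800 + ["current"] * 200,
    "annual_medical_cost": np.concatenate([
        rng.gamma(2.0, 1000, 800),   # never-smokers: cheaper
        rng.gamma(2.0, 2500, 200),   # current smokers: costlier, wider spread
    ]),
})

# Median and spread per group, mirroring what the boxplots show visually.
stats = df.groupby("smoker")["annual_medical_cost"].agg(["median", "std"])
# df.boxplot(column="annual_medical_cost", by="smoker") would draw Figure 4.
```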

Findings:

  • Current and former smokers have higher median costs and greater variability than never-smokers.
  • High-risk patients show much higher median costs and a wider spread than non high-risk patients.

Figure 4: Boxplots of annual cost by smoker status.

Figure 5: Boxplots of annual cost by high-risk status.


Part 3: Baseline Regression Model

Model: LinearRegression trained on the original encoded features (X_encoded).

  • Evaluation on the test set:
    • MAE ≈ 320
    • RMSE ≈ 575
    • R² ≈ 0.966

A scatter plot of true vs. predicted cost (not mandatory for the assignment) shows that the linear model captures the global trend, but errors become large for high-cost patients, motivating more advanced models.
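A baseline of this kind can be fit and scored in a few lines; the snippet below is a self-contained sketch on synthetic linear data, not the actual assignment pipeline:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
X = rng.normal(size=(2000, 5))
y = X @ np.array([3.0, -2.0, 1.0, 0.5, 4.0]) + rng.normal(0, 0.5, 2000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

# The three metrics reported throughout this project.
mae = mean_absolute_error(y_test, pred)
rmse = np.sqrt(mean_squared_error(y_test, pred))
r2 = r2_score(y_test, pred)
```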


Part 4: Feature Engineering

To improve model performance and capture non-linear patterns, several engineered features are added.

4.1 Manual features

Examples:

  • cost_per_claim = total_claims_paid / claims_count (with claims_count = 0 handled separately to avoid division by zero)
  • risk_x_premium = risk_score * monthly_premium
  • has_chronic = indicator if chronic_count > 0
  • age_group = age binned into categories (e.g. 0–19, 20–39, …)

These features aim to compress important relationships such as “cost per utilization” and interactions between risk and premiums.
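A sketch of these manual features on a toy DataFrame (the column values are invented; note the guard for patients with zero claims):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "total_claims_paid": [3000.0, 0.0, 1200.0],
    "claims_count": [3, 0, 2],
    "risk_score": [0.8, 0.1, 0.5],
    "monthly_premium": [250.0, 100.0, 180.0],
    "chronic_count": [2, 0, 1],
    "age": [25, 67, 41],
})

# Guard against division by zero for patients with no claims.
df["cost_per_claim"] = (
    df["total_claims_paid"] / df["claims_count"].replace(0, np.nan)
).fillna(0)
df["risk_x_premium"] = df["risk_score"] * df["monthly_premium"]
df["has_chronic"] = (df["chronic_count"] > 0).astype(int)
df["age_group"] = pd.cut(df["age"], bins=[0, 19, 39, 59, 120],
                         labels=["0-19", "20-39", "40-59", "60+"])
```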

4.2 Scaling

Key numeric features are standardized using StandardScaler, producing new *_scaled columns, e.g.:

  • income_scaled, annual_premium_scaled, monthly_premium_scaled
  • total_claims_paid_scaled, claims_count_scaled, risk_score_scaled

This keeps magnitudes comparable and helps distance-based models (such as the KMeans step below) and linear models; tree-based models are largely insensitive to feature scaling.
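The scaling step might look like the following sketch, which appends *_scaled columns next to the originals (synthetic values):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
df = pd.DataFrame({
    "income": rng.normal(50000, 15000, 500),
    "annual_premium": rng.normal(3000, 800, 500),
})

cols = ["income", "annual_premium"]
scaler = StandardScaler()
scaled = scaler.fit_transform(df[cols])

# Keep the originals and add standardized copies with a _scaled suffix.
for i, c in enumerate(cols):
    df[c + "_scaled"] = scaled[:, i]
```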

4.3 Clustering feature (risk_cluster)

We apply KMeans (n_clusters=4) on scaled versions of:

  • age, risk_score, chronic_count, claims_count, annual_premium

The resulting cluster assignments are stored in:

  • risk_cluster (categorical, values 0–3).

This groups patients into 4 risk-based clusters with similar profiles and is used later as a categorical feature (one-hot encoded).
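A minimal version of the clustering step, assuming the five listed columns are standardized first (data here is synthetic):

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(5)
df = pd.DataFrame({
    "age": rng.integers(18, 90, 1000),
    "risk_score": rng.normal(0, 1, 1000),
    "chronic_count": rng.poisson(1.0, 1000),
    "claims_count": rng.poisson(2.0, 1000),
    "annual_premium": rng.normal(3000, 800, 1000),
})

cluster_cols = ["age", "risk_score", "chronic_count",
                "claims_count", "annual_premium"]
X_scaled = StandardScaler().fit_transform(df[cluster_cols])

# Four risk-based clusters; the label becomes a categorical feature.
km = KMeans(n_clusters=4, n_init=10, random_state=42)
df["risk_cluster"] = km.fit_predict(X_scaled)
```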

(Optional improvement for visualization): add a 2D PCA scatter plot colored by risk_cluster to show cluster separation.

4.4 Final feature matrix

  • Drop target and ID/helper columns (e.g. annual_medical_cost, person_id, helper labels).
  • One-hot encode categoricals (including age_group and risk_cluster).
  • Resulting engineered feature matrix: ~100,000 rows × 87 feature columns.
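The three steps above can be sketched with pandas.get_dummies on a toy frame (the real column set is much larger):

```python
import pandas as pd

df = pd.DataFrame({
    "annual_medical_cost": [1200.0, 5400.0, 800.0],
    "person_id": [1, 2, 3],
    "age_group": ["20-39", "60+", "0-19"],
    "risk_cluster": [0, 3, 1],
    "income": [40000.0, 72000.0, 31000.0],
})

# Drop the target and ID/helper columns.
X = df.drop(columns=["annual_medical_cost", "person_id"])

# Treat the integer cluster label as a category before one-hot encoding.
X["risk_cluster"] = X["risk_cluster"].astype("category")
X_fe_encoded = pd.get_dummies(X, columns=["age_group", "risk_cluster"])
y = df["annual_medical_cost"]
```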

Part 5: Improved Regression Models

Using the engineered features (X_fe_encoded) with the same 80/20 split, three regression models are trained:

  1. Linear Regression (FE) – Linear model with engineered features.
  2. Random Forest Regressor (FE) – Captures non-linear relationships using many trees.
  3. Gradient Boosting Regressor (FE) – Boosting of weak trees for high accuracy.
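The comparison loop might be structured as below; the data is synthetic with a deliberately non-linear target, so the tree ensembles outperform the linear baseline, mirroring the pattern in the real results:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(6)
X = rng.normal(size=(1500, 5))
y = X[:, 0] ** 2 + 3 * X[:, 1] + rng.normal(0, 0.3, 1500)  # non-linear target

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "Linear Regression (FE)": LinearRegression(),
    "Random Forest (FE)": RandomForestRegressor(n_estimators=100, random_state=42),
    "Gradient Boosting (FE)": GradientBoostingRegressor(random_state=42),
}

rows = []
for name, model in models.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    rows.append({
        "model": name,
        "MAE": mean_absolute_error(y_te, pred),
        "RMSE": np.sqrt(mean_squared_error(y_te, pred)),
        "R2": r2_score(y_te, pred),
    })
results_fe = pd.DataFrame(rows).set_index("model")
```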

Test performance:

  • Linear Regression (FE)

    • MAE ≈ 314
    • RMSE ≈ 569
    • R² ≈ 0.967
  • Random Forest Regressor (FE)

    • MAE ≈ 51.88
    • RMSE ≈ 403.25
    • R² ≈ 0.983
  • Gradient Boosting Regressor (FE)

    • MAE ≈ 97.44
    • RMSE ≈ 237.64
    • R² ≈ 0.994

Figure 6: Table (results_fe) comparing MAE, RMSE and R² for the three improved models.


Part 6: Winning Regression Model

Although Random Forest achieves the lowest MAE, Gradient Boosting achieves the best RMSE and the highest R²; both metrics are based on squared errors and therefore put more weight on large errors.

Since large under- or over-estimates are particularly problematic in a medical-cost setting, RMSE and R² are chosen as the main selection criteria.

  • Winning regression model: Gradient Boosting Regressor with engineered features.

The model is also exported to a pickle file and uploaded to Hugging Face for reuse.
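The export-and-reload round trip with pickle might look like this sketch (the regressor filename is illustrative; the model here is fit on synthetic data):

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(7)
X = rng.normal(size=(200, 3))
y = X[:, 0] + rng.normal(0, 0.1, 200)
model = GradientBoostingRegressor(random_state=42).fit(X, y)

# Serialize the fitted model to disk.
path = os.path.join(tempfile.mkdtemp(), "medical_cost_gb_regressor.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)

# Reload and confirm identical predictions.
with open(path, "rb") as f:
    loaded = pickle.load(f)
```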


Part 7: Regression → Classification (Creating Classes)

To convert the task into a 3-class classification problem, the continuous target is binned using train-set quantiles:

  • Class 0 – low cost: annual_medical_cost < 1,432
  • Class 1 – medium cost: 1,432 ≤ cost < 2,951
  • Class 2 – high cost: cost ≥ 2,951

The thresholds are computed only on the train set and then applied to the test set to avoid data leakage.
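The binning can be done with train-set quantiles and pd.cut; the thresholds below are computed from synthetic costs, so they differ from the 1,432 / 2,951 values above:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(8)
y_train = pd.Series(rng.gamma(2.0, 1500, 8000))
y_test = pd.Series(rng.gamma(2.0, 1500, 2000))

# Tercile thresholds computed on the TRAIN target only (avoids leakage).
q1, q2 = y_train.quantile([1 / 3, 2 / 3])

def to_class(y):
    # 0 = low, 1 = medium, 2 = high cost.
    return pd.cut(y, bins=[-np.inf, q1, q2, np.inf], labels=[0, 1, 2]).astype(int)

y_train_cls = to_class(y_train)
y_test_cls = to_class(y_test)   # same thresholds applied to the test set
```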

Class distributions on both train and test are roughly balanced (about 1/3 in each class), so no special rebalancing techniques are needed. Global metrics such as accuracy and macro F1 are therefore meaningful.


Part 8: Classification Models & Evaluation (Winning Classifier)

Using the same engineered features, three classification models are trained:

  1. Logistic Regression
  2. Random Forest Classifier
  3. Gradient Boosting Classifier

For each model, we compute a classification report (precision, recall, F1) and show a confusion matrix.
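The per-model evaluation might look like the following sketch on a synthetic 3-class problem (via make_classification), standing in for the engineered features:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report, confusion_matrix, f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1500, n_classes=3, n_informative=6,
                           random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          random_state=42, stratify=y)

clf = GradientBoostingClassifier(random_state=42).fit(X_tr, y_tr)
pred = clf.predict(X_te)

# Per-class precision/recall/F1, 3x3 confusion matrix, and macro F1.
report = classification_report(y_te, pred)
cm = confusion_matrix(y_te, pred)
macro_f1 = f1_score(y_te, pred, average="macro")
```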

8.1 Test performance (summary)

Model                  Accuracy   Macro F1
Logistic Regression    ~0.82      ~0.82
Random Forest          ~0.93      ~0.93
Gradient Boosting      ~0.99      ~0.99

Gradient Boosting Classifier clearly outperforms the others, with very few misclassifications for any of the three cost classes.

Because the classes are balanced and all of them are important, macro F1 is used as the main comparison metric (not only raw accuracy).

Figures 7–9: Confusion matrices for Logistic Regression, Random Forest and Gradient Boosting.

Figure 10: Table comparing Accuracy and macro F1 for all three classifiers.

8.2 Winning classifier

  • Final chosen model: Gradient Boosting Classifier trained on the engineered features.
  • Saved as: medical_cost_gb_classifier.pkl (and uploaded to Hugging Face).

Key Insights (Summary)

  • Annual medical costs and utilization features are highly skewed: most patients have low costs, but a small minority is extremely expensive.
  • Financial features (premiums, total claims) show the strongest correlation with cost; risk and chronic conditions also matter.
  • Feature engineering (manual features, scaling, and the risk_cluster feature) helps non-linear models capture complex patterns.
  • For regression, Gradient Boosting offers the best overall performance on RMSE/R², while Random Forest yields the lowest MAE.
  • For classification (low/medium/high cost), Gradient Boosting Classifier achieves the highest Accuracy and macro F1 and is therefore selected as the final winning model.