Credit Default Prediction Model (Random Forest Classifier)

This repository contains a machine-learning model trained to predict whether a borrower will default on a loan (TARGET = 1).
The model is based on a feature-engineered version of the Home Credit Default Risk dataset and is optimized to handle extremely imbalanced data.

The final model is a Random Forest Classifier, selected after evaluating multiple algorithms.


Files Included

  • rf_classifier.pkl: trained Random Forest model
  • feature_columns.json: ordered list of input feature names

Both files must be used together for inference.


Model Description

Task: Binary Classification
Labels:

  • 0 → borrower predicted not to default
  • 1 → borrower predicted to default

Algorithm: RandomForestClassifier

Training Configuration

RandomForestClassifier(
    n_estimators=80,
    max_depth=10,
    min_samples_leaf=100,
    n_jobs=-1,
    random_state=42,
    class_weight="balanced"
)

Why this configuration?

  • class_weight="balanced": Critical for dealing with the severe class imbalance (~92% non-default vs. ~8% default).
  • min_samples_leaf=100: Reduces variance and prevents the model from learning noise specific to a few samples.
  • max_depth=10: Limits tree complexity to reduce overfitting and improve generalization.
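To make the effect of class_weight="balanced" concrete, the sketch below reproduces the formula scikit-learn documents for that mode, n_samples / (n_classes * class_count), in plain Python. The label vector is hypothetical and only mirrors the ~92% / ~8% split quoted above; it is not the actual training data.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class: weight} using n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical labels: 920 non-defaults and 80 defaults (assumed split, not real data)
y = [0] * 920 + [1] * 80
weights = balanced_class_weights(y)
print(weights)
```

Under this assumed split, each default sample is weighted about 11.5 times as heavily as a non-default sample, which is what pushes the forest toward higher recall on the minority class.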

Model Performance (Held-Out Test Set)

Metric      Score
Accuracy    ~0.71
Precision   ~0.165
Recall      ~0.638
F1 Score    ~0.263
ROC-AUC     ~0.739
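As a rough consistency check, the reported scores can be related to each other with basic arithmetic. The sketch below recomputes F1 from precision and recall, then derives the accuracy implied by those two values. The default rate of 0.08 is an assumption taken from the ~8% figure quoted for the dataset; the exact rate in the test set is not stated.

```python
precision = 0.165
recall = 0.638
default_rate = 0.08  # assumption: share of defaulters in the test set

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

# Confusion-matrix cells as fractions of all test rows
tp = recall * default_rate                 # defaulters correctly flagged
fn = default_rate - tp                     # defaulters missed
fp = tp * (1 - precision) / precision      # safe borrowers flagged
tn = (1 - default_rate) - fp               # safe borrowers correctly passed

accuracy = tp + tn
print(round(f1, 3), round(accuracy, 3))
```

Both derived values land close to the reported ~0.263 and ~0.71, with small differences attributable to rounding of the inputs. The same arithmetic shows that roughly a quarter of all applicants would be flagged incorrectly under these assumptions.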

Interpretation & Why This Model Won

  • Recall is King: In credit risk, missing a risky borrower (False Negative) is far more costly than flagging a safe borrower (False Positive).
  • Beyond Accuracy: Accuracy is misleading here; predicting "no default" for everyone would yield ~92% accuracy while identifying no defaulters at all (recall of 0).
  • Performance: The Random Forest achieved the best balance of Recall (~64%) and ROC-AUC (~0.74) compared to Logistic Regression and Gradient Boosting models tested during development.
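The accuracy point can be illustrated with a trivial baseline that always predicts "no default". The labels below are hypothetical and only mirror the ~92% / ~8% split; they are not taken from the real test set.

```python
# Hypothetical ground truth: 920 non-defaults, 80 defaults (assumed split)
y_true = [0] * 920 + [1] * 80
# Baseline that predicts "no default" for every borrower
y_pred = [0] * len(y_true)

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
actual_pos = sum(t == 1 for t in y_true)
recall = true_pos / actual_pos

print("accuracy:", accuracy, "recall:", recall)
```

The baseline scores higher on accuracy than the Random Forest while catching none of the defaulters, which is why recall and ROC-AUC are the metrics used for comparison here.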

Dataset & Feature Engineering

The model was trained on a feature-engineered dataset derived from the Home Credit competition. Key feature groups include:

  • Financial: Credit amount, annuity, income, credit-to-income ratio.
  • Demographics: Age, employment years, dependents, family size.
  • Stability Indicators: Employment stability, relative richness.
  • Clustering: K-Means clustering features (CLUSTER_ID, CLUSTER_DIST) were added to capture non-linear borrower segments.
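The exact feature definitions are not given here, so the following is only a minimal sketch of how a ratio feature and the two clustering features could be built. AMT_CREDIT and AMT_INCOME_TOTAL are column names from the Home Credit data; the toy values, the CREDIT_INCOME_RATIO name, the scaling step, and the choice of two clusters are all assumptions made for illustration.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Toy rows with made-up values (not real applicants)
df = pd.DataFrame({
    "AMT_CREDIT": [200_000, 450_000, 1_000_000, 150_000, 800_000, 300_000],
    "AMT_INCOME_TOTAL": [100_000, 90_000, 250_000, 60_000, 400_000, 120_000],
})

# Hypothetical ratio feature: credit amount relative to income
df["CREDIT_INCOME_RATIO"] = df["AMT_CREDIT"] / df["AMT_INCOME_TOTAL"]

# Scale before clustering so no single column dominates the distances
cluster_inputs = ["AMT_CREDIT", "AMT_INCOME_TOTAL", "CREDIT_INCOME_RATIO"]
scaled = StandardScaler().fit_transform(df[cluster_inputs])

# fit_transform returns each row's distance to every cluster centre
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
distances = kmeans.fit_transform(scaled)

df["CLUSTER_ID"] = distances.argmin(axis=1)   # index of the nearest centre
df["CLUSTER_DIST"] = distances.min(axis=1)    # distance to that centre
print(df)
```

At inference time the same fitted scaler and K-Means object would need to be reused so that CLUSTER_ID and CLUSTER_DIST mean the same thing as they did during training.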

How to Download the Model

from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="guyshilo12/home-credit-default-classifier",
    filename="rf_classifier.pkl"
)

cols_path = hf_hub_download(
    repo_id="guyshilo12/home-credit-default-classifier",
    filename="feature_columns.json"
)

How to Load the Model

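import json
import pickle

# model_path and cols_path are the local paths returned by hf_hub_download above
# Load model (unpickling requires scikit-learn to be installed)
with open(model_path, "rb") as f:
    model = pickle.load(f)

# Load feature order
with open(cols_path, "r") as f:
    feature_cols = json.load(f)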

How to Run a Prediction

import pandas as pd

# Example input row (replace 0s with real feature values)
X = pd.DataFrame([{col: 0 for col in feature_cols}])

# Predict class and probability of default
pred = model.predict(X)[0]
proba = model.predict_proba(X)[:, 1][0]

print("Predicted class:", pred)
print("Default probability:", proba)
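Because the model expects its inputs in the order stored in feature_columns.json, real input data usually needs to be aligned before calling predict. The helper below is a hypothetical sketch, not part of this repository: it fails loudly on missing columns, drops extras, and reorders the rest. The three-column feature list is a stand-in for the real list loaded from the JSON file.

```python
import pandas as pd


def align_features(raw, feature_cols):
    """Return raw restricted to feature_cols, in that exact order."""
    missing = [col for col in feature_cols if col not in raw.columns]
    if missing:
        raise ValueError(f"Missing feature columns: {missing}")
    return raw[feature_cols]


# Stand-in for the list loaded from feature_columns.json (assumed names)
feature_cols = ["AMT_CREDIT", "AMT_INCOME_TOTAL", "CLUSTER_ID"]

# Input with shuffled columns and one extra column that the model never saw
raw = pd.DataFrame([{
    "CLUSTER_ID": 1,
    "EXTRA": 9,
    "AMT_INCOME_TOTAL": 90_000,
    "AMT_CREDIT": 200_000,
}])

X = align_features(raw, feature_cols)
print(list(X.columns))
```

The aligned X can then be passed to model.predict and model.predict_proba in place of the all-zeros example row.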

Intended Use

This model is suitable for:

  • Academic / university projects
  • Experiments on imbalanced classification
  • Credit-risk proof-of-concept models
  • Research on feature engineering and tabular ML

Not intended for real-world lending decisions, production use, or regulatory environments without proper validation, calibration, and compliance review.
