Credit Default Prediction Model (Random Forest Classifier)

This repository contains a machine-learning model trained to predict whether a borrower will default on a loan (TARGET = 1).
The model is trained on a feature-engineered version of the Home Credit Default Risk dataset and is configured to handle its severe class imbalance (~92% non-default vs. ~8% default).

The final model is a Random Forest Classifier, selected after evaluating multiple algorithms.

📁 Files Included

rf_classifier.pkl — trained Random Forest model
feature_columns.json — ordered list of input feature names

Both files must be used together for inference.

🧠 Model Description

Task: Binary Classification
Labels:

0 → borrower predicted not to default
1 → borrower predicted to default

Algorithm: RandomForestClassifier

Training Configuration

RandomForestClassifier(
    n_estimators=80,
    max_depth=10,
    min_samples_leaf=100,
    n_jobs=-1,
    random_state=42,
    class_weight="balanced"
)

Why this configuration?

class_weight="balanced": Critical for dealing with the severe class imbalance (~92% non-default vs. ~8% default).
min_samples_leaf=100: Reduces variance and prevents the model from learning noise specific to a few samples.
max_depth=10: Limits tree complexity to reduce overfitting and improve generalization.
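To make the effect of class_weight="balanced" concrete, the sketch below reproduces the weighting formula documented by scikit-learn, n_samples / (n_classes * np.bincount(y)), on a synthetic label vector. The 92/8 split mirrors the approximate class ratio quoted above; the toy array itself is an assumption and not the real training data.

```python
import numpy as np

# Synthetic labels mimicking the ~92% / ~8% class split (assumption, not real data)
y = np.array([0] * 92 + [1] * 8)

n_samples = len(y)
n_classes = len(np.unique(y))

# "balanced" weights: inversely proportional to class frequency
weights = n_samples / (n_classes * np.bincount(y))

print("weight for class 0 (non-default):", round(weights[0], 4))
print("weight for class 1 (default):", round(weights[1], 4))
```

With this split, each default example counts roughly 11.5 times as much as a non-default example when the trees are grown.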

📊 Model Performance (Held-Out Test Set)

Metric	Score
Accuracy	~0.71
Precision	~0.165
Recall	~0.638
F1 Score	~0.263
ROC-AUC	~0.739
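
As a quick consistency check on the table, F1 is the harmonic mean of precision and recall. The snippet below recomputes it from the rounded values reported above; because the inputs are rounded, the result can differ from the reported F1 in the third decimal place.

```python
# Rounded values copied from the table above
precision = 0.165
recall = 0.638

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(f"F1 recomputed from rounded precision and recall: {f1:.3f}")
```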

📝 Interpretation & Why This Model Won

Recall is King: In credit risk, missing a risky borrower (False Negative) is far more costly than flagging a safe borrower (False Positive).
Beyond Accuracy: Accuracy is misleading here; predicting "no default" for everyone would yield ~92% accuracy while catching no defaulters at all (0% recall).
Performance: The Random Forest achieved the best balance of Recall (~64%) and ROC-AUC (~0.74) compared to Logistic Regression and Gradient Boosting models tested during development.
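
The accuracy argument above can be illustrated with a trivial baseline. This is a minimal sketch on synthetic labels (100 borrowers, 8 defaults, chosen to match the approximate class ratio); it is not an evaluation of the actual model.

```python
from sklearn.metrics import accuracy_score, recall_score

# Synthetic ground truth: 92 non-defaults, 8 defaults (assumption)
y_true = [0] * 92 + [1] * 8

# Baseline that always predicts "no default"
y_pred = [0] * len(y_true)

accuracy = accuracy_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)

print("Baseline accuracy:", accuracy)
print("Baseline recall:", recall)
```

The baseline scores high on accuracy yet identifies none of the defaulters, which is why recall and ROC-AUC were used to compare models.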

🧩 Dataset & Feature Engineering

The model was trained on a feature-engineered dataset derived from the Home Credit competition. Key feature groups include:

Financial: Credit amount, annuity, income, credit-to-income ratio.
Demographics: Age, employment years, dependents, family size.
Stability Indicators: Employment stability, relative richness.
Clustering: K-Means clustering features (CLUSTER_ID, CLUSTER_DIST) were added to capture non-linear borrower segments.
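
The exact clustering setup is not documented here, so the following is only a sketch of how features like CLUSTER_ID and CLUSTER_DIST could be derived with scikit-learn. The number of clusters, the scaling step, the input columns, and the reading of CLUSTER_DIST as the distance to the assigned centroid are all assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for numeric borrower features (assumption)
rng = np.random.default_rng(42)
X_num = rng.normal(size=(200, 3))

# K-Means is distance-based, so features are standardized first
X_scaled = StandardScaler().fit_transform(X_num)

# n_clusters=4 is a hypothetical choice, not the value used in training
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X_scaled)

# CLUSTER_ID: index of the nearest centroid
cluster_id = kmeans.predict(X_scaled)

# CLUSTER_DIST: distance to that nearest centroid
# (transform returns the distance to every centroid; take the minimum)
cluster_dist = kmeans.transform(X_scaled).min(axis=1)

print(cluster_id[:5], cluster_dist[:5])
```

At inference time the same fitted scaler and K-Means object would have to be reused so that the cluster features match those seen during training.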

⬇️ How to Download the Model

from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="guyshilo12/home-credit-default-classifier",
    filename="rf_classifier.pkl"
)

cols_path = hf_hub_download(
    repo_id="guyshilo12/home-credit-default-classifier",
    filename="feature_columns.json"
)

📦 How to Load the Model

import pickle
import json

# Load the model from the local path returned by hf_hub_download above
with open(model_path, "rb") as f:
    model = pickle.load(f)

# Load the ordered list of feature names the model expects
with open(cols_path, "r") as f:
    feature_cols = json.load(f)

🔮 How to Run a Prediction

import pandas as pd

# Example input row (replace 0s with real feature values)
X = pd.DataFrame([{col: 0 for col in feature_cols}])

# Predict class and probability of default
pred = model.predict(X)[0]
proba = model.predict_proba(X)[:, 1][0]

print("Predicted class:", pred)
print("Default probability:", proba)
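
Because the model depends on the column order stored in feature_columns.json, it can help to align real input data to that order before calling predict. The helper below is a hypothetical sketch (align_features is not part of this repository), and the three column names are illustrative stand-ins for the real feature list.

```python
import pandas as pd

def align_features(df, feature_cols):
    """Return df restricted to feature_cols, in that exact order."""
    missing = [c for c in feature_cols if c not in df.columns]
    if missing:
        raise ValueError(f"Missing feature columns: {missing}")
    # Selecting by list both drops extra columns and enforces the order
    return df.loc[:, feature_cols]

# Hypothetical feature list; in practice load it from feature_columns.json
feature_cols = ["AMT_CREDIT", "AMT_INCOME_TOTAL", "CLUSTER_ID"]

# Input with shuffled columns and one extra column
raw = pd.DataFrame([{
    "CLUSTER_ID": 2,
    "EXTRA": 1,
    "AMT_INCOME_TOTAL": 90000.0,
    "AMT_CREDIT": 250000.0,
}])

X = align_features(raw, feature_cols)
print(list(X.columns))
```

The aligned frame can then be passed to model.predict and model.predict_proba as shown above.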

🏆 Intended Use

This model is suitable for:

Academic / university projects

Experiments on imbalanced classification

Credit-risk proof-of-concept models

Research on feature engineering and tabular ML

Not intended for real-world lending decisions, production use, or regulatory environments without proper validation, calibration, and compliance review.
