Credit Default Prediction Model (Random Forest Classifier)
This repository contains a machine-learning model trained to predict whether a borrower will default on a loan (TARGET = 1).
The model is based on a feature-engineered version of the Home Credit Default Risk dataset and is optimized to handle extremely imbalanced data.
The final model is a Random Forest Classifier, selected after evaluating multiple algorithms.
Files Included
- rf_classifier.pkl: trained Random Forest model
- feature_columns.json: ordered list of input feature names
Both files must be used together for inference.
Model Description
Task: Binary Classification
Labels:
- 0: borrower predicted not to default
- 1: borrower predicted to default
Algorithm: RandomForestClassifier
Training Configuration
RandomForestClassifier(
    n_estimators=80,
    max_depth=10,
    min_samples_leaf=100,
    n_jobs=-1,
    random_state=42,
    class_weight="balanced"
)
Why this configuration?
- class_weight="balanced": Critical for dealing with the severe class imbalance (~92% non-default vs. ~8% default).
- min_samples_leaf=100: Reduces variance and prevents the model from learning noise specific to a few samples.
- max_depth=10: Limits tree complexity to ensure generalization.
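To make the effect of class_weight="balanced" concrete, the sketch below computes the weights scikit-learn would assign for a 92% / 8% split. The label array is synthetic and only mirrors the approximate class ratio quoted above; it is not the actual training data.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Synthetic labels mirroring the approximate 92% / 8% split (illustrative only).
y = np.array([0] * 92 + [1] * 8)

# "balanced" assigns each class the weight n_samples / (n_classes * class_count).
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)
class_weights = dict(zip([0, 1], weights))
print(class_weights)
```

With this ratio the default class is weighted roughly 11 to 12 times more heavily than the non-default class, which is what pushes the forest toward higher recall on defaulters.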
Model Performance (Held-Out Test Set)
| Metric | Score |
|---|---|
| Accuracy | ~0.71 |
| Precision | ~0.165 |
| Recall | ~0.638 |
| F1 Score | ~0.263 |
| ROC-AUC | ~0.739 |
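As a quick consistency check on the table, the F1 score can be recomputed from the reported precision and recall. The inputs below are the rounded values from the table, so the result may differ from the reported F1 in the third decimal place.

```python
# Rounded values taken from the table above.
precision = 0.165
recall = 0.638

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))
```

A precision of about 0.165 means that roughly one in six borrowers flagged by the model actually defaults, which is the price paid for the higher recall.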
Interpretation & Why This Model Won
- Recall is King: In credit risk, missing a risky borrower (False Negative) is far more costly than flagging a safe borrower (False Positive).
- Beyond Accuracy: Accuracy is misleading here; predicting "no default" for everyone would yield ~92% accuracy while catching no defaulters at all (0% recall).
- Performance: The Random Forest achieved the best balance of Recall (~64%) and ROC-AUC (~0.74) compared to Logistic Regression and Gradient Boosting models tested during development.
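The accuracy argument above can be illustrated with a majority-class baseline. This is a minimal sketch on synthetic labels with the approximate 92% / 8% ratio, not a rerun of the actual evaluation.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Synthetic ground truth: 920 non-defaults and 80 defaults (illustrative only).
y_true = np.array([0] * 920 + [1] * 80)

# A "model" that predicts no default for every borrower.
y_pred = np.zeros_like(y_true)

baseline_accuracy = accuracy_score(y_true, y_pred)
baseline_recall = recall_score(y_true, y_pred)
print("accuracy:", baseline_accuracy)
print("recall:", baseline_recall)
```

The baseline scores high on accuracy yet identifies none of the defaulters, which is why recall and ROC-AUC are the more informative metrics for this task.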
Dataset & Feature Engineering
The model was trained on a feature-engineered dataset derived from the Home Credit competition. Key feature groups include:
- Financial: Credit amount, annuity, income, credit-to-income ratio.
- Demographics: Age, employment years, dependents, family size.
- Stability Indicators: Employment stability, relative richness.
- Clustering: K-Means clustering features (CLUSTER_ID, CLUSTER_DIST) were added to capture non-linear borrower segments.
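The clustering features could be derived along the following lines. This is only a sketch: the number of clusters, the scaling step, the input columns, and the reading of CLUSTER_DIST as the distance to the assigned centroid are all assumptions, since the original pipeline is not shown here.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; the column names are placeholders, not the model's real inputs.
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "AMT_CREDIT": rng.uniform(50_000, 1_000_000, size=200),
    "AMT_INCOME_TOTAL": rng.uniform(30_000, 300_000, size=200),
})

# Scale first so no single feature dominates the Euclidean distances.
scaled = StandardScaler().fit_transform(df)

# n_clusters=4 is an arbitrary choice for illustration.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42).fit(scaled)

# Cluster membership as a categorical feature.
df["CLUSTER_ID"] = kmeans.labels_
# transform() returns the distance to every centroid; keep the nearest one.
df["CLUSTER_DIST"] = kmeans.transform(scaled).min(axis=1)

print(df.head())
```

At inference time the same fitted scaler and K-Means object would have to be reused so that the cluster features match those seen during training.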
How to Download the Model
from huggingface_hub import hf_hub_download
model_path = hf_hub_download(
    repo_id="guyshilo12/home-credit-default-classifier",
    filename="rf_classifier.pkl"
)
cols_path = hf_hub_download(
    repo_id="guyshilo12/home-credit-default-classifier",
    filename="feature_columns.json"
)
How to Load the Model
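import pickle
import json
# Load model (pass the model_path returned by hf_hub_download if the file was fetched from the Hub)
with open("rf_classifier.pkl", "rb") as f:
    model = pickle.load(f)
# Load feature order (likewise, cols_path if the file was fetched from the Hub)
with open("feature_columns.json", "r") as f:
    feature_cols = json.load(f)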
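Because the two files must be used together, it can help to verify that the column list matches the loaded model before predicting. The helper below is a hypothetical sketch, not part of this repository, and it is demonstrated on a toy model with made-up column names so that it runs without the real files.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier


def check_feature_alignment(model, feature_cols):
    """Return True if feature_cols is compatible with the fitted model, else raise."""
    if len(feature_cols) != model.n_features_in_:
        raise ValueError(
            f"model expects {model.n_features_in_} features, got {len(feature_cols)}"
        )
    # feature_names_in_ exists only if the model was fitted on a DataFrame.
    fitted_names = getattr(model, "feature_names_in_", None)
    if fitted_names is not None and list(fitted_names) != list(feature_cols):
        raise ValueError("feature names or their order differ from training")
    return True


# Toy stand-in for the real artifacts (hypothetical column names).
toy_cols = ["f_a", "f_b", "f_c"]
toy_X = pd.DataFrame(np.random.default_rng(0).normal(size=(20, 3)), columns=toy_cols)
toy_model = RandomForestClassifier(n_estimators=5, random_state=0).fit(toy_X, [0, 1] * 10)

print(check_feature_alignment(toy_model, toy_cols))
```

With the real artifacts, the call would be check_feature_alignment(model, feature_cols) right after loading.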
How to Run a Prediction
import pandas as pd
# Example input row (replace 0s with real feature values)
X = pd.DataFrame([{col: 0 for col in feature_cols}])
# Predict class and probability of default
pred = model.predict(X)[0]
proba = model.predict_proba(X)[:, 1][0]
print("Predicted class:", pred)
print("Default probability:", proba)
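Real input rows rarely arrive in exactly the training column order. One way to align them is DataFrame.reindex, sketched below on a toy model with hypothetical column names. Filling missing features with 0 simply mirrors the placeholder row above and is an assumption, not a recommended imputation strategy.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Toy stand-ins for the real model and feature list (hypothetical names).
toy_cols = ["f_a", "f_b", "f_c"]
rng = np.random.default_rng(0)
train = pd.DataFrame(rng.normal(size=(50, 3)), columns=toy_cols)
labels = np.array([0, 1] * 25)
toy_model = RandomForestClassifier(n_estimators=5, random_state=0).fit(train, labels)

# Incoming row: wrong order, one feature missing, one unknown extra column.
raw = pd.DataFrame([{"f_c": 1.0, "f_a": 0.5, "extra": 9}])

# reindex enforces the training order, drops unknown columns, and fills gaps with 0.
X = raw.reindex(columns=toy_cols, fill_value=0)

proba = toy_model.predict_proba(X)[:, 1][0]
print("Default probability:", proba)
```

With the real artifacts, toy_cols would be replaced by feature_cols and toy_model by the loaded model.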
Intended Use
This model is suitable for:
- Academic / university projects
- Experiments on imbalanced classification
- Credit-risk proof-of-concept models
- Research on feature engineering and tabular ML
Not intended for real-world lending decisions, production use, or regulatory environments without proper validation, calibration, and compliance review.