Home Credit – EXT_SOURCE_2 Regression Model (Random Forest)

This repository contains the regression model that won a comparison against a Linear Regression baseline. It predicts the external credit score (EXT_SOURCE_2) in the Home Credit Default Risk dataset.
The model uses extensive feature engineering and captures non-linear financial and demographic patterns using a Random Forest Regressor.

📁 Files Included

regression_winner.pkl — trained Random Forest regression model
feature_columns.json — ordered list of all input features the model expects

Both files must be used together for inference.

🧠 Model Description

Task: Regression
Target: EXT_SOURCE_2 (external credit score proxy)
Model: RandomForestRegressor

Why Predict EXT_SOURCE_2?

EXT_SOURCE_2 is a powerful creditworthiness indicator in the Home Credit dataset.
However, it contains many missing values, and being able to predict it from other borrower features helps improve downstream models.

🏗️ Feature Engineering Summary

The model was trained on a heavily engineered dataset including:

✔ Financial Features

Total income
Credit amount
Annuity
Credit/income ratio
Annuity/income ratio
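
The ratio features above can be sketched as follows. This is a minimal illustration, not the repository's actual pipeline: the AMT_* input names follow the public Home Credit application table, and the output names CREDIT_INCOME_RATIO and ANNUITY_INCOME_RATIO are hypothetical. The real names are whatever feature_columns.json lists.

```python
import numpy as np
import pandas as pd

# Toy applicant rows. The AMT_* names are assumed from the public
# Home Credit application table, not confirmed by this model card.
df = pd.DataFrame({
    "AMT_INCOME_TOTAL": [100000.0, 50000.0, 0.0],
    "AMT_CREDIT": [200000.0, 150000.0, 80000.0],
    "AMT_ANNUITY": [10000.0, 7500.0, 4000.0],
})

# Treat zero income as missing so the division yields NaN instead of inf.
income = df["AMT_INCOME_TOTAL"].replace(0, np.nan)

# Hypothetical output column names.
df["CREDIT_INCOME_RATIO"] = df["AMT_CREDIT"] / income
df["ANNUITY_INCOME_RATIO"] = df["AMT_ANNUITY"] / income
```

Guarding against zero income matters because tree models in scikit-learn reject infinite values.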

✔ Demographic Features

Age
Employment duration
Family size
Dependents

✔ Stability Indicators

Employment stability
Income consistency
Relative wealth vs. region

✔ Missing Value Indicators

MISSING_COUNT
Boolean missing indicators for key features
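
One plausible way to build these indicators is shown below. The choice of key features and the "_MISSING" suffix are assumptions made for illustration; the card does not say which columns received their own indicator.

```python
import numpy as np
import pandas as pd

# Toy data with gaps. Column names are assumed from the public dataset.
df = pd.DataFrame({
    "AMT_ANNUITY": [10000.0, np.nan, 4000.0],
    "DAYS_EMPLOYED": [-1200.0, np.nan, np.nan],
    "CNT_FAM_MEMBERS": [2.0, 1.0, 3.0],
})

# Count missing values per row across the raw columns only,
# before any indicator columns are appended.
raw_cols = list(df.columns)
df["MISSING_COUNT"] = df[raw_cols].isna().sum(axis=1)

# Hypothetical list of "key features" that get a 0/1 indicator.
key_features = ["AMT_ANNUITY", "DAYS_EMPLOYED"]
for col in key_features:
    df[f"{col}_MISSING"] = df[col].isna().astype(int)
```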

✔ Document Indicators

DOCUMENT_COUNT (how many documents applicant supplied)
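
A sketch of how DOCUMENT_COUNT might be derived, assuming it sums the 0/1 FLAG_DOCUMENT_* columns found in the public Home Credit application table. The card does not confirm this exact derivation.

```python
import pandas as pd

# Toy rows with three document flags; the real table has more of them.
df = pd.DataFrame({
    "FLAG_DOCUMENT_2": [0, 1, 0],
    "FLAG_DOCUMENT_3": [1, 1, 0],
    "FLAG_DOCUMENT_4": [0, 1, 0],
    "AMT_CREDIT": [200000.0, 150000.0, 80000.0],
})

# Select every document flag by prefix and add them up per applicant.
doc_cols = [c for c in df.columns if c.startswith("FLAG_DOCUMENT_")]
df["DOCUMENT_COUNT"] = df[doc_cols].sum(axis=1)
```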

✔ Cluster Features (K-Means)

CLUSTER_ID
CLUSTER_DIST

These help the model capture non-linear borrower profiles that ordinary linear regression cannot.
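
The two cluster features could be produced along these lines. The number of clusters, the scaling step, and the synthetic data are all assumptions for the sketch; the card states only that K-Means was used.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for borrower features: two well-separated groups.
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(0.0, 1.0, size=(50, 3)),
    rng.normal(8.0, 1.0, size=(50, 3)),
])

# K-Means is distance-based, so features are standardized first.
X_scaled = StandardScaler().fit_transform(X)

# n_clusters=2 is an arbitrary choice for this toy example.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X_scaled)

# CLUSTER_ID: index of the assigned cluster.
cluster_id = kmeans.labels_

# CLUSTER_DIST: transform() gives the distance to every centroid,
# so the row-wise minimum is the distance to the assigned one.
cluster_dist = kmeans.transform(X_scaled).min(axis=1)
```

For new rows at inference time, the same fitted scaler and K-Means object would need to be reused so that cluster IDs keep their meaning.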

⚙️ Training Configuration

RandomForestRegressor(
    n_estimators=80,
    max_depth=12,
    min_samples_leaf=50,
    n_jobs=-1,
    random_state=42
)

Why these hyperparameters?

max_depth=12 → caps tree depth to limit overfitting
min_samples_leaf=50 → requires at least 50 training rows per leaf, which smooths predictions and makes them more stable
n_estimators=80 → balances accuracy and training speed
n_jobs=-1 → uses all CPU cores

The model was trained on a sample of 50,000 rows to keep training fast.
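
A scaled-down sketch of the sample-then-fit workflow with the configuration above. The data is synthetic, the feature names f1 and f2 are placeholders, and the sample size is reduced from 50,000 to 2,000 so the example runs quickly. How the original rows were sampled is not stated, so uniform random sampling is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the engineered training table.
rng = np.random.default_rng(42)
n_rows = 5000
data = pd.DataFrame({
    "f1": rng.normal(size=n_rows),
    "f2": rng.normal(size=n_rows),
})
data["target"] = 0.5 + 0.1 * np.tanh(data["f1"]) + rng.normal(scale=0.05, size=n_rows)

# Draw a reproducible random sample (the card used 50,000 rows).
sample_size = 2000
train = data.sample(n=sample_size, random_state=42)

# Same hyperparameters as the configuration listed above.
model = RandomForestRegressor(
    n_estimators=80,
    max_depth=12,
    min_samples_leaf=50,
    n_jobs=-1,
    random_state=42,
)
model.fit(train[["f1", "f2"]], train["target"])
```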

📊 Model Performance (Held-Out Test Set)

Metric	Score
MAE	~0.138
MSE	~0.0287
RMSE	~0.170
R²	~0.21
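
For reference, the four metrics in the table can be computed with scikit-learn as sketched below. The y_true and y_pred arrays are made-up toy values, not the model's real test predictions. The last line checks that the reported MSE and RMSE agree with each other.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Toy values for illustration only.
y_true = np.array([0.2, 0.5, 0.7, 0.4])
y_pred = np.array([0.3, 0.4, 0.6, 0.5])

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)  # RMSE is the square root of MSE
r2 = r2_score(y_true, y_pred)

# Consistency check on the reported scores: sqrt(0.0287) is about 0.169,
# which matches the reported RMSE of ~0.170.
rmse_from_reported_mse = np.sqrt(0.0287)
```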

Interpretation

An R² ≈ 0.21 means the model captures some meaningful signal but the dataset contains a lot of noise.

Performance is modestly better than that of a linear model, which scored R² ≈ 0.18.

The model benefits significantly from:

Ratio features

Missing-value patterns

K-Means cluster structure

Why this model won

✔ Outperformed Linear Regression on all metrics
✔ Handles non-linear interactions between income, credit, stability, and documents
✔ Much less sensitive to noise and missingness
✔ Best trade-off between accuracy, generalization, and speed

⬇️ How to Download the Model

from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="guyshilo12/guyshilo-loan-default-regression-model",
    filename="regression_winner.pkl"
)

cols_path = hf_hub_download(
    repo_id="guyshilo12/guyshilo-loan-default-regression-model",
    filename="feature_columns.json"
)

📦 How to Load the Model

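import pickle, json

# model_path and cols_path are the local cache paths returned by
# hf_hub_download above. If you fetched the files manually, point
# these at your own copies instead.

# Load model (pickle can execute arbitrary code, so only load files you trust)
with open(model_path, "rb") as f:
    model = pickle.load(f)

# Load feature list
with open(cols_path, "r") as f:
    feature_cols = json.load(f)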

🔮 How to Run a Prediction

import pandas as pd

# Example input (replace zeros with real values)
X = pd.DataFrame([{col: 0 for col in feature_cols}])

pred = model.predict(X)[0]

print("Predicted EXT_SOURCE_2:", pred)
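
The prediction step can also feed imputation: predict EXT_SOURCE_2 only for rows where it is missing. The sketch below is self-contained, so it trains a small stand-in forest on synthetic data rather than loading regression_winner.pkl. The two-column feature_cols list and the EXT_SOURCE_2_IMPUTED name are hypothetical. With the real model, the same reindex call would align a DataFrame to the column order stored in feature_columns.json.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Stand-in for the list loaded from feature_columns.json.
feature_cols = ["f1", "f2"]

# Synthetic table whose columns are deliberately in a different order.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(200, 2)), columns=["f2", "f1"])
df["EXT_SOURCE_2"] = 0.5 + 0.1 * df["f1"]
df.loc[df.index[:20], "EXT_SOURCE_2"] = np.nan  # simulate missing scores

known = df["EXT_SOURCE_2"].notna()

# Stand-in model; in practice this would be the unpickled forest.
model = RandomForestRegressor(n_estimators=20, min_samples_leaf=5, random_state=42)
model.fit(df.loc[known, feature_cols], df.loc[known, "EXT_SOURCE_2"])

# Align rows with a missing score to the expected column order.
X_missing = df.loc[~known].reindex(columns=feature_cols)

# Keep observed scores and fill only the gaps with predictions.
df["EXT_SOURCE_2_IMPUTED"] = df["EXT_SOURCE_2"]
df.loc[~known, "EXT_SOURCE_2_IMPUTED"] = model.predict(X_missing)
```

Writing to a separate column keeps the original values intact, so downstream models can still tell observed scores from imputed ones.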

🏆 Intended Use

This model is ideal for:

Academic ML projects

Feature engineering demonstrations

Credit-risk modeling research

Improving missing-value imputation strategies

Not intended for production lending decisions without regulatory review.
