Home Credit – EXT_SOURCE_2 Regression Model (Random Forest)

This repository contains the winning regression model trained to predict the external credit score (EXT_SOURCE_2) in the Home Credit Default Risk dataset.
The model uses extensive feature engineering and captures non-linear financial and demographic patterns using a Random Forest Regressor.


📁 Files Included

  • regression_winner.pkl: trained Random Forest regression model
  • feature_columns.json: ordered list of all input features the model expects

Both files must be used together for inference.


🧠 Model Description

Task: Regression
Target: EXT_SOURCE_2 (external credit score proxy)
Model: RandomForestRegressor

Why Predict EXT_SOURCE_2?

EXT_SOURCE_2 is a powerful creditworthiness indicator in the Home Credit dataset.
However, it contains many missing values, and being able to predict it from other borrower features helps improve downstream models.
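
As a rough illustration of that imputation idea, the sketch below fills missing EXT_SOURCE_2 values with predictions and records which rows were filled. The function name, the `_IMPUTED` flag column, and the constant stand-in predictor are assumptions made for this example; in practice the predictor would wrap `model.predict`.

```python
from typing import Callable


def fill_missing_scores(
    rows: list[dict],
    predict: Callable[[dict], float],
    target: str = "EXT_SOURCE_2",
) -> list[dict]:
    """Return copies of rows with missing target values replaced by predictions."""
    filled = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are left unchanged
        if new_row.get(target) is None:
            # Predict from the other features only; the target itself is absent.
            features = {k: v for k, v in new_row.items() if k != target}
            new_row[target] = predict(features)
            new_row[target + "_IMPUTED"] = True   # hypothetical flag column
        else:
            new_row[target + "_IMPUTED"] = False
        filled.append(new_row)
    return filled


def constant_predictor(features: dict) -> float:
    """Stand-in predictor for illustration; not the trained model."""
    return 0.5


rows = [
    {"AMT_CREDIT": 100000.0, "EXT_SOURCE_2": 0.62},
    {"AMT_CREDIT": 250000.0, "EXT_SOURCE_2": None},
]
filled = fill_missing_scores(rows, constant_predictor)
```

Keeping a flag for imputed rows lets a downstream model distinguish observed scores from predicted ones.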


🏗️ Feature Engineering Summary

The model was trained on a heavily engineered dataset including:

✔ Financial Features

  • Total income
  • Credit amount
  • Annuity
  • Credit/income ratio
  • Annuity/income ratio

✔ Demographic Features

  • Age
  • Employment duration
  • Family size
  • Dependents

✔ Stability Indicators

  • Employment stability
  • Income consistency
  • Relative wealth vs. region

✔ Missing Value Indicators

  • MISSING_COUNT
  • Boolean missing indicators for key features

✔ Document Indicators

  • DOCUMENT_COUNT (how many documents applicant supplied)
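
The ratio, missing-value, and document features listed so far could be derived per record roughly as follows. This is a minimal sketch, not the author's pipeline: `AMT_INCOME_TOTAL`, `AMT_CREDIT`, `AMT_ANNUITY`, and `FLAG_DOCUMENT_*` are raw Home Credit application columns, while `engineer_row` and the two ratio column names are hypothetical. The exact names the model expects are the ones listed in feature_columns.json.

```python
import math


def engineer_row(row: dict) -> dict:
    """Add ratio, missing-value, and document-count features to one record."""
    out = dict(row)

    def is_missing(v):
        # Treat both None and NaN as missing.
        return v is None or (isinstance(v, float) and math.isnan(v))

    def safe_ratio(num, den):
        # Return None instead of dividing by a missing or zero denominator.
        if is_missing(num) or is_missing(den) or den == 0:
            return None
        return num / den

    income = row.get("AMT_INCOME_TOTAL")
    out["CREDIT_INCOME_RATIO"] = safe_ratio(row.get("AMT_CREDIT"), income)    # hypothetical name
    out["ANNUITY_INCOME_RATIO"] = safe_ratio(row.get("AMT_ANNUITY"), income)  # hypothetical name

    # Count missing raw inputs before any imputation takes place.
    out["MISSING_COUNT"] = sum(is_missing(v) for v in row.values())

    # Sum the binary FLAG_DOCUMENT_* columns: number of documents supplied.
    out["DOCUMENT_COUNT"] = sum(
        int(v) for k, v in row.items()
        if k.startswith("FLAG_DOCUMENT_") and not is_missing(v)
    )
    return out


example = engineer_row({
    "AMT_INCOME_TOTAL": 200000.0,
    "AMT_CREDIT": 500000.0,
    "AMT_ANNUITY": None,
    "FLAG_DOCUMENT_3": 1,
    "FLAG_DOCUMENT_6": 0,
})
```

Counting missing values on the raw record matters: once ratios are added, a missing annuity would otherwise be counted twice.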

✔ Cluster Features (K-Means)

  • CLUSTER_ID
  • CLUSTER_DIST

These help the model capture non-linear borrower profiles that ordinary linear regression cannot.
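
The model card does not define CLUSTER_DIST. The sketch below assumes it is the Euclidean distance to the nearest K-Means centroid and that CLUSTER_ID is that centroid's index. The centroids here are made up for illustration. With scikit-learn, `KMeans.predict` returns the nearest-centroid index and `KMeans.transform` returns the distance to every centroid.

```python
import math


def cluster_features(point: list[float], centroids: list[list[float]]) -> tuple[int, float]:
    """Return (CLUSTER_ID, CLUSTER_DIST) for one point given fitted centroids."""
    distances = [math.dist(point, c) for c in centroids]
    # Index of the smallest distance is the assigned cluster.
    cluster_id = min(range(len(centroids)), key=distances.__getitem__)
    return cluster_id, distances[cluster_id]


# Hypothetical centroids in a 2-D scaled feature space.
centroids = [[0.0, 0.0], [3.0, 4.0]]
cluster_id, cluster_dist = cluster_features([3.0, 3.0], centroids)
```

The distance adds information the ID alone lacks: two borrowers in the same cluster can sit near its centre or at its edge.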


⚙️ Training Configuration

RandomForestRegressor(
    n_estimators=80,
    max_depth=12,
    min_samples_leaf=50,
    n_jobs=-1,
    random_state=42
)

Why these hyperparameters?

  • max_depth=12 → caps tree depth to limit overfitting

  • min_samples_leaf=50 → requires at least 50 training rows in every leaf, which smooths and stabilizes predictions

  • n_estimators=80 → balances accuracy and training speed

  • n_jobs=-1 → uses all CPU cores

The model was trained on a 50,000-row sample of the dataset to keep training time short.
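
A quick back-of-the-envelope check shows how the depth and leaf-size limits interact. It assumes each tree sees at most the 50,000 sampled rows; the actual training split is not stated, so treat the result as an upper bound.

```python
n_rows = 50_000          # sample size stated above (assumed upper bound on rows per tree)
min_samples_leaf = 50
max_depth = 12

leaves_by_leaf_size = n_rows // min_samples_leaf   # every leaf needs at least 50 rows
leaves_by_depth = 2 ** max_depth                   # a full binary tree of depth 12
max_leaves = min(leaves_by_leaf_size, leaves_by_depth)

print(leaves_by_leaf_size, leaves_by_depth, max_leaves)  # prints: 1000 4096 1000
```

Under these assumptions the leaf-size limit is the tighter of the two constraints: no tree can have more than 1,000 leaves, well below the 4,096 that depth 12 alone would allow.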


📊 Model Performance (Held-Out Test Set)

Metric   Score
MAE      ~0.138
MSE      ~0.0287
RMSE     ~0.170
R²       ~0.21
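
For reference, the four metrics can be computed from true and predicted values as in this sketch. The toy numbers are for illustration only and are not the model's predictions. The last line checks that the reported RMSE is consistent with the reported MSE.

```python
import math


def regression_metrics(y_true: list[float], y_pred: list[float]) -> dict:
    """Compute MAE, MSE, RMSE, and R² for paired true/predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}


# Toy values for illustration only.
metrics = regression_metrics([0.2, 0.5, 0.8], [0.3, 0.5, 0.6])

# Consistency check on the reported scores: sqrt(0.0287) is about 0.169.
reported_rmse_from_mse = math.sqrt(0.0287)
```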

Interpretation

An R² ≈ 0.21 means the model explains roughly 21% of the variance in EXT_SOURCE_2: it captures some meaningful signal, but most of the variation is not explained by the available features.

Performance is modestly better than a linear model, which scored R² ≈ 0.18.

The model benefits significantly from:

  • Ratio features
  • Missing-value patterns
  • K-Means cluster structure


Why this model won

✔ Outperformed Linear Regression on all metrics
✔ Handles non-linear interactions between income, credit, stability, and documents
✔ Much less sensitive to noise and missingness
✔ Best trade-off between accuracy, generalization, and speed


⬇️ How to Download the Model

from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="guyshilo12/guyshilo-loan-default-regression-model",
    filename="regression_winner.pkl"
)

cols_path = hf_hub_download(
    repo_id="guyshilo12/guyshilo-loan-default-regression-model",
    filename="feature_columns.json"
)

📦 How to Load the Model

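import json
import pickle

# model_path and cols_path are the local paths returned by hf_hub_download above.
# Only unpickle files from a source you trust: pickle can execute arbitrary code.
with open(model_path, "rb") as f:
    model = pickle.load(f)

# Load the ordered feature list
with open(cols_path, "r") as f:
    feature_cols = json.load(f)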
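
Because both files must be used together, a quick sanity check after loading can catch a mismatched pair. The helper below is a hypothetical sketch. It relies on `n_features_in_`, an attribute that fitted scikit-learn estimators expose, and uses a stub object in place of the real model so the example runs on its own.

```python
from types import SimpleNamespace


def check_feature_count(model, feature_cols):
    """Raise if the model's expected feature count differs from the column list."""
    expected = getattr(model, "n_features_in_", None)
    if expected is not None and expected != len(feature_cols):
        raise ValueError(
            f"model expects {expected} features, "
            f"but feature_columns.json lists {len(feature_cols)}"
        )
    return expected


# Stub standing in for the loaded model, for illustration only.
stub_model = SimpleNamespace(n_features_in_=3)
n_expected = check_feature_count(stub_model, ["a", "b", "c"])
```

With the real objects, the call would be `check_feature_count(model, feature_cols)`.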

🔮 How to Run a Prediction

import pandas as pd

# Example input (replace zeros with real values)
X = pd.DataFrame([{col: 0 for col in feature_cols}])

pred = model.predict(X)[0]

print("Predicted EXT_SOURCE_2:", pred)
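
To replace the placeholder zeros with real values, the input can be assembled in the exact order the model expects. The helper and the three-column demo list below are hypothetical. Filling absent features with 0.0 mirrors the placeholder above, but the model card does not say how missing inputs were handled in training, so that default is an assumption.

```python
def build_input_row(values: dict, feature_cols: list[str], default: float = 0.0) -> list[float]:
    """Order the supplied values by feature_cols, filling absent features with a default."""
    unknown = set(values) - set(feature_cols)
    if unknown:
        # A misspelled feature name would otherwise be silently ignored.
        raise KeyError(f"unknown feature names: {sorted(unknown)}")
    return [values.get(col, default) for col in feature_cols]


# Hypothetical three-column feature list, for illustration only.
feature_cols_demo = ["AMT_CREDIT", "AMT_INCOME_TOTAL", "MISSING_COUNT"]
row = build_input_row(
    {"AMT_INCOME_TOTAL": 200000.0, "AMT_CREDIT": 500000.0},
    feature_cols_demo,
)
```

The resulting list can then be wrapped as `pd.DataFrame([row], columns=feature_cols)` and passed to `model.predict`.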

🏆 Intended Use

This model is ideal for:

  • Academic ML projects
  • Feature engineering demonstrations
  • Credit-risk modeling research
  • Improving missing-value imputation strategies

Not intended for production lending decisions without regulatory review.
