Home Credit: EXT_SOURCE_2 Regression Model (Random Forest)
This repository contains the winning regression model trained to predict the external credit score (EXT_SOURCE_2) in the Home Credit Default Risk dataset.
The model uses extensive feature engineering and captures non-linear financial and demographic patterns using a Random Forest Regressor.
Files Included
- regression_winner.pkl: trained Random Forest regression model
- feature_columns.json: ordered list of all input features the model expects
Both files must be used together for inference.
Model Description
Task: Regression
Target: EXT_SOURCE_2 (external credit score proxy)
Model: RandomForestRegressor
Why Predict EXT_SOURCE_2?
EXT_SOURCE_2 is a powerful creditworthiness indicator in the Home Credit dataset.
However, it contains many missing values, and being able to predict it from other borrower features helps improve downstream models.
Feature Engineering Summary
The model was trained on a heavily engineered dataset including:
Financial Features
- Total income
- Credit amount
- Annuity
- Credit/income ratio
- Annuity/income ratio
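
The two ratio features above could be derived roughly as follows. This is a minimal sketch, not the original training code: it assumes the standard Home Credit column names `AMT_INCOME_TOTAL`, `AMT_CREDIT`, and `AMT_ANNUITY`, and the output names `CREDIT_INCOME_RATIO` and `ANNUITY_INCOME_RATIO` are hypothetical (check `feature_columns.json` for the names the model actually expects).

```python
import numpy as np
import pandas as pd

def add_ratio_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with credit/income and annuity/income ratios."""
    out = df.copy()
    # Treat zero income as missing so the division yields NaN instead of inf.
    income = out["AMT_INCOME_TOTAL"].replace(0, np.nan)
    # Output column names are hypothetical, chosen for illustration only.
    out["CREDIT_INCOME_RATIO"] = out["AMT_CREDIT"] / income
    out["ANNUITY_INCOME_RATIO"] = out["AMT_ANNUITY"] / income
    return out

sample = pd.DataFrame({
    "AMT_INCOME_TOTAL": [100000.0, 0.0],
    "AMT_CREDIT": [250000.0, 50000.0],
    "AMT_ANNUITY": [12500.0, 5000.0],
})
ratios = add_ratio_features(sample)
print(ratios[["CREDIT_INCOME_RATIO", "ANNUITY_INCOME_RATIO"]])
```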
Demographic Features
- Age
- Employment duration
- Family size
- Dependents
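
As a rough illustration of the demographic features, age and employment duration can be derived from the dataset's day-offset columns. The sketch assumes the raw Home Credit columns `DAYS_BIRTH` and `DAYS_EMPLOYED` (negative day counts relative to the application date) and the dataset's placeholder value 365243 in `DAYS_EMPLOYED`; the output names `AGE_YEARS` and `EMPLOYED_YEARS` are hypothetical. Family size and dependents would typically come straight from `CNT_FAM_MEMBERS` and `CNT_CHILDREN`.

```python
import numpy as np
import pandas as pd

def add_demographic_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with age and employment duration in years."""
    out = df.copy()
    # DAYS_BIRTH is a negative day count, so negate it before converting.
    out["AGE_YEARS"] = -out["DAYS_BIRTH"] / 365.25
    # 365243 is a placeholder in DAYS_EMPLOYED, not a real duration: mask it.
    employed = out["DAYS_EMPLOYED"].where(out["DAYS_EMPLOYED"] != 365243)
    out["EMPLOYED_YEARS"] = -employed / 365.25
    return out

people = pd.DataFrame({
    "DAYS_BIRTH": [-14610, -21915],
    "DAYS_EMPLOYED": [-7305, 365243],
})
demo = add_demographic_features(people)
print(demo[["AGE_YEARS", "EMPLOYED_YEARS"]])
```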
Stability Indicators
- Employment stability
- Income consistency
- Relative wealth vs. region
Missing Value Indicators
- MISSING_COUNT
- Boolean missing indicators for key features
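
A sketch of how the missing-value features might be built: `MISSING_COUNT` as the number of missing cells per row, plus one 0/1 flag per key column. Which columns count as "key features" is not stated here, so `key_cols` and the `_MISSING` suffix are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def add_missing_features(df: pd.DataFrame, key_cols: list) -> pd.DataFrame:
    """Return a copy of df with a per-row missing count and per-column flags."""
    out = df.copy()
    # Count missing cells per row, using the original columns only.
    out["MISSING_COUNT"] = df.isna().sum(axis=1)
    for col in key_cols:
        # Hypothetical naming scheme: <column>_MISSING, stored as 0/1.
        out[f"{col}_MISSING"] = df[col].isna().astype(int)
    return out

raw_rows = pd.DataFrame({
    "AMT_ANNUITY": [12500.0, np.nan],
    "OWN_CAR_AGE": [np.nan, np.nan],
})
flagged = add_missing_features(raw_rows, key_cols=["AMT_ANNUITY"])
print(flagged)
```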
Document Indicators
- DOCUMENT_COUNT (how many documents the applicant supplied)
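
One plausible way to compute `DOCUMENT_COUNT` is to sum the dataset's binary `FLAG_DOCUMENT_*` columns per applicant. This is an assumption about how the feature was built, shown only as a sketch.

```python
import pandas as pd

def add_document_count(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with the number of supplied documents per row."""
    out = df.copy()
    # Assumes each FLAG_DOCUMENT_* column is 1 if the document was supplied.
    doc_cols = [c for c in out.columns if c.startswith("FLAG_DOCUMENT_")]
    out["DOCUMENT_COUNT"] = out[doc_cols].sum(axis=1)
    return out

docs = pd.DataFrame({
    "FLAG_DOCUMENT_2": [0, 0],
    "FLAG_DOCUMENT_3": [1, 0],
    "FLAG_DOCUMENT_6": [1, 0],
    "AMT_CREDIT": [250000.0, 90000.0],
})
counted = add_document_count(docs)
print(counted["DOCUMENT_COUNT"].tolist())
```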
Cluster Features (K-Means)
- CLUSTER_ID
- CLUSTER_DIST
These help the model capture non-linear borrower profiles that ordinary linear regression cannot.
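
The cluster features can be sketched with scikit-learn's `KMeans`. The number of clusters, the input features, and the scaling step are not documented here, so everything below (4 clusters, standardized synthetic inputs, `CLUSTER_DIST` read as the distance to the nearest centroid) is an assumption for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for numeric borrower features (not real data).
rng = np.random.default_rng(42)
features = rng.normal(size=(200, 3))

# K-Means is distance-based, so features are standardized first.
scaled = StandardScaler().fit_transform(features)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42).fit(scaled)

# CLUSTER_ID: index of the nearest centroid for each row.
cluster_id = kmeans.predict(scaled)
# transform() gives the distance to every centroid; the row minimum is
# the distance to the assigned (nearest) centroid.
cluster_dist = kmeans.transform(scaled).min(axis=1)
```

At inference time the same fitted scaler and `KMeans` object would have to be reused so that cluster IDs stay consistent with training.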
Training Configuration
```python
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(
    n_estimators=80,
    max_depth=12,
    min_samples_leaf=50,
    n_jobs=-1,
    random_state=42,
)
```
Why these hyperparameters?
- max_depth=12: caps tree depth to limit overfitting
- min_samples_leaf=50: requires at least 50 training samples in every leaf, which smooths and stabilizes predictions
- n_estimators=80: balances accuracy and training speed
- n_jobs=-1: uses all CPU cores

The model was trained on a sample of 50,000 rows to keep training fast.
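
The sampling and fitting steps might look like the sketch below. It runs on synthetic data so it is self-contained; the column names `f0`..`f4`, the 80/20 split, and the synthetic target are assumptions, and only the hyperparameters and the 50,000-row cap come from this README.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the engineered dataset (not real Home Credit data).
rng = np.random.default_rng(42)
n_rows = 2000
data = pd.DataFrame(rng.normal(size=(n_rows, 5)),
                    columns=[f"f{i}" for i in range(5)])
data["EXT_SOURCE_2"] = (0.5 + 0.1 * np.tanh(data["f0"] * data["f1"])
                        + rng.normal(scale=0.05, size=n_rows))

# Cap the training data at 50,000 rows, as described above.
sampled = data.sample(n=min(50_000, len(data)), random_state=42)
X = sampled.drop(columns="EXT_SOURCE_2")
y = sampled["EXT_SOURCE_2"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)  # split ratio is an assumption

sketch_model = RandomForestRegressor(
    n_estimators=80, max_depth=12, min_samples_leaf=50,
    n_jobs=-1, random_state=42)
sketch_model.fit(X_train, y_train)
test_pred = sketch_model.predict(X_test)
```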
Model Performance (Held-Out Test Set)
| Metric | Score |
|---|---|
| MAE | ~0.138 |
| MSE | ~0.0287 |
| RMSE | ~0.170 |
| R² | ~0.21 |
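
For reference, the four metrics in the table can be computed with `sklearn.metrics` as sketched below. The arrays are toy values, not the model's real predictions. RMSE is the square root of MSE, which matches the table: the square root of 0.0287 is about 0.169.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Toy values for illustration only.
y_true = np.array([0.2, 0.5, 0.7, 0.6])
y_pred = np.array([0.3, 0.5, 0.6, 0.8])

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)            # RMSE is the square root of MSE
r2 = r2_score(y_true, y_pred)  # 1 - SS_res / SS_tot

print(f"MAE={mae:.3f}  MSE={mse:.4f}  RMSE={rmse:.3f}  R2={r2:.3f}")
```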
Interpretation
An R² of about 0.21 means the model explains roughly 21% of the variance in EXT_SOURCE_2: it captures some meaningful signal, but the dataset contains a lot of noise.
Performance is better than the linear model, which scored an R² of ~0.18.
The model benefits significantly from:
- Ratio features
- Missing-value patterns
- K-Means cluster structure
Why this model won
- Outperformed Linear Regression on all metrics
- Handles non-linear interactions between income, credit, stability, and documents
- Much less sensitive to noise and missingness
- Best trade-off between accuracy, generalization, and speed
How to Download the Model
```python
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="guyshilo12/guyshilo-loan-default-regression-model",
    filename="regression_winner.pkl",
)
cols_path = hf_hub_download(
    repo_id="guyshilo12/guyshilo-loan-default-regression-model",
    filename="feature_columns.json",
)
```
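How to Load the Model
```python
import json
import pickle

# These paths assume both files are in the working directory.
# If you downloaded them with hf_hub_download, use model_path and cols_path instead.

# Load model (only unpickle files from a source you trust)
with open("regression_winner.pkl", "rb") as f:
    model = pickle.load(f)

# Load feature list
with open("feature_columns.json", "r") as f:
    feature_cols = json.load(f)
```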
How to Run a Prediction
```python
import pandas as pd

# Example input (replace zeros with real values)
X = pd.DataFrame([{col: 0 for col in feature_cols}])
pred = model.predict(X)[0]
print("Predicted EXT_SOURCE_2:", pred)
```
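
Because the model expects its inputs in the exact order stored in feature_columns.json, it can help to align real data explicitly before calling `predict`. The helper below is a hypothetical sketch (`align_features` is not part of this repository) and the three-column feature list is a stand-in for the real one.

```python
import pandas as pd

def align_features(raw: pd.DataFrame, feature_cols: list) -> pd.DataFrame:
    """Select feature_cols from raw, in order, failing loudly on gaps."""
    missing = [c for c in feature_cols if c not in raw.columns]
    if missing:
        # Silently filling defaults could produce misleading predictions.
        raise ValueError(f"Missing required features: {missing}")
    # Extra columns are dropped; column order follows feature_cols.
    return raw[feature_cols]

# Stand-in feature list for illustration; use the loaded JSON in practice.
demo_feature_cols = ["AMT_CREDIT", "AMT_ANNUITY", "DOCUMENT_COUNT"]
raw = pd.DataFrame([{
    "DOCUMENT_COUNT": 1,
    "EXTRA_COLUMN": 9,
    "AMT_ANNUITY": 12500.0,
    "AMT_CREDIT": 250000.0,
}])
aligned = align_features(raw, demo_feature_cols)
print(list(aligned.columns))
```

With the real artifacts loaded, the call would be `model.predict(align_features(raw, feature_cols))`.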
Intended Use
This model is ideal for:
- Academic ML projects
- Feature engineering demonstrations
- Credit-risk modeling research
- Improving missing-value imputation strategies
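
As an example of the imputation use case, predictions can fill only the rows where EXT_SOURCE_2 is missing, leaving observed values untouched. This is a sketch under assumptions: `impute_ext_source_2` and the `EXT_SOURCE_2_IMPUTED` flag are hypothetical, and a constant `DummyRegressor` stands in for the pickled model so the snippet runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyRegressor

def impute_ext_source_2(df, model, feature_cols):
    """Fill missing EXT_SOURCE_2 with model predictions; keep observed values."""
    out = df.copy()
    missing = out["EXT_SOURCE_2"].isna()
    if missing.any():
        preds = model.predict(out.loc[missing, feature_cols])
        out.loc[missing, "EXT_SOURCE_2"] = preds
    # Flag imputed rows so downstream models can tell them apart.
    out["EXT_SOURCE_2_IMPUTED"] = missing.astype(int)
    return out

# Stand-in model that always predicts 0.5 (replace with the loaded model).
demo_cols = ["AMT_CREDIT", "AMT_ANNUITY"]
stand_in = DummyRegressor(strategy="constant", constant=0.5)
stand_in.fit(pd.DataFrame(np.zeros((2, 2)), columns=demo_cols), [0.5, 0.5])

applicants = pd.DataFrame({
    "AMT_CREDIT": [250000.0, 90000.0, 400000.0],
    "AMT_ANNUITY": [12500.0, 6000.0, 21000.0],
    "EXT_SOURCE_2": [0.62, np.nan, np.nan],
})
filled = impute_ext_source_2(applicants, stand_in, demo_cols)
print(filled)
```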
Not intended for production lending decisions without regulatory review.