Instructions to use shak9345/loan-interest-rate-predictor with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Scikit-learn
How to use shak9345/loan-interest-rate-predictor with Scikit-learn:
```python
from huggingface_hub import hf_hub_download
import joblib

model = joblib.load(
    hf_hub_download("shak9345/loan-interest-rate-predictor", "sklearn_model.joblib")
)
# only load pickle files from sources you trust
# read more about it here https://skops.readthedocs.io/en/stable/persistence.html
```
- Notebooks
- Google Colab
- Kaggle
Lending Club Loan Analysis: Predictive Modeling and Risk Assessment
Data Science Project - Assignment 2 | Reichman University
Author: Shaked Rokach
1. Project Overview & Business Logic
This project presents an end-to-end data science pipeline applied to a Lending Club dataset of more than 890,000 loan records (2007-2015). The core objective is to build a modeling system for Risk-Based Pricing.
In the banking sector, the ability to accurately predict interest rates and classify risk determines the profitability and stability of the institution. This project addresses this by developing two interconnected systems:
- Regression Layer: Predicting the precise interest rate based on complex borrower profiles.
- Classification Layer: Implementing a binary decision-making tool to identify "High Risk" vs. "Low Risk" applicants.
The analysis moves beyond basic modeling by integrating Unsupervised Learning (Clustering) to uncover latent borrower segments, which are then used as high-signal features for supervised models.
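The idea of feeding cluster membership into the supervised models can be sketched as follows. This is a minimal illustration on synthetic data, not the project's actual code: the feature matrix, the four input features, and the one-hot encoding of the cluster label are all assumptions. K=3 matches the cluster count reported later in this card.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for numeric borrower features (assumption, not real data).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

# Fit the scaler and the clustering on the training split only,
# so the test split never influences the segment definitions.
scaler = StandardScaler().fit(X_train)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
kmeans.fit(scaler.transform(X_train))

train_labels = kmeans.predict(scaler.transform(X_train))
test_labels = kmeans.predict(scaler.transform(X_test))

# One-hot encode the cluster id and append it as extra feature columns.
X_train_aug = np.hstack([X_train, np.eye(3)[train_labels]])
X_test_aug = np.hstack([X_test, np.eye(3)[test_labels]])

print(X_train_aug.shape, X_test_aug.shape)
```

Fitting the clustering on the training split and only calling `predict` on the test split keeps the segmentation step consistent with the leakage-prevention goal described in the preprocessing section.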
2. Research Objective
To what extent can domain-driven feature engineering and borrower segmentation improve the precision of loan pricing and risk classification in a non-linear financial environment?
3. Data Integrity & Preprocessing Pipelines
Handling nearly a million records required a strict, repeatable cleaning and preprocessing strategy. To ensure reproducibility and prevent data leakage, I used Scikit-Learn Pipelines and ColumnTransformers, which keep imputation, encoding, and scaling inside a single fitted object.
Preprocessing Workflow:
- Numerical Features: Handled via Median Imputation to ensure robustness against financial outliers.
- Categorical Features: Applied One-Hot Encoding to variables such as home ownership and verification status to transform them into a format suitable for machine learning algorithms.
- Scaling: Features were standardized using StandardScaler to ensure that the scale of variables (e.g., annual income vs. loan amount) does not bias the model's coefficients.
- The "Outlier" Philosophy: I made a strategic decision to retain high-income outliers. In credit markets, high-earners represent a legitimate and high-value segment; removing them would lead to an under-fitted model that fails to recognize the behaviors of prime customers.
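The workflow above can be sketched with a `ColumnTransformer` that routes numeric and categorical columns through separate pipelines. This is a minimal sketch under stated assumptions: the column names and the tiny data frame are hypothetical, and the project's real schema and settings may differ.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical column names; the real dataset's schema may differ.
numeric_cols = ["annual_inc", "loan_amnt", "dti"]
categorical_cols = ["home_ownership", "verification_status"]

# Numeric branch: median imputation, then standardization.
numeric_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

preprocessor = ColumnTransformer(
    [
        ("num", numeric_pipe, numeric_cols),
        # Categorical branch: one-hot encoding; unseen categories become all zeros.
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    sparse_threshold=0.0,  # always return a dense array
)

# Tiny synthetic frame with missing values and one high-income outlier.
df = pd.DataFrame({
    "annual_inc": [45000.0, 1200000.0, np.nan, 60000.0],
    "loan_amnt": [10000.0, 35000.0, 8000.0, np.nan],
    "dti": [18.2, 5.1, 27.9, 14.0],
    "home_ownership": ["RENT", "OWN", "MORTGAGE", "RENT"],
    "verification_status": ["Verified", "Not Verified", "Verified", "Source Verified"],
})

X = preprocessor.fit_transform(df)
print(X.shape)
```

In a real workflow the preprocessor would be fitted on the training split only and then applied to validation and test data with `transform`, which is how the pipeline guards against leakage.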
Correlation Heatmap Analysis
The Heatmap visualizes the Pearson correlation coefficients between numerical variables.
- Purpose: This analysis was critical for identifying multicollinearity and determining the primary signals for interest rate determination.
- Key Insight: While the institutional grade is the strongest anchor, its interaction with debt-to-income (DTI) ratios is non-linear, which linear Pearson coefficients capture only partially. This justified the move to ensemble architectures.
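A correlation matrix of this kind, together with a simple multicollinearity check, might be computed as below. The data is synthetic, and the column names and the 0.9 threshold are illustrative assumptions rather than values taken from the project.

```python
import numpy as np
import pandas as pd

# Synthetic data: installment is built to be nearly collinear with loan_amnt.
rng = np.random.default_rng(0)
n = 500
loan_amnt = rng.uniform(1000, 35000, n)
df = pd.DataFrame({
    "loan_amnt": loan_amnt,
    "installment": loan_amnt * 0.03 + rng.normal(0, 20, n),
    "dti": rng.uniform(0, 40, n),
})

# Pearson correlation between every pair of numeric columns.
corr = df.corr(method="pearson")

# Flag pairs whose absolute correlation exceeds an (assumed) threshold.
threshold = 0.9
pairs = [
    (a, b, round(float(corr.loc[a, b]), 3))
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]

print(corr.round(2))
print("Highly correlated pairs:", pairs)
```

The `corr` frame can be passed directly to a plotting function such as `seaborn.heatmap` to produce the visual form of the analysis.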
4. Feature Engineering & Unsupervised Learning
Domain-specific feature engineering was applied to simulate the analytical process of a human credit analyst.
The Elbow Method: Scientific Validation of K
To segment the borrower population, I implemented K-Means clustering. The Elbow Plot tracks "Inertia" (the sum of squared distances within clusters).
- Result: The inertia curve flattens after K=3, so K=3 was selected as the number of clusters. It provides good descriptive power without over-segmenting the borrower population.
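The elbow procedure can be illustrated with a short loop over candidate values of K. The sketch below uses synthetic blobs with three true groups as a stand-in for the borrower features, so the range of K and the data itself are assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with three well-separated groups (not Lending Club data).
X, _ = make_blobs(n_samples=600, centers=3, cluster_std=1.0, random_state=42)
X = StandardScaler().fit_transform(X)

# Fit K-Means for each candidate K and record the inertia
# (sum of squared distances from each point to its assigned centroid).
inertias = {}
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias[k] = km.inertia_

for k, value in inertias.items():
    print(f"K={k}: inertia={value:.1f}")
```

The elbow is the value of K after which additional clusters yield only small reductions in inertia. Because reading it is a visual heuristic, it is often cross-checked with a second criterion such as the silhouette score.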
PCA: Dimensionality Reduction & Cluster Verification
Since high-dimensional clusters are impossible to visualize directly, I utilized Principal Component Analysis (PCA) to project the data into a 2D space.
- Cluster Personas:
- Cluster 0 (Prime Borrowers): High income, long credit history, and low debt-to-income ratios.
- Cluster 1 (High Leverage): Significant existing debt levels relative to income.
- Cluster 2 (Emerging Credit): Younger credit histories with moderate financial stability.
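A 2D projection of clustered data along these lines could look like the following sketch. The six synthetic features are an assumption, and the cluster ids it produces are arbitrary, so they do not correspond to the personas listed above.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: six features, three groups (not the project's data).
X, _ = make_blobs(n_samples=900, n_features=6, centers=3, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

# Cluster in the full feature space, then project to 2D only for inspection.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)

pca = PCA(n_components=2, random_state=0)
coords = pca.fit_transform(X_scaled)

explained = pca.explained_variance_ratio_
print(f"Variance explained by 2 components: {explained.sum():.1%}")

# Centroid of each cluster in the projected space.
for k in range(3):
    cx, cy = coords[labels == k].mean(axis=0)
    print(f"Cluster {k}: PC1={cx:.2f}, PC2={cy:.2f}")
```

Plotting `coords[:, 0]` against `coords[:, 1]` colored by `labels` gives the usual cluster scatter. The explained-variance ratio indicates how much structure the 2D view preserves.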
5. Modeling Results & Performance Evaluation
A. Regression Task (Predicting Interest Rates)
Feature Importance & Performance Analysis
- Winning Model: Random Forest Regressor.
- Performance: Achieved an R-squared (R²) of 0.99.
- Analysis: This score suggests that the model has largely recovered the lender's internal pricing logic. The Feature Importance plot highlights that the engineered features, specifically the Grade-DTI interaction, were the primary predictors.
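A regression setup of this shape, with an engineered Grade-DTI interaction and a feature-importance readout, is sketched below. Everything here is assumed for illustration: the ordinal grade encoding, the column names, and the synthetic target formula. The scores it prints say nothing about the reported 0.99.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
df = pd.DataFrame({
    "grade_num": rng.integers(1, 8, n),       # hypothetical encoding: A=1 ... G=7
    "dti": rng.uniform(0, 40, n),
    "annual_inc": rng.lognormal(11, 0.5, n),
})
# Engineered interaction feature (assumed form: simple product).
df["grade_dti"] = df["grade_num"] * df["dti"]

# Synthetic target, invented for this sketch only.
y = 5.0 + 2.0 * df["grade_num"] + 0.02 * df["grade_dti"] + rng.normal(0, 0.3, n)

X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.2, random_state=0
)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Evaluate on held-out data and rank features by impurity-based importance.
r2 = r2_score(y_test, model.predict(X_test))
importances = pd.Series(
    model.feature_importances_, index=df.columns
).sort_values(ascending=False)

print(f"Test R^2: {r2:.3f}")
print(importances)
```

Impurity-based importances can be split between correlated inputs such as a grade and its interaction term, so permutation importance on held-out data is a common complementary check.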
B. Classification Task (Risk Assessment)
I compared three models to identify the most reliable risk-detection tool.
Results Comparison Table:
| Model | Accuracy | Recall (Risk) | ROC-AUC |
|---|---|---|---|
| Logistic Regression | 89.9% | 83.1% | 0.88 |
| Random Forest (Winner) | 89.4% | 85.1% | 0.97 |
| Gradient Boosting | 90.8% | 83.4% | 0.95 |
Strategic Evaluation of Confusion Matrices
The Confusion Matrix allows for a detailed error analysis. In financial risk, the cost of a False Negative (failing to identify a high-risk borrower) is substantially higher than that of a False Positive. Therefore, the Random Forest model was prioritized for its superior Recall (85.1%) and ROC-AUC (0.97), even though its accuracy is slightly below that of the other two models.
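The trade-off between false negatives and false positives can be made concrete with a small worked example. The labels and scores below are hand-written toy values (1 = High Risk), not model outputs from this project. The two decision thresholds are likewise assumptions chosen to show the effect.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    recall_score,
    roc_auc_score,
)

# Toy ground truth and predicted risk scores (illustrative only).
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_score = np.array([0.10, 0.20, 0.30, 0.35, 0.60, 0.40, 0.45, 0.70, 0.80, 0.90])

# Default threshold of 0.5: one high-risk borrower is missed.
y_pred = (y_score >= 0.5).astype(int)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy = accuracy_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)

# ROC-AUC is threshold-independent and uses the raw scores.
auc = roc_auc_score(y_true, y_score)

# Lowering the threshold trades potential false positives for fewer false negatives.
y_pred_low = (y_score >= 0.42).astype(int)
recall_low = recall_score(y_true, y_pred_low)

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"accuracy={accuracy:.2f} recall={recall:.2f} auc={auc:.3f}")
print(f"recall at threshold 0.42: {recall_low:.2f}")
```

On this toy data the lower threshold recovers the missed high-risk case without adding false positives. On real data the same move usually costs some precision, which is the trade-off a recall-first selection accepts.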
6. Conclusion
The project confirms that while institutional grades are powerful, Feature Engineering and Clustering provide the critical edge. The improvement in R² from a 0.91 baseline to 0.99 shows the value of transforming raw data into domain-specific features. By identifying latent borrower personas, the analysis achieved a more nuanced and accurate understanding of credit risk.






