Lending Club Loan Analysis: Predictive Modeling and Risk Assessment

Data Science Project - Assignment 2 | Reichman University

Author: Shaked Rokach


1. Project Overview & Business Logic

This project presents an end-to-end data science pipeline applied to a Lending Club dataset of over 890,000 loan records (2007-2015). The core objective is to build a modeling system that supports Risk-Based Pricing.

In the banking sector, the ability to accurately predict interest rates and classify risk directly affects the profitability and stability of the institution. The project addresses this with two interconnected systems:

  • Regression Layer: Predicting the precise interest rate based on complex borrower profiles.
  • Classification Layer: Implementing a binary decision-making tool to identify "High Risk" vs. "Low Risk" applicants.

The analysis moves beyond basic modeling by integrating Unsupervised Learning (Clustering) to uncover latent borrower segments, which are then used as high-signal features for supervised models.

2. Research Objective

To what extent can domain-driven feature engineering and borrower segmentation improve the precision of loan pricing and risk classification in a non-linear financial environment?


3. Data Integrity & Preprocessing Pipelines

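Handling nearly a million records required a strict ("Zero-Error") cleaning and preprocessing strategy. To ensure reproducibility and prevent data leakage, I used Scikit-Learn Pipelines and ColumnTransformers, which allow imputation, encoding, and scaling statistics to be learned from the training data only and then applied unchanged to held-out data.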

Preprocessing Workflow:

  • Numerical Features: Handled via Median Imputation to ensure robustness against financial outliers.
  • Categorical Features: Applied One-Hot Encoding to variables such as home ownership and verification status to transform them into a format suitable for machine learning algorithms.
  • Scaling: Features were standardized using StandardScaler to ensure that the scale of variables (e.g., annual income vs. loan amount) does not bias the model's coefficients.
  • The "Outlier" Philosophy: I made a deliberate decision to retain high-income outliers. In credit markets, high earners represent a legitimate and high-value segment; removing them would produce a model that has never seen prime customers and therefore fails to recognize their behavior.
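
The workflow above can be sketched as a single ColumnTransformer. This is a minimal illustration, not the project's actual code: the column names and the toy rows are assumptions, and `handle_unknown="ignore"` is a choice made here so that unseen categories in held-out data do not raise an error.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical column names; the real dataset's columns may differ.
numeric_cols = ["annual_inc", "loan_amnt", "dti"]
categorical_cols = ["home_ownership", "verification_status"]

# Numerical branch: median imputation, then standardization.
numeric_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

preprocessor = ColumnTransformer(
    [
        ("num", numeric_pipe, numeric_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    sparse_threshold=0.0,  # always return a dense array
)

# Toy data with missing values and one very high income (kept, not removed).
train = pd.DataFrame({
    "annual_inc": [50_000.0, 80_000.0, np.nan, 1_000_000.0],
    "loan_amnt": [10_000.0, 20_000.0, 15_000.0, 35_000.0],
    "dti": [10.0, 25.0, 18.0, np.nan],
    "home_ownership": ["RENT", "OWN", "MORTGAGE", "RENT"],
    "verification_status": ["Verified", "Not Verified", "Verified", "Source Verified"],
})
test = pd.DataFrame({
    "annual_inc": [60_000.0],
    "loan_amnt": [12_000.0],
    "dti": [np.nan],
    "home_ownership": ["OTHER"],  # category never seen during fit
    "verification_status": ["Verified"],
})

# Statistics (medians, means, categories) come from the training rows only.
X_train = preprocessor.fit_transform(train)
X_test = preprocessor.transform(test)
```

Because `fit_transform` is called on the training rows and only `transform` on the held-out rows, no information from the test set reaches the imputer or the scaler.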

[Figure: Heatmap]

Correlation Heatmap Analysis

The Heatmap visualizes the Pearson correlation coefficients between numerical variables.

  • Purpose: This analysis was critical for identifying multicollinearity and determining the primary signals for interest rate determination.
  • Key Insight: While the institutional grade is the strongest correlate of the interest rate, the analysis suggested that its interaction with the debt-to-income (DTI) ratio is non-linear, which motivated the move to ensemble models.
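
As a rough illustration of this kind of analysis, the sketch below computes a Pearson correlation matrix with pandas, picks the strongest correlate of the interest rate, and flags highly correlated pairs as multicollinearity candidates. The data is synthetic and the column names and the 0.8 threshold are assumptions, not values taken from the project.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-in for the numerical loan features (hypothetical names).
df = pd.DataFrame({
    "grade_num": rng.integers(1, 8, size=n),        # grades encoded 1..7
    "dti": rng.uniform(0, 40, size=n),
    "annual_inc": rng.lognormal(11, 0.5, size=n),
    "loan_amnt": rng.uniform(1_000, 35_000, size=n),
})
df["installment"] = df["loan_amnt"] * 0.03 + rng.normal(0, 20, n)
df["int_rate"] = 4 + 2.5 * df["grade_num"] + 0.05 * df["dti"] + rng.normal(0, 1, n)

# Pearson correlation between every pair of numerical columns.
corr = df.corr(method="pearson")

# Strongest linear signal for the target.
top_signal = corr["int_rate"].drop("int_rate").abs().idxmax()

# Pairs above the (assumed) threshold, using the upper triangle to skip duplicates.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack()
high_pairs = pairs[pairs.abs() > 0.8]
```

Note that Pearson coefficients only measure linear association, so a non-linear interaction has to be checked separately, for example by comparing a linear model against a tree-based one.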

4. Feature Engineering & Unsupervised Learning

Domain-specific feature engineering was applied to simulate the analytical process of a human credit analyst.
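
One way such a feature could be built is shown below. This is a sketch under assumptions: the exact form of the Grade-DTI interaction used in the project is not documented here, so the ordinal encoding of grades A to G and the simple product are illustrative choices, and the helper name is hypothetical.

```python
import pandas as pd

# Lending Club grades run from A (best) to G (worst); the 1..7 encoding is an assumption.
GRADE_ORDER = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5, "F": 6, "G": 7}


def add_grade_dti_interaction(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with an ordinal grade and a grade x DTI feature."""
    out = df.copy()
    out["grade_num"] = out["grade"].map(GRADE_ORDER)
    # Interaction term: the same DTI weighs more heavily for weaker grades.
    out["grade_dti"] = out["grade_num"] * out["dti"]
    return out


loans = pd.DataFrame({"grade": ["A", "C", "G"], "dti": [10.0, 20.0, 30.0]})
features = add_grade_dti_interaction(loans)
```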

[Figure: Elbow Plot]

The Elbow Method: Choosing K

To segment the borrower population, I implemented K-Means clustering. The Elbow Plot tracks "Inertia" (the sum of squared distances from each point to its assigned cluster centroid) as the number of clusters K increases.

  • Result: The elbow of the curve supported K=3 as the number of clusters: it captures most of the reduction in inertia, and adding further clusters yields only marginal gains.
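
The elbow procedure can be sketched as a loop over candidate values of K. The example uses synthetic blobs in place of the borrower features, so the range of K, the seed, and the data are assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled borrower features (three true groups).
X, _ = make_blobs(n_samples=600, centers=3, cluster_std=1.0, random_state=42)
X = StandardScaler().fit_transform(X)

# Fit K-Means for each candidate K and record the inertia.
inertias = {}
for k in range(1, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias[k] = model.inertia_

# The "elbow" is where the improvement from adding one more cluster drops sharply.
drops = {k: inertias[k - 1] - inertias[k] for k in range(2, 9)}
```

Plotting `inertias` against K gives the Elbow Plot. The elbow is a heuristic, so it is worth checking the chosen K against how interpretable the resulting segments are.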

[Figure: PCA]

PCA: Dimensionality Reduction & Cluster Verification

Since high-dimensional clusters cannot be visualized directly, I used Principal Component Analysis (PCA) to project the data into a 2D space and check visually that the clusters separate.

  • Cluster Personas:
    • Cluster 0 (Prime Borrowers): High income, long credit history, and low debt-to-income ratios.
    • Cluster 1 (High Leverage): Significant existing debt levels relative to income.
    • Cluster 2 (Emerging Credit): Younger credit histories with moderate financial stability.
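
The projection step, and the use of cluster labels as an extra input for the supervised models, might look like the sketch below. The data is synthetic and the feature count and seeds are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for six scaled borrower features.
X, _ = make_blobs(n_samples=300, centers=3, n_features=6, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

# Cluster in the full feature space with K=3.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)

# Project to two principal components for plotting only.
pca = PCA(n_components=2, random_state=0)
coords = pca.fit_transform(X_scaled)
explained = pca.explained_variance_ratio_.sum()

# Append the cluster label as an additional feature for downstream models.
X_with_cluster = np.column_stack([X_scaled, labels])
```

A scatter plot of `coords` colored by `labels` gives the 2D cluster view. The value of `explained` shows how much of the variance the two components retain, which indicates how far the picture can be trusted.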

5. Modeling Results & Performance Evaluation

A. Regression Task (Predicting Interest Rates)

[Figure: Feature Importance]

Feature Importance & Performance Analysis

  • Winning Model: Random Forest Regressor.
  • Performance: Achieved an R-squared (R2) of 0.99.
  • Analysis: This score suggests that the model closely reproduces the lender's internal pricing logic. The Feature Importance plot highlights that the engineered features, specifically the Grade-DTI interaction, were the primary predictors.
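
A minimal sketch of this evaluation is shown below: fit a Random Forest Regressor, score it with R2 on held-out rows, and rank the feature importances. The data is synthetic and the feature names and hyperparameters are assumptions, so the numbers it produces are unrelated to the results reported above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000

# Synthetic borrower features (hypothetical names).
X = pd.DataFrame({
    "grade_num": rng.integers(1, 8, size=n),
    "dti": rng.uniform(0, 40, size=n),
    "annual_inc": rng.lognormal(11, 0.5, size=n),
})
X["grade_dti"] = X["grade_num"] * X["dti"]

# Synthetic interest rate driven by the grade and the interaction term.
y = 5 + 2 * X["grade_num"] + 0.01 * X["grade_dti"] + rng.normal(0, 0.5, n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# R2 is computed on rows the model has not seen.
r2 = r2_score(y_test, model.predict(X_test))

# Impurity-based importances, sorted from most to least important.
importances = pd.Series(
    model.feature_importances_, index=X.columns
).sort_values(ascending=False)
```

When the target is largely determined by one input, as the interest rate is by the grade, a very high R2 is expected, so comparing against a grade-only baseline helps show what the additional features contribute.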

B. Classification Task (Risk Assessment)

I iteratively compared three models to identify the most reliable risk-detection tool.

Results Comparison Table:

Model                  | Accuracy | Recall (Risk) | ROC-AUC
Logistic Regression    | 89.9%    | 83.1%         | 0.88
Random Forest (Winner) | 89.4%    | 85.1%         | 0.97
Gradient Boosting      | 90.8%    | 83.4%         | 0.95
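
A comparison of this kind can be sketched as a loop that fits each model on the same split and collects the three metrics. The data below is synthetic and the hyperparameters are assumptions, so the scores will not match the table.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in: class 1 plays the role of "High Risk".
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=5,
    weights=[0.7, 0.3], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)               # hard labels for accuracy and recall
    proba = model.predict_proba(X_te)[:, 1]  # risk-class scores for ROC-AUC
    results[name] = {
        "accuracy": accuracy_score(y_te, pred),
        "recall_risk": recall_score(y_te, pred, pos_label=1),
        "roc_auc": roc_auc_score(y_te, proba),
    }
```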

[Figures: Confusion Matrix 1, Confusion Matrix 2, Confusion Matrix 3]

Strategic Evaluation of Confusion Matrices

The Confusion Matrices allow for a detailed error analysis. In financial risk, the cost of a False Negative (failing to identify a high-risk borrower) is typically far higher than that of a False Positive (flagging a low-risk borrower as risky). Therefore, the Random Forest model was prioritized for its superior Recall (85.1%), which reduces the number of missed high-risk borrowers, even though Gradient Boosting had slightly higher Accuracy (90.8% vs. 89.4%).
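
To make the trade-off concrete, the sketch below reads the four cells of a confusion matrix, derives Recall from them, and applies an asymmetric error cost. The ten labels and the 5:1 cost ratio are hypothetical values chosen for illustration only.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels: 1 = High Risk, 0 = Low Risk.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 0])

# With labels=[0, 1], ravel() returns the cells in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

# Recall for the risk class: share of true high-risk borrowers that were caught.
recall = tp / (tp + fn)

# Hypothetical costs: a missed high-risk borrower is 5x as costly as a false alarm.
COST_FN = 5
COST_FP = 1
total_cost = fn * COST_FN + fp * COST_FP
```

Here one missed high-risk borrower and one false alarm give a Recall of 0.75 and a total cost of 6, of which 5 comes from the single False Negative.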


6. Conclusion

The project confirms that while institutional grades are powerful predictors, Feature Engineering and Clustering add measurable value. The improvement in R2 from a 0.91 baseline to 0.99 showcases the value of transforming raw data into domain-specific features. Identifying latent borrower personas gave a more nuanced and accurate understanding of credit risk.
