---
license: mit
language:
- en
- ru
pipeline_tag: tabular-classification
tags:
- credit-scoring
- catboost
- lightgbm
- polars
- tabular
- binary-classification
metrics:
- roc_auc
---

Credit Risk Prediction Model

Description

Machine learning model for predicting bank client defaults. This model uses an ensemble of CatBoost and LightGBM with advanced feature engineering to assess credit risk.

Business Context

The project builds a credit risk assessment system for the banking sector. The primary goal is to reduce bank losses by automating the prediction of each client's default probability.


Model Performance

| Metric | Value |
|--------|-------|
| **ROC-AUC** | 0.7523 |
| **Target KPI** | 0.75 |
| **Status** | ✅ Achieved |
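
ROC-AUC depends only on how the scores rank defaulters relative to non-defaulters, not on the score values themselves. The sketch below computes it from ranks (the Mann-Whitney formulation) and compares it with the 0.75 target. The labels and scores are toy values invented for illustration and are not taken from the model's data; `roc_auc_from_ranks` is a hypothetical helper name. `sklearn.metrics.roc_auc_score` computes the same quantity.

```python
import numpy as np
from scipy.stats import rankdata


def roc_auc_from_ranks(y_true, y_score):
    """ROC-AUC via the Mann-Whitney U statistic (ties get average ranks)."""
    y_true = np.asarray(y_true)
    ranks = rankdata(y_score, method="average")
    n_pos = int((y_true == 1).sum())
    n_neg = int((y_true == 0).sum())
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    # Sum of positive ranks minus the smallest possible sum of positive ranks
    u_stat = ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2
    return u_stat / (n_pos * n_neg)


# Toy data for illustration only (not the model's validation set)
y_toy = [0, 0, 0, 1, 1]
score_toy = [0.1, 0.2, 0.6, 0.5, 0.9]

auc = roc_auc_from_ranks(y_toy, score_toy)
target_kpi = 0.75
print(f"ROC-AUC = {auc:.4f}, meets target: {auc >= target_kpi}")
```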


Tech Stack

- **Language**: Python 3.10
- **Big Data Processing**: Polars (Lazy Loading)
- **Machine Learning**: 
  - CatBoost (weight: 0.05)
  - LightGBM (weight: 0.95)
- **Infrastructure**: GPU acceleration (NVIDIA RTX 3050)
- **Tools**: Scikit-learn, Scipy, Pandas, Matplotlib, Seaborn


Dataset

- **Records**: 3,000,000
- **Files**: 12 Parquet files
- **Size**: 4.5 GB
- **Class Imbalance**: 1:49 (2% positive class)
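
A common heuristic for weighting the positive class in gradient boosting is the ratio of negative to positive examples. The sketch below computes that ratio on synthetic labels with the same 2% positive rate; `class_ratio_weight` is a hypothetical helper. For a 1:49 imbalance the heuristic gives 49, whereas this model card reports a `scale_pos_weight` of 27.18, so the author's value was evidently chosen differently and this code should not be read as the method used.

```python
import numpy as np


def class_ratio_weight(y):
    """Return n_negative / n_positive, a common starting point for scale_pos_weight."""
    y = np.asarray(y)
    n_pos = int((y == 1).sum())
    n_neg = int((y == 0).sum())
    if n_pos == 0:
        raise ValueError("no positive examples in y")
    return n_neg / n_pos


# Synthetic labels with a 2% positive rate (1 positive per 49 negatives)
y_synthetic = np.array([1] * 2 + [0] * 98)

weight = class_ratio_weight(y_synthetic)
print(f"positive rate: {y_synthetic.mean():.2%}, ratio weight: {weight}")
```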


Key Features

Over 170 engineered features including:
- `utilization_ratio`: credit limit usage level
- `overdue_ratio`: share of overdue debt
- `delays_per_loan`: frequency of critical delays (90+ days)


Usage

Installation

```bash
pip install -r requirements.txt
```

```python
import joblib
import polars as pl

# Load model
model = joblib.load("final_pipeline.pkl")

# Load data
df = pl.read_parquet("client_data.parquet")

# Make predictions
predictions = model.predict(df)
probabilities = model.predict_proba(df)

# Results
print(f"Default probability: {probabilities[:, 1]}")
```


```python
from huggingface_hub import hf_hub_download
import joblib

# Download model
model_path = hf_hub_download(
    repo_id="maxdavinci/Credit_Risk_Prediction_Model_0.75",
    filename="final_pipeline.pkl"
)

# Load and use
model = joblib.load(model_path)
```


Engineering Solutions

- **Scalability**: Polars for efficient Big Data processing
- **Class Imbalance**: Stratified validation + `scale_pos_weight` (27.18)
- **Ensembling**: Rank Averaging method for stability
- **Production Ready**: Custom `CreditEnsemble` class compatible with `sklearn.pipeline`


Project Structure

Credit_Risk_Prediction_Model_0.75/
├── credit_risk_modeling.ipynb  # Jupyter notebook with code
├── final_pipeline.pkl          # Trained model (90 MB)
├── requirements.txt            # Dependencies
└── README.md                   # This file


Links

- **GitHub Repository**: https://github.com/maxdavinci2022/Credit_Risk_Prediction_Model_0.75
- **Author**: @maxdavinci2022