| | --- |
| | license: mit |
| | language: |
| | - en |
| | - ru |
| | pipeline_tag: tabular-classification |
| | tags: |
| | - credit-scoring |
| | - catboost |
| | - lightgbm |
| | - polars |
| | - tabular |
| | - binary-classification |
| | metrics: |
| | - roc_auc |
| | --- |
| | |
| | Credit Risk Prediction Model |
| |
|
| | Description |
| |
|
| | A machine learning model that predicts bank client defaults. It assesses credit risk with a weighted ensemble of CatBoost and LightGBM trained on more than 170 engineered features. |
| |
|
| | Business Context |
| |
|
| | The project builds a high-performance credit risk assessment system for the banking sector. Its primary goal is to minimize bank losses by automating the prediction of client default probability. |
| |
|
| |
|
| | Model Performance |
| |
|
| | | Metric | Value | |
| | |--------|-------| |
| | | **ROC-AUC** | 0.7523 | |
| | | **Target KPI** | 0.75 | |
| | | **Status** | Achieved | |
| |
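ROC-AUC can be read as the probability that a randomly chosen defaulter receives a higher score than a randomly chosen non-defaulter. The sketch below illustrates that reading on four made-up labels and scores (not data from this project) and cross-checks it against `sklearn.metrics.roc_auc_score`. `pairwise_auc` is a hypothetical helper written only for this illustration.

```python
from sklearn.metrics import roc_auc_score


def pairwise_auc(y_true, y_score):
    """Share of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up example: 1 = default, 0 = no default
y_true = [0, 0, 1, 1]
y_score = [0.10, 0.40, 0.35, 0.80]

manual_auc = pairwise_auc(y_true, y_score)
sklearn_auc = roc_auc_score(y_true, y_score)
print(manual_auc, sklearn_auc)
```

Because the metric depends only on the ordering of scores, it is unaffected by the class imbalance in the raw counts and by any monotonic rescaling of the model output.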
|
| |
|
| | Tech Stack |
| |
|
| | - **Language**: Python 3.10 |
| | - **Big Data Processing**: Polars (lazy evaluation) |
| | - **Machine Learning**: |
| |   - CatBoost (weight: 0.05) |
| |   - LightGBM (weight: 0.95) |
| | - **Infrastructure**: GPU acceleration (NVIDIA RTX 3050) |
| | - **Tools**: Scikit-learn, Scipy, Pandas, Matplotlib, Seaborn |
| |
|
| |
|
| | Dataset |
| |
|
| | - **Records**: 3,000,000 |
| | - **Files**: 12 Parquet files |
| | - **Size**: 4.5 GB |
| | - **Class Imbalance**: 1:49 (2% positive class) |
| |
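With a 1:49 imbalance, two common precautions are validation folds that preserve the positive rate and a larger weight on the positive class. The sketch below uses synthetic labels with the same ratio (not project data) and shows the conventional starting point for `scale_pos_weight`, the ratio of negatives to positives. A tuned value, such as the 27.18 listed under Engineering Solutions, can differ from this raw ratio.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic labels with a 1:49 positive-to-negative ratio (2% positives)
y = np.array([1] * 20 + [0] * 980)
X = np.zeros((len(y), 1))  # placeholder features; only the labels drive the split

# Conventional starting point for scale_pos_weight: negatives / positives
raw_pos_weight = (y == 0).sum() / (y == 1).sum()

# Stratified folds keep the positive rate of every validation fold at about 2%
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_positive_rates = [float(y[valid_idx].mean()) for _, valid_idx in skf.split(X, y)]

print(raw_pos_weight, fold_positive_rates)
```

Without stratification, a random fold of rare positives can end up with noticeably more or fewer defaults than the full dataset, which makes fold-level ROC-AUC noisier.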
|
| |
|
| | Key Features |
| |
|
| | Over 170 engineered features including: |
| | - `utilization_ratio`: credit limit usage level |
| | - `overdue_ratio`: share of overdue debt |
| | - `delays_per_loan`: frequency of critical delays (90+ days) |
| |
|
| |
|
| | Usage |
| |
|
| | Installation |
| |
|
| | ```bash |
| | pip install -r requirements.txt |
| | ``` |
| |
|
| | ```python |
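| | import joblib |
| | import polars as pl |
| | |
| | # Load the pickled pipeline. Unpickling needs the custom CreditEnsemble class |
| | # to be importable, along with the packages from requirements.txt. |
| | model = joblib.load("final_pipeline.pkl") |
| | |
| | # Load client data |
| | df = pl.read_parquet("client_data.parquet") |
| | |
| | # Predict class labels and class probabilities. |
| | # If the pipeline expects pandas input, pass df.to_pandas() instead. |
| | predictions = model.predict(df) |
| | probabilities = model.predict_proba(df) |
| | |
| | # Column 1 holds the score for the positive (default) class |
| | print(f"Default probability: {probabilities[:, 1]}") |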
| | ``` |
| |
|
| |
|
| | ```python |
| | from huggingface_hub import hf_hub_download |
| | import joblib |
| | |
| | # Download model |
| | model_path = hf_hub_download( |
| |     repo_id="maxdavinci/Credit_Risk_Prediction_Model_0.75", |
| |     filename="final_pipeline.pkl", |
| | ) |
| | |
| | # Load and use |
| | model = joblib.load(model_path) |
| | ``` |
| |
|
| |
|
| | Engineering Solutions |
| |
|
| | - **Scalability**: Polars for efficient Big Data processing |
| | - **Class Imbalance**: stratified validation and `scale_pos_weight` (27.18) |
| | - **Ensembling**: rank averaging for stability |
| | - **Production Ready**: custom `CreditEnsemble` class compatible with `sklearn.pipeline` |
| | |
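To make the rank averaging idea concrete, here is a minimal sketch: each model's scores are converted to normalized ranks and the ranks are combined with the 0.05 / 0.95 weights from the Tech Stack. `rank_average` and `RankBlendClassifier` are hypothetical names written for this illustration; the class is a stand-in and not the project's `CreditEnsemble`. Plain scikit-learn models replace CatBoost and LightGBM so the example stays self-contained, and all scores and data are made up.

```python
import numpy as np
from scipy.stats import rankdata
from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier


def rank_average(score_sets, weights):
    """Weighted mean of normalized ranks; tied scores share their average rank."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    blended = np.zeros(len(score_sets[0]))
    for scores, weight in zip(score_sets, weights):
        blended += weight * rankdata(scores) / len(scores)
    return blended


class RankBlendClassifier(ClassifierMixin, BaseEstimator):
    """Hypothetical stand-in, NOT the project's CreditEnsemble class."""

    def __init__(self, estimators, weights):
        self.estimators = estimators
        self.weights = weights

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.fitted_ = [clone(est).fit(X, y) for est in self.estimators]
        return self

    def predict_proba(self, X):
        scores = [est.predict_proba(X)[:, 1] for est in self.fitted_]
        blended = rank_average(scores, self.weights)
        return np.column_stack([1.0 - blended, blended])


# Worked example with made-up scores for four clients
catboost_scores = [0.10, 0.40, 0.35, 0.80]
lightgbm_scores = [0.02, 0.03, 0.90, 0.95]
blended = rank_average([catboost_scores, lightgbm_scores], weights=[0.05, 0.95])

# The wrapper drops into a Pipeline like any other final estimator
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 1.0).astype(int)
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("ensemble", RankBlendClassifier(
        estimators=[LogisticRegression(), DecisionTreeClassifier(max_depth=3, random_state=0)],
        weights=[0.05, 0.95],
    )),
])
proba = pipeline.fit(X, y).predict_proba(X)
```

Rank-blended scores preserve ordering, which is what ROC-AUC measures, but they are not calibrated probabilities, and they depend on which rows are scored together in a batch.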
| |
|
| | Project Structure |
| |
|
| | Credit_Risk_Prediction_Model_0.75/ |
| | ├── credit_risk_modeling.ipynb # Jupyter notebook with code |
| | ├── final_pipeline.pkl # Trained model (90 MB) |
| | ├── requirements.txt # Dependencies |
| | └── README.md # This file |
| | |
| | |
| | Links |
| | |
| | - **GitHub Repository**: https://github.com/maxdavinci2022/Credit_Risk_Prediction_Model_0.75 |
| | - **Author**: @maxdavinci2022 |