FinRisk-AI / docs /02_baseline.md
iremrit's picture
Upload 36 files
95409ed verified
# Baseline Model Documentation
## 1. Pipeline Overview
The baseline model has been upgraded to replicate a high-performing preprocessing pipeline, significantly improving upon the initial minimal baseline.
### Preprocessing
- **Dropped Columns:** `ID`, `Customer_ID`, `Name`, `SSN`, `Credit_Score` (Target).
- **Data Cleaning:**
- **Numeric Parsing:** Cleaned `Age`, `Annual_Income`, `Outstanding_Debt`, `Num_of_Delayed_Payment`, `Num_of_Loan`, `Amount_invested_monthly`, `Monthly_Balance`, `Changed_Credit_Limit` (removed `_`, `,` and handled empty strings).
- **Credit_History_Age:** Parsed "X Years Y Months" into total months.
- **Imputation & Scaling (Numeric):**
- `SimpleImputer(strategy='median')`
- `StandardScaler()`
- **Encoding (Categorical):**
- `SimpleImputer(strategy='most_frequent')`
- `OneHotEncoder(handle_unknown='ignore')`
- Target (`Credit_Score`): Label Encoded.
### Model
- **Algorithm:** Logistic Regression
- **Parameters:** `max_iter=1000`, `class_weight='balanced'`, `random_state=42`
- **Validation:** Stratified K-Fold Cross-Validation (5 Splits).
## 2. Performance Results
| Metric | Score |
| :--- | :--- |
| **Mean Accuracy** | **0.7211** (+/- 0.0020) |
| **Mean ROC-AUC** | **0.8647** |
### Fold-by-Fold Breakdown
| Fold | Accuracy | ROC-AUC |
| :--- | :--- | :--- |
| Fold 1 | 0.7228 | 0.8660 |
| Fold 2 | 0.7238 | 0.8647 |
| Fold 3 | 0.7180 | 0.8642 |
| Fold 4 | 0.7202 | 0.8635 |
| Fold 5 | 0.7204 | 0.8648 |
## 3. Key Findings
- **Significant Improvement:** Accuracy improved from ~62% to ~72.1% by correctly handling dirty numeric columns (Age, Annual_Income, etc.) and using a robust preprocessing pipeline.
- **Robustness:** Stratified K-Fold CV (5 splits) ensures the results are stable with low variance (+/- 0.0020), indicating the model generalizes well.
- **Strong Discrimination:** ROC-AUC of 0.8647 shows the model effectively distinguishes between credit score classes despite being a simple linear model.
- **Remaining Gap:** The target performance is ~80%. The 9% gap can be closed through advanced feature engineering (e.g., customer-level aggregation, feature interactions, loan type splitting).
- **Next Steps:** Implement advanced feature engineering with non-linear models (Random Forest, XGBoost) to leverage complex feature relationships.