Baseline Model Documentation
1. Pipeline Overview
The baseline model has been upgraded to replicate a high-performing preprocessing pipeline, significantly improving upon the initial minimal baseline.
Preprocessing
- Dropped Columns: ID, Customer_ID, Name, SSN, Credit_Score (Target).
- Data Cleaning:
  - Numeric Parsing: Cleaned Age, Annual_Income, Outstanding_Debt, Num_of_Delayed_Payment, Num_of_Loan, Amount_invested_monthly, Monthly_Balance, Changed_Credit_Limit (removed "_" and "," characters and handled empty strings).
  - Credit_History_Age: Parsed "X Years Y Months" into total months.
- Imputation & Scaling (Numeric): SimpleImputer(strategy='median') followed by StandardScaler()
- Encoding (Categorical): SimpleImputer(strategy='most_frequent') followed by OneHotEncoder(handle_unknown='ignore')
- Target (Credit_Score): Label Encoded.
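The cleaning and preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the helper names (clean_numeric, credit_history_to_months, build_preprocessor), the toy rows, and the tolerance for an optional "and" in the Credit_History_Age text are all assumptions.

```python
import re

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def clean_numeric(series: pd.Series) -> pd.Series:
    """Strip stray underscores and commas, then coerce to float (blanks become NaN)."""
    cleaned = series.astype(str).str.replace(r"[_,]", "", regex=True).str.strip()
    return pd.to_numeric(cleaned, errors="coerce")


def credit_history_to_months(value) -> float:
    """Convert text such as '5 Years 3 Months' into a total number of months."""
    if not isinstance(value, str):
        return np.nan
    match = re.search(r"(\d+)\s*Years?\D*(\d+)\s*Months?", value)
    if match is None:
        return np.nan
    years, months = map(int, match.groups())
    return float(years * 12 + months)


def build_preprocessor(numeric_cols, categorical_cols) -> ColumnTransformer:
    """Median impute + scale numeric columns; mode impute + one-hot categorical ones."""
    numeric = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ])
    categorical = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ])
    return ColumnTransformer([
        ("num", numeric, numeric_cols),
        ("cat", categorical, categorical_cols),
    ])


# Toy rows (hypothetical values) showing the kinds of dirt described above.
demo = pd.DataFrame({
    "Age": ["28_", "34", ""],
    "Annual_Income": ["19,114.12", "34847.84_", "50000"],
    "Credit_History_Age": ["22 Years and 1 Months", "5 Years 3 Months", None],
    "Occupation": ["Scientist", np.nan, "Teacher"],
})
for col in ["Age", "Annual_Income"]:
    demo[col] = clean_numeric(demo[col])
demo["Credit_History_Age"] = demo["Credit_History_Age"].map(credit_history_to_months)

numeric_cols = ["Age", "Annual_Income", "Credit_History_Age"]
features = build_preprocessor(numeric_cols, ["Occupation"]).fit_transform(demo)
print(features.shape)
```

Wrapping imputation and scaling in a ColumnTransformer means the statistics (medians, modes, scaling parameters) are learned only from the training portion of each fold when the transformer is placed inside a cross-validated pipeline.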
Model
- Algorithm: Logistic Regression
- Parameters: max_iter=1000, class_weight='balanced', random_state=42
- Validation: Stratified K-Fold Cross-Validation (5 Splits).
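The model and validation setup can be sketched along these lines. The data here is a synthetic 3-class stand-in generated with make_classification, and the fold shuffling and the one-vs-rest ROC-AUC averaging ("roc_auc_ovr") are assumptions, since the documentation does not state either; only the Logistic Regression parameters and the 5 stratified splits come from the description above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the cleaned credit data (assumption for this demo).
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5, n_classes=3, random_state=42
)

# Scaling lives inside the pipeline so it is refit on each training fold.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, class_weight="balanced", random_state=42),
)

# shuffle=True is an assumption; the source only specifies 5 stratified splits.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_validate(model, X, y, cv=cv, scoring=["accuracy", "roc_auc_ovr"])

acc = scores["test_accuracy"]
auc = scores["test_roc_auc_ovr"]
print(f"Mean accuracy: {acc.mean():.4f} (+/- {acc.std():.4f})")
print(f"Mean ROC-AUC:  {auc.mean():.4f}")
```

In the real pipeline, the StandardScaler step would be replaced by the full numeric and categorical preprocessor, and y would be the label-encoded Credit_Score.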
2. Performance Results
| Metric | Score |
|---|---|
| Mean Accuracy | 0.7211 (+/- 0.0020) |
| Mean ROC-AUC | 0.8647 |
Fold-by-Fold Breakdown
| Fold | Accuracy | ROC-AUC |
|---|---|---|
| Fold 1 | 0.7228 | 0.8660 |
| Fold 2 | 0.7238 | 0.8647 |
| Fold 3 | 0.7180 | 0.8642 |
| Fold 4 | 0.7202 | 0.8635 |
| Fold 5 | 0.7204 | 0.8648 |
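As a quick consistency check, the summary figures can be recomputed from the fold table. The sketch below assumes the "+/-" value is the population standard deviation of fold accuracy (NumPy's default, ddof=0); because the table shows scores rounded to four decimals, the recomputed values should agree with the reported ones only to within rounding.

```python
import numpy as np

# Fold scores copied from the table above.
fold_accuracy = np.array([0.7228, 0.7238, 0.7180, 0.7202, 0.7204])
fold_roc_auc = np.array([0.8660, 0.8647, 0.8642, 0.8635, 0.8648])

mean_acc = fold_accuracy.mean()
std_acc = fold_accuracy.std()  # population std (ddof=0), an assumption
mean_auc = fold_roc_auc.mean()

print(f"Mean accuracy: {mean_acc:.4f} (+/- {std_acc:.4f})")
print(f"Mean ROC-AUC:  {mean_auc:.4f}")
```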
3. Key Findings
- Significant Improvement: Accuracy improved from ~62% to ~72.1% by correctly handling dirty numeric columns (Age, Annual_Income, etc.) and using a robust preprocessing pipeline.
- Robustness: Accuracy varies little across the 5 stratified folds (+/- 0.0020), indicating that the estimate is stable and does not depend on a particular split.
- Strong Discrimination: ROC-AUC of 0.8647 shows the model effectively distinguishes between credit score classes despite being a simple linear model.
- Remaining Gap: The target performance is ~80%, leaving a gap of roughly 8 percentage points. Advanced feature engineering (e.g., customer-level aggregation, feature interactions, loan type splitting) may help close it.
- Next Steps: Implement advanced feature engineering with non-linear models (Random Forest, XGBoost) to leverage complex feature relationships.
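To make two of the proposed feature-engineering ideas concrete, here is a minimal sketch of customer-level aggregation and a feature interaction using pandas. The rows are toy values, and the new column names (Debt_mean_by_customer, Debt_to_income) are hypothetical; none of this has been evaluated against the baseline.

```python
import pandas as pd

# Toy rows (hypothetical values): several records per customer.
df = pd.DataFrame({
    "Customer_ID": ["C1", "C1", "C2", "C2"],
    "Outstanding_Debt": [800.0, 820.0, 150.0, 130.0],
    "Annual_Income": [19000.0, 19000.0, 52000.0, 52000.0],
})

# Customer-level aggregation: broadcast each customer's mean debt to their rows.
df["Debt_mean_by_customer"] = (
    df.groupby("Customer_ID")["Outstanding_Debt"].transform("mean")
)

# Feature interaction: debt relative to income.
df["Debt_to_income"] = df["Outstanding_Debt"] / df["Annual_Income"]

print(df)
```

Note that aggregation of this kind needs Customer_ID, so it would have to run before that column is dropped, and the group statistics should be computed within each training fold to avoid leaking validation rows.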