# Baseline Model Documentation

## 1. Pipeline Overview

The baseline model has been upgraded to replicate a high-performing preprocessing pipeline, significantly improving upon the initial minimal baseline.

### Preprocessing

- **Dropped Columns:** `ID`, `Customer_ID`, `Name`, `SSN`, `Credit_Score` (Target).
- **Data Cleaning:**
  - **Numeric Parsing:** Cleaned `Age`, `Annual_Income`, `Outstanding_Debt`, `Num_of_Delayed_Payment`, `Num_of_Loan`, `Amount_invested_monthly`, `Monthly_Balance`, `Changed_Credit_Limit` (removed `_` and `,`, and handled empty strings).
  - **Credit_History_Age:** Parsed "X Years Y Months" into total months.
- **Imputation & Scaling (Numeric):**
  - `SimpleImputer(strategy='median')`
  - `StandardScaler()`
- **Encoding (Categorical):**
  - `SimpleImputer(strategy='most_frequent')`
  - `OneHotEncoder(handle_unknown='ignore')`
- **Target (`Credit_Score`):** Label encoded.

### Model

- **Algorithm:** Logistic Regression
- **Parameters:** `max_iter=1000`, `class_weight='balanced'`, `random_state=42`
- **Validation:** Stratified K-Fold Cross-Validation (5 splits).

## 2. Performance Results

| Metric | Score |
| :--- | :--- |
| **Mean Accuracy** | **0.7211** (+/- 0.0020) |
| **Mean ROC-AUC** | **0.8647** |

### Fold-by-Fold Breakdown

| Fold | Accuracy | ROC-AUC |
| :--- | :--- | :--- |
| Fold 1 | 0.7228 | 0.8660 |
| Fold 2 | 0.7238 | 0.8647 |
| Fold 3 | 0.7180 | 0.8642 |
| Fold 4 | 0.7202 | 0.8635 |
| Fold 5 | 0.7204 | 0.8648 |

## 3. Key Findings

- **Significant Improvement:** Accuracy improved from ~62% to ~72.1% by correctly handling dirty numeric columns (`Age`, `Annual_Income`, etc.) and using a robust preprocessing pipeline.
- **Robustness:** Accuracy varies little across the 5 stratified folds (+/- 0.0020), which suggests the estimate is stable and the model generalizes consistently across splits.
- **Strong Discrimination:** A mean ROC-AUC of 0.8647 shows the model effectively distinguishes between credit score classes despite being a simple linear model.
- **Remaining Gap:** The target performance is ~80%. The remaining gap of roughly 8 percentage points (from ~72.1%) may be closed through advanced feature engineering (e.g., customer-level aggregation, feature interactions, loan type splitting).
- **Next Steps:** Implement advanced feature engineering with non-linear models (Random Forest, XGBoost) to leverage complex feature relationships.
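
The data cleaning rules listed under Preprocessing (stripping `_` and `,` from numeric strings, treating empty strings as missing, and converting "X Years Y Months" to total months) can be sketched in plain Python as below. This is a minimal illustration rather than the project's actual code: the helper names `parse_dirty_number` and `credit_history_months` are hypothetical, and returning `None` for unparseable values (so the median imputer can fill them later) is an assumption.

```python
import re
from typing import Optional

# Matches strings such as "22 Years and 1 Months" or "5 Years 3 Months".
# The exact wording between the two numbers is an assumption, so any
# non-digit text is allowed there.
_HISTORY_RE = re.compile(r"(\d+)\s*Years?\D*(\d+)\s*Months?", re.IGNORECASE)


def parse_dirty_number(raw) -> Optional[float]:
    """Remove `_` and `,` from a numeric string and convert it to float.

    Returns None for missing, empty, or unparseable input (assumed behavior).
    """
    if raw is None:
        return None
    text = str(raw).replace("_", "").replace(",", "").strip()
    if not text:
        return None
    try:
        return float(text)
    except ValueError:
        return None


def credit_history_months(raw) -> Optional[int]:
    """Convert "X Years Y Months" into a total number of months."""
    if raw is None:
        return None
    match = _HISTORY_RE.search(str(raw))
    if match is None:
        return None
    years, months = int(match.group(1)), int(match.group(2))
    return years * 12 + months
```

With pandas, helpers like these could be applied per column, for example `df["Age"].map(parse_dirty_number)`, before the imputation and scaling steps run.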