# Baseline Model Documentation

## 1. Pipeline Overview
The baseline model has been upgraded to replicate a high-performing preprocessing pipeline, significantly improving upon the initial minimal baseline.

### Preprocessing
-   **Dropped Columns:** `ID`, `Customer_ID`, `Name`, `SSN`, `Credit_Score` (Target).
-   **Data Cleaning:**
    -   **Numeric Parsing:** Cleaned `Age`, `Annual_Income`, `Outstanding_Debt`, `Num_of_Delayed_Payment`, `Num_of_Loan`, `Amount_invested_monthly`, `Monthly_Balance`, `Changed_Credit_Limit` (removed `_`, `,` and handled empty strings).
    -   **Credit_History_Age:** Parsed "X Years Y Months" into total months.
-   **Imputation & Scaling (Numeric):**
    -   `SimpleImputer(strategy='median')`
    -   `StandardScaler()`
-   **Encoding (Categorical):**
    -   `SimpleImputer(strategy='most_frequent')`
    -   `OneHotEncoder(handle_unknown='ignore')`
    -   Target (`Credit_Score`): Label Encoded.

### Model
-   **Algorithm:** Logistic Regression
-   **Parameters:** `max_iter=1000`, `class_weight='balanced'`, `random_state=42`
-   **Validation:** Stratified K-Fold Cross-Validation (5 Splits).

## 2. Performance Results

| Metric | Score |
| :--- | :--- |
| **Mean Accuracy** | **0.7211** (+/- 0.0020) |
| **Mean ROC-AUC** | **0.8647** |

### Fold-by-Fold Breakdown

| Fold | Accuracy | ROC-AUC |
| :--- | :--- | :--- |
| Fold 1 | 0.7228 | 0.8660 |
| Fold 2 | 0.7238 | 0.8647 |
| Fold 3 | 0.7180 | 0.8642 |
| Fold 4 | 0.7202 | 0.8635 |
| Fold 5 | 0.7204 | 0.8648 |

## 3. Key Findings
-   **Significant Improvement:** Accuracy improved from ~62% to ~72.1% by correctly handling dirty numeric columns (Age, Annual_Income, etc.) and using a robust preprocessing pipeline.
-   **Robustness:** Stratified K-Fold CV (5 splits) ensures the results are stable with low variance (+/- 0.0020), indicating the model generalizes well.
-   **Strong Discrimination:** ROC-AUC of 0.8647 shows the model effectively distinguishes between credit score classes despite being a simple linear model.
-   **Remaining Gap:** The target performance is ~80%. The 9% gap can be closed through advanced feature engineering (e.g., customer-level aggregation, feature interactions, loan type splitting).
-   **Next Steps:** Implement advanced feature engineering with non-linear models (Random Forest, XGBoost) to leverage complex feature relationships.