# Feature Engineering Results

## 1. Implemented Strategy
We implemented a comprehensive feature engineering pipeline with customer-level aggregation:

### Data Cleaning & Type Conversion
-   **Numeric Parsing:** Cleaned `Age`, `Annual_Income`, `Outstanding_Debt`, `Num_of_Delayed_Payment`, `Num_of_Loan`, `Amount_invested_monthly`, `Monthly_Balance`, `Changed_Credit_Limit` (removed underscores, commas, and handled invalid values).
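A minimal sketch of this cleaning step, assuming pandas (the notebook's exact invalid-value handling may differ; the sample values below are illustrative):

```python
import pandas as pd

def clean_numeric(series: pd.Series) -> pd.Series:
    """Strip underscores and commas, then coerce anything unparseable to NaN."""
    return pd.to_numeric(
        series.astype(str).str.replace(r"[_,]", "", regex=True),
        errors="coerce",
    )

# Illustrative dirty values of the kind described above.
df = pd.DataFrame({"Annual_Income": ["34_847.84", "1,200", "_"]})
df["Annual_Income"] = clean_numeric(df["Annual_Income"])
```

The same helper applies to each of the columns listed above.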

### Feature Extraction
-   **Credit History Age:** Converted from "X Years Y Months" format to total months.
-   **Loan Features:**
    -   `Loan_Count_Calculated`: Count of different loan types.
    -   `Loan_<Type>`: One-Hot encoded top 8 loan types (Auto, Credit-Builder, Personal, Home Equity, Mortgage, Student, Debt Consolidation, Payday).
-   **Financial Ratios:**
    -   `Debt_to_Income_Ratio`: Outstanding Debt / Annual Income (financial risk metric).
    -   `Debt_Per_Loan`: Outstanding Debt / Loan Count.
    -   `Installment_to_Income`: Monthly EMI / Monthly Salary (debt service capacity).
    -   `Delayed_Per_Loan`: Num of Delayed Payments / Loan Count (payment reliability).
-   **Interaction Features:**
    -   `DTI_x_LoanCount`: Debt-to-Income × Loan Count (combined risk).
    -   `Log_Annual_Income`: Log-transformed income (handles skewness).
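The extraction steps above can be sketched as follows (the regex for the history format and the zero-loan guard are assumptions, not the notebook's exact code; sample rows are illustrative):

```python
import re
import numpy as np
import pandas as pd

def history_to_months(value) -> float:
    """Parse an 'X Years and Y Months' string into total months."""
    if not isinstance(value, str):
        return np.nan
    years = re.search(r"(\d+)\s*Year", value)
    months = re.search(r"(\d+)\s*Month", value)
    return (int(years.group(1)) if years else 0) * 12 + \
           (int(months.group(1)) if months else 0)

df = pd.DataFrame({
    "Credit_History_Age": ["22 Years and 1 Months", None],
    "Outstanding_Debt": [809.98, 605.03],
    "Annual_Income": [19114.12, 34847.84],
    "Loan_Count_Calculated": [4, 2],
})
df["Credit_History_Months"] = df["Credit_History_Age"].map(history_to_months)

# Financial ratios and interactions (guard against division by zero loans).
df["Debt_to_Income_Ratio"] = df["Outstanding_Debt"] / df["Annual_Income"]
df["Debt_Per_Loan"] = df["Outstanding_Debt"] / df["Loan_Count_Calculated"].replace(0, np.nan)
df["DTI_x_LoanCount"] = df["Debt_to_Income_Ratio"] * df["Loan_Count_Calculated"]
df["Log_Annual_Income"] = np.log1p(df["Annual_Income"])
```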

### Imputation & Aggregation
-   **Grouped Imputation:** Median salary imputation grouped by Occupation (more accurate than global median).
-   **Customer-Level Aggregation:** Reduced 150,000 monthly rows to 25,000 unique customers:
    -   **Stable fields** (Age, loan flags): First value.
    -   **Monthly-changing fields** (Income, Balance, EMI): Mean.
    -   **Count fields** (Delayed payments, inquiries): Sum.
    -   **Categorical fields** (Payment Behaviour, Credit Mix): Mode.
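The imputation and aggregation rules above can be sketched on a toy panel (two customers, two months each; column subset and values are illustrative):

```python
import pandas as pd

# Toy monthly panel standing in for the 150,000-row dataset.
df = pd.DataFrame({
    "Customer_ID": ["C1", "C1", "C2", "C2"],
    "Occupation": ["Engineer", "Engineer", "Lawyer", "Lawyer"],
    "Age": [23, 23, 30, 30],
    "Annual_Income": [1000.0, 1200.0, 2000.0, 2000.0],
    "Monthly_Inhand_Salary": [100.0, None, 200.0, 200.0],
    "Num_of_Delayed_Payment": [1, 2, 0, 3],
    "Credit_Mix": ["Good", "Good", "Bad", "Bad"],
})

# Grouped imputation: fill missing salary with that occupation's median.
df["Monthly_Inhand_Salary"] = df.groupby("Occupation")["Monthly_Inhand_Salary"] \
    .transform(lambda s: s.fillna(s.median()))

# Customer-level aggregation: one row per customer, rule chosen per field type.
customers = df.groupby("Customer_ID").agg(
    Age=("Age", "first"),                                      # stable field
    Annual_Income=("Annual_Income", "mean"),                   # monthly-changing
    Num_of_Delayed_Payment=("Num_of_Delayed_Payment", "sum"),  # count field
    Credit_Mix=("Credit_Mix", lambda s: s.mode().iloc[0]),     # categorical
)
```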

### Encoding & Scaling
-   **Ordinal Encoding:** Credit_Mix (Bad=0, Standard=1, Good=2).
-   **One-Hot Encoding:** Occupation, Payment_Behaviour, Month.
-   **Label Encoding:** Target (Credit_Score).
-   **No Global Scaling:** Features remain unscaled; tree-based models are invariant to feature scaling, so scaling adds nothing for them.
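The three encoding choices above can be sketched as follows (a minimal pandas/scikit-learn version; sample values are illustrative):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    "Credit_Mix": ["Good", "Bad", "Standard"],
    "Occupation": ["Engineer", "Lawyer", "Engineer"],
    "Credit_Score": ["Good", "Poor", "Standard"],
})

# Ordinal encoding: Credit_Mix has a natural order.
df["Credit_Mix"] = df["Credit_Mix"].map({"Bad": 0, "Standard": 1, "Good": 2})

# One-hot encoding for nominal categoricals.
df = pd.get_dummies(df, columns=["Occupation"])

# Label encoding for the three-class target.
df["Credit_Score"] = LabelEncoder().fit_transform(df["Credit_Score"])
```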



## 2. Model Performance Comparison



| Model | Dataset | Accuracy | Notes |
| :--- | :--- | :--- | :--- |
| **Baseline** (simple Logistic Regression) | 5-fold CV | **0.7211** | Strong linear baseline |
| **Logistic Regression** (feature engineering + scaling) | Validation split | **0.6544** | Linear model struggles with the engineered features |
| **Random Forest** (hyperparameter-tuned) | Validation split | **0.7340** | ✅ **Exceeds baseline by 1.3 percentage points** |



### Random Forest Class-Wise Performance



| Class | Precision | Recall | F1-Score | Support |
| :--- | :--- | :--- | :--- | :--- |
| Poor (0) | 0.59 | 0.84 | 0.69 | 501 |
| Standard (1) | 0.73 | 0.81 | 0.77 | 832 |
| Good (2) | 0.86 | 0.63 | 0.73 | 1,167 |
| **Weighted Avg** | **0.76** | **0.73** | **0.73** | **2,500** |



## 3. Key Insights



### Why Logistic Regression Performance Dropped
1. **Non-linear Feature Interactions:** Engineered features (DTI × LoanCount, Debt_Per_Loan) capture non-linear relationships that a linear model cannot exploit.
2. **Curse of Dimensionality:** One-hot encoding several categorical features (Occupation, Payment_Behaviour) expanded the feature space faster than the linear model's regularization could compensate.
3. **Information Loss:** Dropping `Annual_Income` in favor of `Log_Annual_Income` may have discarded linear signal if the true relationship is not purely logarithmic.

### Why Random Forest Excels
1. **Non-linear Decision Boundaries:** Trees naturally capture feature interactions without explicit engineering.
2. **High Recall on Poor Scores:** 84% recall on class 0 (Poor) is critical for risk management—catches risky customers.
3. **Balanced Performance:** Weighted F1-score of 0.73 shows good generalization across all credit score classes.
4. **Robustness:** Tuned hyperparameters (`max_depth=10`, `class_weight='balanced'`) limit overfitting while still exploiting the engineered features.
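A sketch of the tuned configuration: only `max_depth=10` and the balanced class weights come from this report; `n_estimators`, the synthetic data, and the split are illustrative stand-ins.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic 3-class stand-in for the customer-level dataset.
X, y = make_classification(n_samples=500, n_classes=3, n_informative=6,
                           random_state=42)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, random_state=42)

rf = RandomForestClassifier(
    max_depth=10,             # named in the report
    class_weight="balanced",  # named in the report
    n_estimators=200,         # illustrative, not from the report
    random_state=42,
)
rf.fit(X_tr, y_tr)
acc = rf.score(X_val, y_val)
```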



## 4. Recommendations for Next Phase (04_model_optimization.ipynb)



-   **Keep the engineered features:** they provide valuable signal for non-linear models.
-   **Continue with tree-based models:** Random Forest and XGBoost exploit feature complexity better than linear models.
-   **Perform feature importance analysis:** identify which engineered features drive predictions.
-   **Cross-validate with stratified k-fold:** confirm the 73.4% accuracy is stable across data splits.
-   **Compare with XGBoost:** gradient boosting may outperform bagging approaches.
-   **Optimize class-wise:** focus on improving recall for "Poor" customers (high-risk detection).
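The stratified k-fold check recommended above could look roughly like this (synthetic data and model settings are placeholders for the real pipeline):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic 3-class stand-in; the real check would use the engineered features.
X, y = make_classification(n_samples=300, n_classes=3, n_informative=5,
                           random_state=0)

# Stratification keeps the Poor/Standard/Good class ratios in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(
    RandomForestClassifier(max_depth=10, random_state=0), X, y, cv=cv
)
```

A small spread across the five scores would support the 73.4% figure being stable rather than a lucky split.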



## 5. Data Quality Improvements Made

-   ✅ Handled missing values in `Monthly_Inhand_Salary`, `Type_of_Loan`, `Credit_History_Age`.
-   ✅ Cleaned numeric columns containing special characters (underscores, commas).
-   ✅ Clipped outliers in `Num_of_Delayed_Payment` at the 99th percentile.
-   ✅ Aggregated to customer level to prevent temporal leakage and reduce noise.
-   ✅ Verified no NaN values remain before model training.
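The 99th-percentile clipping can be sketched in one line with pandas (sample values are illustrative):

```python
import pandas as pd

# An implausibly large delay count among otherwise small values.
s = pd.Series([1, 2, 3, 2, 1, 400])

cap = s.quantile(0.99)      # 99th-percentile threshold
clipped = s.clip(upper=cap)  # cap extreme values instead of dropping rows
```

Clipping (rather than dropping) keeps every customer in the dataset while bounding the influence of extreme values.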