FinRisk-AI / docs /03_feature_engineering.md
# Feature Engineering Results
## 1. Implemented Strategy
We implemented a comprehensive feature engineering pipeline with customer-level aggregation:
### Data Cleaning & Type Conversion
- **Numeric Parsing:** Cleaned `Age`, `Annual_Income`, `Outstanding_Debt`, `Num_of_Delayed_Payment`, `Num_of_Loan`, `Amount_invested_monthly`, `Monthly_Balance`, `Changed_Credit_Limit` (removed underscores, commas, and handled invalid values).
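The cleaning step above can be sketched as follows. This is an illustrative reconstruction, not the project's exact code: the regex and the coerce-to-NaN fallback are assumptions, while the column names match the document.

```python
import pandas as pd

# Columns named in the document as needing numeric cleaning.
NUMERIC_COLS = [
    "Age", "Annual_Income", "Outstanding_Debt", "Num_of_Delayed_Payment",
    "Num_of_Loan", "Amount_invested_monthly", "Monthly_Balance",
    "Changed_Credit_Limit",
]

def clean_numeric(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    for col in NUMERIC_COLS:
        # Strip underscores/commas left over from the raw export, then coerce;
        # anything unparseable becomes NaN for later imputation.
        out[col] = pd.to_numeric(
            out[col].astype(str).str.replace(r"[_,]", "", regex=True),
            errors="coerce",
        )
    return out
```

Coercing invalid values to NaN (rather than raising) lets the grouped-imputation step below handle them uniformly.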
### Feature Extraction
- **Credit History Age:** Converted from "X Years Y Months" format to total months.
- **Loan Features:**
  - `Loan_Count_Calculated`: Count of different loan types.
  - `Loan_<Type>`: One-hot encoded top 8 loan types (Auto, Credit-Builder, Personal, Home Equity, Mortgage, Student, Debt Consolidation, Payday).
- **Financial Ratios:**
  - `Debt_to_Income_Ratio`: Outstanding Debt / Annual Income (financial risk metric).
  - `Debt_Per_Loan`: Outstanding Debt / Loan Count.
  - `Installment_to_Income`: Monthly EMI / Monthly Salary (debt service capacity).
  - `Delayed_Per_Loan`: Num of Delayed Payments / Loan Count (payment reliability).
- **Interaction Features:**
  - `DTI_x_LoanCount`: Debt-to-Income × Loan Count (combined risk).
  - `Log_Annual_Income`: Log-transformed income (handles skewness).
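The extraction steps above can be sketched in one function. Feature names follow the document; the regex for `Credit_History_Age` and the epsilon guard against division by zero are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # "X Years Y Months" -> total months (regex tolerates "and" between parts).
    parts = out["Credit_History_Age"].str.extract(
        r"(?P<y>\d+)\s*Years?.*?(?P<m>\d+)\s*Months?"
    )
    out["Credit_History_Months"] = parts["y"].astype(float) * 12 + parts["m"].astype(float)

    # Financial ratios; a small epsilon avoids division by zero.
    eps = 1e-9
    out["Debt_to_Income_Ratio"] = out["Outstanding_Debt"] / (out["Annual_Income"] + eps)
    out["Debt_Per_Loan"] = out["Outstanding_Debt"] / (out["Loan_Count_Calculated"] + eps)
    out["Delayed_Per_Loan"] = out["Num_of_Delayed_Payment"] / (out["Loan_Count_Calculated"] + eps)

    # Interaction and log transform.
    out["DTI_x_LoanCount"] = out["Debt_to_Income_Ratio"] * out["Loan_Count_Calculated"]
    out["Log_Annual_Income"] = np.log1p(out["Annual_Income"])
    return out
```

`np.log1p` is used rather than `np.log` so zero incomes remain finite; whether the project did the same is an assumption.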
### Imputation & Aggregation
- **Grouped Imputation:** Median salary imputation grouped by Occupation (more accurate than global median).
- **Customer-Level Aggregation:** Reduced 150,000 monthly rows to 25,000 unique customers:
  - **Stable fields** (Age, loan flags): First value.
  - **Monthly-changing fields** (Income, Balance, EMI): Mean.
  - **Count fields** (Delayed payments, inquiries): Sum.
  - **Categorical fields** (Payment Behaviour, Credit Mix): Mode.
### Encoding & Scaling
- **Ordinal Encoding:** Credit_Mix (Bad=0, Standard=1, Good=2).
- **One-Hot Encoding:** Occupation, Payment_Behaviour, Month.
- **Label Encoding:** Target (Credit_Score).
- **No Global Scaling:** Features are left unscaled for the tree-based models, which are invariant to feature scaling; scaling is applied separately only where linear models require it.
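The three encoding schemes above can be sketched together. The ordinal mapping matches the text; the choice of `pd.get_dummies` and `LabelEncoder` is an assumption about the implementation (`Month` is omitted here since the customer-level table no longer carries it).

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Ordinal scheme stated in the document: Bad=0, Standard=1, Good=2.
CREDIT_MIX_MAP = {"Bad": 0, "Standard": 1, "Good": 2}

def encode(df: pd.DataFrame) -> tuple[pd.DataFrame, pd.Series]:
    out = df.copy()
    # Ordinal encoding preserves the quality ordering of Credit_Mix.
    out["Credit_Mix"] = out["Credit_Mix"].map(CREDIT_MIX_MAP)
    # One-hot encoding for unordered categoricals.
    out = pd.get_dummies(out, columns=["Occupation", "Payment_Behaviour"])
    # Label-encode the target and drop it from the feature table.
    y = pd.Series(
        LabelEncoder().fit_transform(out.pop("Credit_Score")), name="Credit_Score"
    )
    return out, y
```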
## 2. Model Performance Comparison
| Model | Dataset | Accuracy | Notes |
| :--- | :--- | :--- | :--- |
| **Baseline** (Simple Logistic Regression) | 5-Fold CV | **0.7211** | Strong linear baseline |
| **Logistic Regression** (with feature engineering + scaling) | Validation Split | **0.6544** | Linear model struggles with complex features |
| **Random Forest** (hyperparameter tuned) | Validation Split | **0.7340** | ✅ **Exceeds baseline by 1.3 percentage points** |
### Random Forest Class-Wise Performance
| Class | Precision | Recall | F1-Score | Support |
| :--- | :--- | :--- | :--- | :--- |
| Poor (0) | 0.59 | 0.84 | 0.69 | 501 |
| Standard (1) | 0.73 | 0.81 | 0.77 | 832 |
| Good (2) | 0.86 | 0.63 | 0.73 | 1,167 |
| **Weighted Avg** | **0.76** | **0.73** | **0.73** | **2,500** |
## 3. Key Insights
### Why Logistic Regression Performance Dropped
1. **Non-linear Feature Interactions:** Engineered features (DTI × LoanCount, Debt_Per_Loan) capture non-linear relationships that linear models cannot leverage.
2. **Curse of Dimensionality:** One-hot encoding multiple categorical features (Occupation, Payment_Behaviour) expanded the feature space, and without stronger regularization the linear model could not exploit the extra dimensions.
3. **Information Loss:** Dropping `Annual_Income` in favor of `Log_Annual_Income` may have removed linear signal if the true relationship isn't purely logarithmic.
### Why Random Forest Excels
1. **Non-linear Decision Boundaries:** Trees naturally capture feature interactions without explicit engineering.
2. **High Recall on Poor Scores:** 84% recall on class 0 (Poor) is critical for risk management—catches risky customers.
3. **Balanced Performance:** Weighted F1-score of 0.73 shows good generalization across all credit score classes.
4. **Robustness:** Tuned hyperparameters (`max_depth=10`, `class_weight='balanced'`) limit overfitting while still exploiting the engineered features.
## 4. Recommendations for Next Phase (04_model_optimization.ipynb)
- **Keep the engineered features** — they provide valuable signal for non-linear models.
- **Continue with tree-based models** — Random Forest and XGBoost exploit the engineered feature interactions better than linear models.
- **Perform feature importance analysis** — identify which engineered features drive predictions.
- **Cross-validate with stratified k-fold** — confirm the 73.4% validation accuracy is stable across data splits.
- **Compare with XGBoost** — gradient boosting may outperform the bagging approach.
- **Optimize class-wise** — focus on improving recall for "Poor" customers (high-risk detection).
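The stratified k-fold check recommended above can be sketched as follows, here on synthetic data for a self-contained example; the hyperparameters mirror those reported for the tuned Random Forest.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic 3-class stand-in for the customer-level feature table.
X, y = make_classification(
    n_samples=1000, n_classes=3, n_informative=6, random_state=42
)

# Hyperparameters reported for the tuned model.
model = RandomForestClassifier(
    max_depth=10, class_weight="balanced", random_state=42
)

# Stratification keeps the Poor/Standard/Good proportions in every fold,
# so the fold-to-fold spread reflects genuine variance, not class imbalance.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
print(f"accuracy: {scores.mean():.4f} ± {scores.std():.4f}")
```

A small standard deviation across folds would support treating the 73.4% validation figure as stable.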
## 5. Data Quality Improvements Made
- ✅ Handled missing values in `Monthly_Inhand_Salary`, `Type_of_Loan`, `Credit_History_Age`.
- ✅ Cleaned numeric columns with special characters (underscores, commas).
- ✅ Capped outliers in `Num_of_Delayed_Payment` by clipping at the 99th percentile.
- ✅ Aggregated to customer level to prevent temporal leakage and reduce noise.
- ✅ Verified no NaN values remain before model training.
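The percentile cap mentioned above amounts to a one-line clip; the helper name and default quantile here are illustrative assumptions.

```python
import pandas as pd

def cap_outliers(s: pd.Series, q: float = 0.99) -> pd.Series:
    # Winsorize the upper tail: values above the q-th quantile are
    # set to that quantile rather than dropped, so no rows are lost.
    return s.clip(upper=s.quantile(q))
```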