# Feature Engineering Results

## 1. Implemented Strategy
We implemented a comprehensive feature engineering pipeline with customer-level aggregation:
### Data Cleaning & Type Conversion
- Numeric Parsing: Cleaned `Age`, `Annual_Income`, `Outstanding_Debt`, `Num_of_Delayed_Payment`, `Num_of_Loan`, `Amount_invested_monthly`, `Monthly_Balance`, and `Changed_Credit_Limit` (removed underscores and commas, handled invalid values), as sketched below.
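A minimal sketch of this parsing step, assuming the raw monthly DataFrame `df`; the `clean_numeric` helper name is illustrative, not the pipeline's actual function:

```python
import pandas as pd

# Columns that arrive as strings with stray underscores/commas, e.g. "36_".
NUMERIC_COLS = [
    "Age", "Annual_Income", "Outstanding_Debt", "Num_of_Delayed_Payment",
    "Num_of_Loan", "Amount_invested_monthly", "Monthly_Balance",
    "Changed_Credit_Limit",
]

def clean_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Strip formatting characters and coerce leftovers to NaN."""
    df = df.copy()
    for col in NUMERIC_COLS:
        df[col] = pd.to_numeric(
            df[col].astype(str).str.replace(r"[_,]", "", regex=True),
            errors="coerce",  # invalid values become NaN for later imputation
        )
    return df
```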
### Feature Extraction
- Credit History Age: Converted from "X Years Y Months" format to total months (see the sketch after this list).
- Loan Features:
  - `Loan_Count_Calculated`: Count of distinct loan types held.
  - `Loan_<Type>`: One-hot encoded top 8 loan types (Auto, Credit-Builder, Personal, Home Equity, Mortgage, Student, Debt Consolidation, Payday).
- Financial Ratios:
  - `Debt_to_Income_Ratio`: Outstanding Debt / Annual Income (financial risk metric).
  - `Debt_Per_Loan`: Outstanding Debt / Loan Count.
  - `Installment_to_Income`: Monthly EMI / Monthly Salary (debt service capacity).
  - `Delayed_Per_Loan`: Num of Delayed Payments / Loan Count (payment reliability).
- Interaction Features:
  - `DTI_x_LoanCount`: Debt-to-Income × Loan Count (combined risk).
  - `Log_Annual_Income`: Log-transformed income (handles skewness).
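A sketch of the history conversion and the ratio/interaction features; `Total_EMI_per_month` is an assumed column name for the monthly EMI, the rest appear elsewhere in this report:

```python
import numpy as np
import pandas as pd

def history_to_months(s: pd.Series) -> pd.Series:
    # "22 Years and 1 Months" -> 22 * 12 + 1 = 265; non-matches become NaN.
    parts = s.str.extract(r"(\d+)\s*Years?\D*(\d+)\s*Months?")
    return parts[0].astype(float) * 12 + parts[1].astype(float)

def add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df["Credit_History_Months"] = history_to_months(df["Credit_History_Age"])
    # .replace(0, np.nan) guards every ratio against division by zero.
    income = df["Annual_Income"].replace(0, np.nan)
    loans = df["Loan_Count_Calculated"].replace(0, np.nan)
    df["Debt_to_Income_Ratio"] = df["Outstanding_Debt"] / income
    df["Debt_Per_Loan"] = df["Outstanding_Debt"] / loans
    df["Installment_to_Income"] = (
        df["Total_EMI_per_month"] / df["Monthly_Inhand_Salary"].replace(0, np.nan)
    )
    df["Delayed_Per_Loan"] = df["Num_of_Delayed_Payment"] / loans
    df["DTI_x_LoanCount"] = df["Debt_to_Income_Ratio"] * df["Loan_Count_Calculated"]
    df["Log_Annual_Income"] = np.log1p(df["Annual_Income"])
    return df
```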
### Imputation & Aggregation
- Grouped Imputation: Median salary imputation grouped by `Occupation` (more accurate than a global median).
- Customer-Level Aggregation: Reduced 150,000 monthly rows to 25,000 unique customers (see the sketch after this list):
  - Stable fields (Age, loan flags): First value.
  - Monthly-changing fields (Income, Balance, EMI): Mean.
  - Count fields (Delayed payments, inquiries): Sum.
  - Categorical fields (Payment Behaviour, Credit Mix): Mode.
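A sketch of both steps, assuming the cleaned monthly frame `df` keyed by a `Customer_ID` column; the mode helper and the exact column set shown here are illustrative:

```python
import numpy as np

# Grouped imputation: fill missing salary with the occupation's median.
df["Monthly_Inhand_Salary"] = df["Monthly_Inhand_Salary"].fillna(
    df.groupby("Occupation")["Monthly_Inhand_Salary"].transform("median")
)

def mode_or_nan(s):
    m = s.mode()
    return m.iloc[0] if not m.empty else np.nan

# Roll each customer's monthly rows up to one row, rule chosen per field type.
customers = df.groupby("Customer_ID").agg(
    Age=("Age", "first"),                                      # stable
    Annual_Income=("Annual_Income", "mean"),                   # monthly-changing
    Monthly_Balance=("Monthly_Balance", "mean"),
    Num_of_Delayed_Payment=("Num_of_Delayed_Payment", "sum"),  # count
    Num_Credit_Inquiries=("Num_Credit_Inquiries", "sum"),
    Credit_Mix=("Credit_Mix", mode_or_nan),                    # categorical
    Payment_Behaviour=("Payment_Behaviour", mode_or_nan),
)
```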
### Encoding & Scaling
- Ordinal Encoding: `Credit_Mix` (Bad=0, Standard=1, Good=2).
- One-Hot Encoding: `Occupation`, `Payment_Behaviour`, `Month`.
- Label Encoding: Target (`Credit_Score`); the encoding steps are sketched after this list.
- No Global Scaling: Features are left unscaled; tree-based models are invariant to monotonic feature scaling, so scaling adds no benefit.
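A sketch of the encoding step on the aggregated `customers` frame; the explicit target map is an assumption chosen to match the class indices in the table below (Poor=0, Standard=1, Good=2), and `Month` is kept only because the report lists it among the one-hot columns:

```python
import pandas as pd

# Ordinal encoding for Credit_Mix.
customers["Credit_Mix"] = customers["Credit_Mix"].map(
    {"Bad": 0, "Standard": 1, "Good": 2}
)

# One-hot encoding for the remaining categoricals.
customers = pd.get_dummies(
    customers, columns=["Occupation", "Payment_Behaviour", "Month"]
)

# Encode the target with an explicit map so class indices match the report.
y = customers.pop("Credit_Score").map({"Poor": 0, "Standard": 1, "Good": 2})
X = customers
```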
## 2. Model Performance Comparison
| Model | Dataset | Accuracy | Notes |
|---|---|---|---|
| Baseline (Simple Logistic Regression) | 5-Fold CV | 0.7211 | Strong linear baseline |
| Logistic Regression (with feature engineering + scaling) | Validation Split | 0.6544 | Linear model struggles with complex features |
| Random Forest (hyperparameter tuned) | Validation Split | 0.7340 | ✅ Exceeds baseline by 1.3 percentage points |
### Random Forest Class-Wise Performance
| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| Poor (0) | 0.59 | 0.84 | 0.69 | 501 |
| Standard (1) | 0.73 | 0.81 | 0.77 | 832 |
| Good (2) | 0.86 | 0.63 | 0.73 | 1,167 |
| Weighted Avg | 0.76 | 0.73 | 0.73 | 2,500 |
## 3. Key Insights

### Why Logistic Regression Performance Dropped
- Non-linear Feature Interactions: Engineered features (`DTI_x_LoanCount`, `Debt_Per_Loan`) capture non-linear relationships that a linear model cannot exploit.
- Curse of Dimensionality: One-hot encoding multiple categorical features (`Occupation`, `Payment_Behaviour`) enlarged the feature space; without stronger regularization, the linear model struggles in the added dimensions.
- Information Loss: Dropping `Annual_Income` in favor of `Log_Annual_Income` may have removed linear signal if the true relationship isn't purely logarithmic.
### Why Random Forest Excels
- Non-linear Decision Boundaries: Trees naturally capture feature interactions without explicit engineering.
- High Recall on Poor Scores: 84% recall on class 0 (Poor) is critical for risk management, since it catches most genuinely risky customers.
- Balanced Performance: Weighted F1-score of 0.73 shows good generalization across all credit score classes.
- Robustness: Hyperparameters (`max_depth=10`, `class_weight='balanced'`) curb overfitting while still exploiting the engineered features (see the sketch below).
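A sketch of the tuned configuration, assuming scikit-learn and prepared `X_train`/`y_train` splits; only `max_depth` and the balanced class weighting are reported above, the other arguments are illustrative:

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=300,         # illustrative; not reported
    max_depth=10,             # caps tree depth to limit overfitting
    class_weight="balanced",  # reweights classes by inverse frequency
    random_state=42,
    n_jobs=-1,
)
rf.fit(X_train, y_train)
```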
## 4. Recommendations for Next Phase (04_model_optimization.ipynb)

- ✅ Keep the engineered features: they provide valuable signal for non-linear models.
- ✅ Continue with tree-based models: Random Forest and XGBoost exploit feature complexity better than linear models.
- ✅ Perform feature importance analysis: identify which engineered features drive predictions.
- ✅ Cross-validate with stratified k-fold: verify that the 73.4% accuracy is stable across data splits (see the sketch after this list).
- ✅ Compare with XGBoost: gradient boosting may outperform bagging approaches.
- ✅ Class-wise optimization: focus on improving recall for "Poor" customers (high-risk detection).
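A sketch of the stratified check and the importance ranking, reusing `rf`, `X`, and `y` from the earlier sketches (`cross_val_score` refits the model on each fold; the importance ranking uses the already-fitted forest):

```python
import pandas as pd
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Stratified 5-fold CV: does the 0.7340 validation accuracy hold across splits?
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(rf, X, y, cv=cv, scoring="accuracy")
print(f"CV accuracy: {scores.mean():.4f} +/- {scores.std():.4f}")

# Impurity-based importances from the fitted forest (top 15 features).
importances = pd.Series(rf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(15))
```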
## 5. Data Quality Improvements Made

- ✅ Handled missing values in `Monthly_Inhand_Salary`, `Type_of_Loan`, and `Credit_History_Age`.
- ✅ Cleaned numeric columns containing special characters (underscores, commas).
- ✅ Removed outliers in `Num_of_Delayed_Payment` (clipped at the 99th percentile).
- ✅ Aggregated to customer level to prevent temporal leakage and reduce noise.
- ✅ Verified no NaN values remain before model training.