
Feature Engineering Results

1. Implemented Strategy

We implemented a comprehensive feature engineering pipeline with customer-level aggregation:

Data Cleaning & Type Conversion

  • Numeric Parsing: Cleaned Age, Annual_Income, Outstanding_Debt, Num_of_Delayed_Payment, Num_of_Loan, Amount_invested_monthly, Monthly_Balance, Changed_Credit_Limit (removed underscores, commas, and handled invalid values).
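The numeric parsing step can be sketched as follows. This is a minimal illustration in plain Python, not the project's actual code: the helper name `clean_numeric` and the choice to map unparseable values to NaN are assumptions.

```python
import math
import re


def clean_numeric(raw):
    """Return a float, or NaN when the value cannot be parsed (assumed convention)."""
    if raw is None:
        return math.nan
    # Drop stray underscores, commas, and whitespace before converting.
    text = re.sub(r"[_,\s]", "", str(raw))
    try:
        return float(text)
    except ValueError:
        return math.nan


# Illustrative raw values, not taken from the dataset.
samples = ["34_", "1,250.50", "_4500_", "abc", ""]
cleaned = [clean_numeric(s) for s in samples]
```

Mapping invalid entries to NaN keeps them visible to the later imputation step instead of silently turning them into zeros.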

Feature Extraction

  • Credit History Age: Converted from "X Years Y Months" format to total months.
  • Loan Features:
    • Loan_Count_Calculated: Count of different loan types.
    • Loan_<Type>: Binary indicator flags for the top 8 loan types (Auto, Credit-Builder, Personal, Home Equity, Mortgage, Student, Debt Consolidation, Payday). A customer can hold several loan types, so more than one flag may be set.
  • Financial Ratios:
    • Debt_to_Income_Ratio: Outstanding Debt / Annual Income (financial risk metric).
    • Debt_Per_Loan: Outstanding Debt / Loan Count.
    • Installment_to_Income: Monthly EMI / Monthly Salary (debt service capacity).
    • Delayed_Per_Loan: Num of Delayed Payments / Loan Count (payment reliability).
  • Interaction and Transformed Features:
    • DTI_x_LoanCount: Debt-to-Income × Loan Count (combined risk).
    • Log_Annual_Income: Log-transformed income (handles skewness).
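The extraction rules above can be sketched in a few lines of plain Python. Several details are assumptions made for illustration: the column name `Total_EMI_per_month`, the zero-denominator convention in `safe_ratio`, and the use of `log1p` for the log transform. The sample row is made up.

```python
import math
import re


def credit_history_months(text):
    """Convert 'X Years Y Months' (optionally with 'and') to total months."""
    match = re.search(r"(\d+)\s+Years?\s+(?:and\s+)?(\d+)\s+Months?", str(text))
    if match is None:
        return None  # left for the imputation step
    return int(match.group(1)) * 12 + int(match.group(2))


def safe_ratio(numerator, denominator):
    """Divide, returning 0.0 for a zero denominator (assumed convention)."""
    return numerator / denominator if denominator else 0.0


def engineer(row):
    """Build the ratio, interaction, and log features for one record."""
    loans = row["Loan_Count_Calculated"]
    dti = safe_ratio(row["Outstanding_Debt"], row["Annual_Income"])
    return {
        "Credit_History_Months": credit_history_months(row["Credit_History_Age"]),
        "Debt_to_Income_Ratio": dti,
        "Debt_Per_Loan": safe_ratio(row["Outstanding_Debt"], loans),
        "Installment_to_Income": safe_ratio(
            row["Total_EMI_per_month"], row["Monthly_Inhand_Salary"]
        ),
        "Delayed_Per_Loan": safe_ratio(row["Num_of_Delayed_Payment"], loans),
        "DTI_x_LoanCount": dti * loans,
        # log1p is an assumption; the source only says "log-transformed".
        "Log_Annual_Income": math.log1p(row["Annual_Income"]),
    }


# Made-up record for illustration only.
row = {
    "Credit_History_Age": "22 Years and 1 Months",
    "Outstanding_Debt": 800.0,
    "Annual_Income": 20000.0,
    "Monthly_Inhand_Salary": 1600.0,
    "Total_EMI_per_month": 50.0,
    "Num_of_Delayed_Payment": 6,
    "Loan_Count_Calculated": 4,
}
features = engineer(row)
```

Guarding the divisions matters here because customers with no loans would otherwise produce division-by-zero errors in `Debt_Per_Loan` and `Delayed_Per_Loan`.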

Imputation & Aggregation

  • Grouped Imputation: Missing salary values filled with the median salary of the same Occupation (typically more representative than a single global median).
  • Customer-Level Aggregation: Reduced 150,000 monthly rows to 25,000 unique customers:
    • Stable fields (Age, loan flags): First value.
    • Monthly-changing fields (Income, Balance, EMI): Mean.
    • Count fields (Delayed payments, inquiries): Sum.
    • Categorical fields (Payment Behaviour, Credit Mix): Mode.
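The imputation and aggregation rules above can be sketched as below. This is a dependency-free illustration rather than the project's pipeline (which would more likely use a dataframe group-by). The `Customer_ID` key, the global-median fallback, and the toy rows are assumptions.

```python
from collections import Counter, defaultdict
from statistics import mean, median

SALARY = "Monthly_Inhand_Salary"


def impute_salary_by_occupation(rows):
    """Fill missing salaries with the median of the same Occupation."""
    observed = [r[SALARY] for r in rows if r[SALARY] is not None]
    global_median = median(observed)  # fallback for unseen occupations (assumed)
    by_occupation = defaultdict(list)
    for r in rows:
        if r[SALARY] is not None:
            by_occupation[r["Occupation"]].append(r[SALARY])
    medians = {occ: median(vals) for occ, vals in by_occupation.items()}
    for r in rows:
        if r[SALARY] is None:
            r[SALARY] = medians.get(r["Occupation"], global_median)
    return rows


def aggregate_customers(rows):
    """Collapse monthly rows into one record per customer."""
    grouped = defaultdict(list)
    for r in rows:
        grouped[r["Customer_ID"]].append(r)
    customers = {}
    for customer_id, months in grouped.items():
        customers[customer_id] = {
            "Age": months[0]["Age"],                               # stable: first
            SALARY: mean(m[SALARY] for m in months),               # monthly: mean
            "Num_of_Delayed_Payment": sum(
                m["Num_of_Delayed_Payment"] for m in months        # count: sum
            ),
            "Credit_Mix": Counter(
                m["Credit_Mix"] for m in months
            ).most_common(1)[0][0],                                # categorical: mode
        }
    return customers


def make_row(cid, age, salary, delayed, mix):
    return {"Customer_ID": cid, "Occupation": "Engineer", "Age": age,
            SALARY: salary, "Num_of_Delayed_Payment": delayed, "Credit_Mix": mix}


# Toy data for illustration only.
rows = [
    make_row("C1", 30, 4000.0, 2, "Good"),
    make_row("C1", 30, None, 1, "Good"),
    make_row("C1", 30, 4200.0, 0, "Standard"),
    make_row("C2", 45, 5000.0, 3, "Bad"),
    make_row("C2", 45, 5200.0, 4, "Bad"),
]
customers = aggregate_customers(impute_salary_by_occupation(rows))
```

Imputing before aggregating means the per-customer mean is computed over a complete set of months rather than only the months that happened to be observed.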

Encoding & Scaling

  • Ordinal Encoding: Credit_Mix (Bad=0, Standard=1, Good=2).
  • One-Hot Encoding: Occupation, Payment_Behaviour, Month.
  • Label Encoding: Target (Credit_Score: Poor=0, Standard=1, Good=2).
  • No Global Scaling: Features remain unscaled for the tree models, which are invariant to feature scaling; scaling is applied only for the Logistic Regression run.
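As a small illustration of the two encodings listed above, the sketch below maps Credit_Mix to its ordinal codes and expands one categorical field into indicator columns. The helper names and the two example occupations are hypothetical.

```python
# Ordinal mapping as stated in the list above.
CREDIT_MIX_ORDER = {"Bad": 0, "Standard": 1, "Good": 2}


def one_hot(value, categories, prefix):
    """Return one 0/1 indicator per known category."""
    return {f"{prefix}_{cat}": int(value == cat) for cat in categories}


def encode(customer, occupations):
    """Encode a single customer record (only two fields shown)."""
    encoded = {"Credit_Mix": CREDIT_MIX_ORDER[customer["Credit_Mix"]]}
    encoded.update(one_hot(customer["Occupation"], occupations, "Occupation"))
    return encoded


# Hypothetical category list and record.
example = encode(
    {"Credit_Mix": "Good", "Occupation": "Teacher"},
    occupations=["Engineer", "Teacher"],
)
```

Ordinal codes suit Credit_Mix because its levels have a natural order, while Occupation has none and is therefore expanded into separate columns.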

2. Model Performance Comparison

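| Model | Dataset | Accuracy | Notes |
|---|---|---|---|
| Baseline (Simple Logistic Regression) | 5-Fold CV | 0.7211 | Strong linear baseline |
| Logistic Regression (with feature engineering + scaling) | Validation Split | 0.6544 | Linear model struggles with complex features |
| Random Forest (hyperparameter tuned) | Validation Split | 0.7340 | Exceeds baseline by 1.3 percentage points |

The baseline was scored with 5-fold CV and the other two models on a single validation split, so the differences are indicative rather than exact.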

Random Forest Class-Wise Performance

| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| Poor (0) | 0.59 | 0.84 | 0.69 | 501 |
| Standard (1) | 0.73 | 0.81 | 0.77 | 832 |
| Good (2) | 0.86 | 0.63 | 0.73 | 1,167 |
| Weighted Avg | 0.76 | 0.73 | 0.73 | 2,500 |
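The weighted-average row can be checked against the per-class rows with a short calculation. The sketch below recomputes it from the rounded table values, so small differences in the second decimal are expected.

```python
# Per-class values copied from the table above.
report = {
    "Poor": {"precision": 0.59, "recall": 0.84, "f1": 0.69, "support": 501},
    "Standard": {"precision": 0.73, "recall": 0.81, "f1": 0.77, "support": 832},
    "Good": {"precision": 0.86, "recall": 0.63, "f1": 0.73, "support": 1167},
}


def weighted_average(report, metric):
    """Average a metric across classes, weighting each class by its support."""
    total = sum(row["support"] for row in report.values())
    return sum(row[metric] * row["support"] for row in report.values()) / total


weighted = {m: weighted_average(report, m) for m in ("precision", "recall", "f1")}
for name, value in weighted.items():
    print(f"{name}: {value:.4f}")
```

Support-weighted recall is mathematically the same quantity as overall accuracy, so the recall value here should sit close to the 0.7340 accuracy reported for Random Forest.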

3. Key Insights

Why Logistic Regression Performance Dropped

  1. Non-linear Feature Interactions: Engineered features (DTI × LoanCount, Debt_Per_Loan) encode ratio and product relationships whose effect on the target is likely non-linear, which a linear model can only weight additively.
  2. Curse of Dimensionality: One-Hot encoding of multiple categorical features (Occupation, Payment_Behaviour) enlarged the feature space without added regularization for the linear model.
  3. Information Loss: Dropping Annual_Income in favor of Log_Annual_Income may have removed linear signal if the true relationship isn't purely logarithmic.

Why Random Forest Excels

  1. Non-linear Decision Boundaries: Trees naturally capture feature interactions without explicit engineering.
  2. High Recall on Poor Scores: 84% recall on class 0 (Poor) is critical for risk management because it catches most risky customers, at the cost of lower precision (0.59).
  3. Balanced Performance: A weighted F1-score of 0.73, with per-class F1 between 0.69 and 0.77, indicates reasonably even performance across credit score classes on the validation split.
  4. Robustness: Hyperparameters (max_depth=10, balanced class weights) limit overfitting and compensate for class imbalance while leveraging complex features.
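To make the class-weighting point concrete, the sketch below computes "balanced" weights with the formula n_samples / (n_classes × class_count), which is how scikit-learn documents its `class_weight="balanced"` option. That the project used scikit-learn is an assumption, and the counts are the validation supports from the table above, used only as a stand-in for the unknown training distribution.

```python
def balanced_class_weights(counts):
    """Weight each class inversely to its frequency."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in counts.items()
    }


# Validation supports, used here as illustrative class counts.
counts = {"Poor": 501, "Standard": 832, "Good": 1167}
weights = balanced_class_weights(counts)
for label, weight in weights.items():
    print(f"{label}: {weight:.3f}")
```

The rarest class (Poor) receives the largest weight, which is consistent with the model trading precision for recall on that class.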

4. Recommendations for Next Phase (04_model_optimization.ipynb)

  • ✅ Keep the engineered features: They provide valuable signal for non-linear models.
  • ✅ Continue with tree-based models: Random Forest and XGBoost can exploit feature complexity better than linear models.
  • ✅ Perform feature importance analysis: Identify which engineered features drive predictions.
  • ✅ Cross-validate with stratified k-fold: Ensure 73.4% accuracy is stable across data splits.
  • ✅ Compare with XGBoost: Gradient boosting may outperform bagging approaches.
  • ✅ Class-wise optimization: Focus on improving recall for "Poor" customers (high-risk detection).
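The idea behind stratified k-fold can be sketched without any library: split each class separately so that every fold keeps roughly the same class proportions. This is an illustrative stand-in, and in practice a library splitter such as scikit-learn's `StratifiedKFold` would be the usual choice. The function name and the toy label counts are hypothetical.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k, seed=0):
    """Assign sample indices to k folds while preserving class proportions."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class round-robin across the folds.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


# Toy labels: 20 Poor (0), 35 Standard (1), 45 Good (2).
labels = [0] * 20 + [1] * 35 + [2] * 45
folds = stratified_folds(labels, k=5)
```

Each fold then serves once as the validation set, and the spread of the k accuracy scores shows whether the single-split result is stable.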

5. Data Quality Improvements Made

  • ✅ Handled missing values in Monthly_Inhand_Salary, Type_of_Loan, Credit_History_Age.
  • ✅ Cleaned numeric columns with special characters (underscores, commas).
  • ✅ Capped outliers in Num_of_Delayed_Payment (clipped at the 99th percentile).
  • ✅ Aggregated to customer level to prevent temporal leakage and reduce noise.
  • ✅ Verified no NaN values remain before model training.
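The percentile clipping and the final NaN check can be sketched as follows. The interpolation method, the upper-only clipping, and the sample values are assumptions, since the source does not specify them.

```python
import math


def percentile(values, q):
    """Linear-interpolation percentile (q in [0, 100]) of a non-empty list."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100
    lower, upper = math.floor(rank), math.ceil(rank)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def clip_upper(values, q=99):
    """Cap values above the q-th percentile (upper-only clipping is assumed)."""
    cap = percentile(values, q)
    return [min(v, cap) for v in values]


def has_nan(values):
    """True if any float NaN is present."""
    return any(isinstance(v, float) and math.isnan(v) for v in values)


# Made-up counts with one extreme value.
delayed = list(range(100)) + [4000]
clipped = clip_upper(delayed)
```

Clipping keeps every row, unlike dropping outliers, so the customer count is unchanged by this step.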