	# Baseline Model Documentation

	## 1. Pipeline Overview
	The baseline model now replicates a high-performing preprocessing pipeline, raising accuracy from ~62% for the initial minimal baseline to ~72.1%.

	### Preprocessing
	- Dropped Columns: `ID`, `Customer_ID`, `Name`, `SSN`, `Credit_Score` (Target).
	- Data Cleaning:
	  - Numeric Parsing: Cleaned `Age`, `Annual_Income`, `Outstanding_Debt`, `Num_of_Delayed_Payment`, `Num_of_Loan`, `Amount_invested_monthly`, `Monthly_Balance`, `Changed_Credit_Limit` (removed `_`, `,` and handled empty strings).
	  - `Credit_History_Age`: Parsed "X Years Y Months" into total months.
	- Imputation & Scaling (Numeric):
	  - `SimpleImputer(strategy='median')`
	  - `StandardScaler()`
	- Encoding (Categorical):
	  - `SimpleImputer(strategy='most_frequent')`
	  - `OneHotEncoder(handle_unknown='ignore')`
	- Target (`Credit_Score`): Label Encoded.
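
The two cleaning steps listed above could look roughly like the sketch below. It is illustrative only: the helper names `parse_numeric` and `history_age_to_months` are hypothetical, and the regular expression assumes the raw text may also contain a joining word such as "and" between the years and months parts.

```python
import re

import numpy as np
import pandas as pd

# Hypothetical pattern: accepts "5 Years 3 Months" as well as "22 Years and 1 Months".
_AGE_PATTERN = re.compile(r"(\d+)\s*Years?\D+(\d+)\s*Months?", re.IGNORECASE)


def parse_numeric(series: pd.Series) -> pd.Series:
    """Remove stray '_' and ',' characters, then convert to numbers."""
    cleaned = (
        series.astype(str)
        .str.replace("_", "", regex=False)
        .str.replace(",", "", regex=False)
        .str.strip()
    )
    # Empty strings and anything unparseable become NaN, left for the imputer.
    return pd.to_numeric(cleaned, errors="coerce")


def history_age_to_months(value) -> float:
    """Convert an 'X Years Y Months' string into total months (NaN if unparseable)."""
    if not isinstance(value, str):
        return np.nan
    match = _AGE_PATTERN.search(value)
    if match is None:
        return np.nan
    years, months = int(match.group(1)), int(match.group(2))
    return float(years * 12 + months)
```

Returning NaN for unparseable values, rather than raising, hands those rows to the median imputer described in the list.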

	### Model
	- Algorithm: Logistic Regression
	- Parameters: `max_iter=1000`, `class_weight='balanced'`, `random_state=42`
	- Validation: Stratified K-Fold Cross-Validation (5 Splits).
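
A minimal scikit-learn sketch of the preprocessing and model configuration described above follows. The function names `build_baseline` and `evaluate` are hypothetical, and three details are assumptions not stated in this document: shuffling the folds, the fold `random_state`, and one-vs-rest averaging for the multiclass ROC-AUC.

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def build_baseline(numeric_cols, categorical_cols) -> Pipeline:
    """Assemble preprocessing and logistic regression into one pipeline."""
    numeric = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ])
    categorical = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ])
    preprocess = ColumnTransformer([
        ("num", numeric, numeric_cols),
        ("cat", categorical, categorical_cols),
    ])
    model = LogisticRegression(max_iter=1000, class_weight="balanced", random_state=42)
    return Pipeline([("preprocess", preprocess), ("model", model)])


def evaluate(pipeline, X, y, n_splits=5):
    """Run stratified K-fold CV and summarize accuracy and ROC-AUC."""
    # Assumption: shuffled folds with a fixed seed; the document only says "5 splits".
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    # Assumption: one-vs-rest ROC-AUC; the averaging scheme is not documented.
    scores = cross_validate(pipeline, X, y, cv=cv, scoring=["accuracy", "roc_auc_ovr"])
    return {
        "accuracy_mean": scores["test_accuracy"].mean(),
        "accuracy_std": scores["test_accuracy"].std(),
        "roc_auc_mean": scores["test_roc_auc_ovr"].mean(),
    }
```

Keeping the imputers, scaler, and encoder inside the pipeline means they are refit on the training portion of each fold, so validation rows do not influence the preprocessing statistics.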

	## 2. Performance Results

	| Metric | Score |
	| :--- | :--- |
	| Mean Accuracy | 0.7211 (+/- 0.0020) |
	| Mean ROC-AUC | 0.8647 |

	### Fold-by-Fold Breakdown

	| Fold | Accuracy | ROC-AUC |
	| :--- | :--- | :--- |
	| Fold 1 | 0.7228 | 0.8660 |
	| Fold 2 | 0.7238 | 0.8647 |
	| Fold 3 | 0.7180 | 0.8642 |
	| Fold 4 | 0.7202 | 0.8635 |
	| Fold 5 | 0.7204 | 0.8648 |
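
As a quick consistency check, the summary figures can be recomputed from the fold values in the table. The sketch assumes the reported "+/-" is a standard deviation across folds (population form, as `numpy.std` computes by default); the document does not say which statistic it is. Because the per-fold values are already rounded to four decimals, the recomputed numbers agree with the reported ones only to within about 0.0001.

```python
import statistics

# Per-fold values copied from the fold-by-fold table.
fold_accuracy = [0.7228, 0.7238, 0.7180, 0.7202, 0.7204]
fold_roc_auc = [0.8660, 0.8647, 0.8642, 0.8635, 0.8648]

mean_accuracy = statistics.fmean(fold_accuracy)
# Assumption: population standard deviation across the five folds.
std_accuracy = statistics.pstdev(fold_accuracy)
mean_roc_auc = statistics.fmean(fold_roc_auc)

print(f"accuracy: {mean_accuracy:.4f} (+/- {std_accuracy:.4f})")
print(f"roc_auc:  {mean_roc_auc:.4f}")
```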

	## 3. Key Findings
	- Significant Improvement: Accuracy improved from ~62% to ~72.1% by correctly handling dirty numeric columns (Age, Annual_Income, etc.) and using a robust preprocessing pipeline.
	- Robustness: Across the 5 stratified folds, accuracy varies very little (+/- 0.0020), so the estimate is stable and does not depend on a particular split.
	- Strong Discrimination: ROC-AUC of 0.8647 shows the model effectively distinguishes between credit score classes despite being a simple linear model.
	- Remaining Gap: The target performance is ~80% accuracy, leaving a gap of roughly 8 percentage points from the current ~72.1%. Advanced feature engineering (e.g., customer-level aggregation, feature interactions, loan type splitting) is expected to help close it.
	- Next Steps: Implement advanced feature engineering with non-linear models (Random Forest, XGBoost) to leverage complex feature relationships.
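
One of the feature engineering ideas listed above, customer-level aggregation, might be sketched as follows. This is not part of the baseline: the function name `add_customer_aggregates`, the choice of columns, and the derived column suffixes are all hypothetical. The aggregates would need to be computed before `Customer_ID` is dropped.

```python
import pandas as pd


def add_customer_aggregates(
    df: pd.DataFrame,
    key: str = "Customer_ID",
    columns=("Monthly_Balance", "Outstanding_Debt"),
) -> pd.DataFrame:
    """Add per-customer mean and deviation-from-mean features (illustrative)."""
    out = df.copy()
    grouped = df.groupby(key)
    for col in columns:
        # Mean of this column over all rows belonging to the same customer.
        customer_mean = grouped[col].transform("mean")
        out[f"{col}_cust_mean"] = customer_mean
        # How far this row sits from the customer's own average.
        out[f"{col}_cust_dev"] = df[col] - customer_mean
    return out
```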
