# Insurance Fraud Prediction Model
This project focuses on building and evaluating a machine learning model to detect fraudulent insurance claims.
The project involves data preprocessing, model training using a RandomForestClassifier, model evaluation with
various metrics and visualizations, and a Streamlit UI for interacting with the model.
Create and activate a virtual environment:
```bash
python -m venv env
source env/bin/activate # On Windows use `env\Scripts\activate`
```
Install the required packages:
```bash
pip install -r requirements.txt
```
### Project Structure
```text
insurance-fraud-detection/
│
├── dataset/
│   └── insurance_claims.csv
│
├── model/
│   └── only_model.joblib
│
├── train.py
├── prediction.py
├── app.py
├── requirements.txt
└── README.md
```
### Data Preprocessing
#### Data Loading
The data is loaded from a CSV file located at `dataset/insurance_claims.csv`. During loading, the following steps are
performed:
- Drop the `_c39` column.
- Replace `'?'` with NaN.
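
The two loading steps above can be sketched with pandas as follows. This is a minimal illustration, not the project's actual `train.py`: the function name `load_claims` and the three sample rows are assumptions, and an in-memory CSV stands in for `dataset/insurance_claims.csv`.

```python
import io

import numpy as np
import pandas as pd

# Tiny in-memory stand-in for dataset/insurance_claims.csv (hypothetical rows).
SAMPLE_CSV = """months_as_customer,collision_type,property_damage,fraud_reported,_c39
328,Side Collision,YES,Y,
228,?,?,Y,
134,Rear Collision,NO,N,
"""


def load_claims(source):
    """Read the claims CSV, drop the `_c39` column, and turn '?' into NaN."""
    df = pd.read_csv(source)
    # errors="ignore" keeps the call safe if the column is already absent.
    df = df.drop(columns=["_c39"], errors="ignore")
    # '?' is the dataset's placeholder for a missing value.
    return df.replace("?", np.nan)


claims = load_claims(io.StringIO(SAMPLE_CSV))
print(claims.isna().sum())
```

Passing the real file path instead of the `StringIO` object would apply the same steps to the full dataset.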
#### Data Cleaning
- Fill missing values in the `property_damage`, `police_report_available`, and `collision_type` columns with each column's mode.
- Drop duplicate records.
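
A minimal sketch of these cleaning steps, assuming pandas. The helper name `clean_claims` and the sample frame are hypothetical; only the three column names come from the text above.

```python
import numpy as np
import pandas as pd

# Columns whose missing values are filled with the most frequent value.
MODE_FILLED_COLUMNS = ["property_damage", "police_report_available", "collision_type"]


def clean_claims(df):
    """Fill the listed columns with their mode, then drop duplicate rows."""
    df = df.copy()
    for column in MODE_FILLED_COLUMNS:
        # mode() ignores NaN and returns a Series; [0] is the most frequent value.
        df[column] = df[column].fillna(df[column].mode()[0])
    return df.drop_duplicates().reset_index(drop=True)


# Hypothetical sample: rows 3 and 4 are exact duplicates.
raw = pd.DataFrame({
    "property_damage": ["YES", np.nan, "NO", "NO", "NO"],
    "police_report_available": ["YES", "NO", np.nan, "NO", "NO"],
    "collision_type": ["Side Collision", np.nan, "Rear Collision",
                       "Rear Collision", "Rear Collision"],
    "total_claim_amount": [71610, 5070, 34650, 63400, 63400],
})
cleaned = clean_claims(raw)
print(cleaned)
```

Filling before deduplicating follows the order given above; rows that only become identical after filling are then removed as well.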
#### Encoding and Feature Selection
- Encode categorical variables using Label Encoding.
- Drop columns that are not relevant to the model.
- Select the final set of features for the model.
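
The encoding step might look like the sketch below, which uses scikit-learn's `LabelEncoder`. The helper `label_encode` and the sample values are assumptions for illustration. Keeping the fitted encoders is a design choice that lets the same mapping be reused at prediction time.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder


def label_encode(df, columns):
    """Return an encoded copy of df plus the fitted encoder for each column."""
    encoded = df.copy()
    encoders = {}
    for column in columns:
        encoder = LabelEncoder()
        # LabelEncoder assigns integer codes in sorted order of the class labels.
        encoded[column] = encoder.fit_transform(encoded[column].astype(str))
        encoders[column] = encoder
    return encoded, encoders


# Hypothetical sample with two categorical columns and one numeric column.
frame = pd.DataFrame({
    "incident_severity": ["Major Damage", "Minor Damage", "Total Loss", "Minor Damage"],
    "fraud_reported": ["Y", "N", "N", "N"],
    "total_claim_amount": [71610, 5070, 34650, 63400],
})
encoded, encoders = label_encode(frame, ["incident_severity", "fraud_reported"])
print(encoded)
print(list(encoders["incident_severity"].classes_))
```

scikit-learn documents `LabelEncoder` as intended for target labels; `OrdinalEncoder` is the usual alternative for feature columns if the project is ever refactored.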
#### Preprocessed Features
The final set of features used for model training:
- `incident_severity`
- `insured_hobbies`
- `total_claim_amount`
- `months_as_customer`
- `policy_annual_premium`
- `incident_date`
- `capital-loss`
- `capital-gains`
- `insured_education_level`
- `incident_city`
- `fraud_reported` (target variable)
### Model Training
The model is a RandomForestClassifier wrapped in a pipeline that also contains the preprocessing steps.
Hyperparameters are tuned with GridSearchCV.
#### Training Steps
- Train-test split: The data is split into training and testing sets with a 70-30 split.
- Pipeline setup: A pipeline is created to include preprocessing and model training.
- Hyperparameter tuning: A grid search is performed to find the best hyperparameters.
- Model training: The best model is trained on the training data.
- Model saving: The trained model is saved as `fraud_insurance_pipeline.joblib`.
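
The training steps above can be sketched end to end with scikit-learn. Everything specific here is an assumption: the data is synthetic (`make_classification`), the `StandardScaler` step stands in for whatever preprocessing the real pipeline uses, and the parameter grid, `cv=3`, `scoring="f1"`, and the stratified split are illustrative choices rather than the project's settings. The model is written to a temporary directory so the example does not touch the `model/` folder.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the ten encoded claim features (not the real dataset).
X, y = make_classification(n_samples=300, n_features=10, weights=[0.75], random_state=42)

# 70-30 split; stratify keeps the class ratio equal in both sets (an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Preprocessing and the classifier live in one pipeline, so both are tuned and saved together.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier(random_state=42)),
])

# Hypothetical grid; keys use the "<step>__<parameter>" naming that Pipeline expects.
param_grid = {
    "classifier__n_estimators": [50, 100],
    "classifier__max_depth": [None, 5],
}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring="f1", n_jobs=1)
search.fit(X_train, y_train)

# With the default refit=True, best_estimator_ is already refit on all of X_train.
best_model = search.best_estimator_

model_path = os.path.join(tempfile.mkdtemp(), "fraud_insurance_pipeline.joblib")
joblib.dump(best_model, model_path)
print(search.best_params_, model_path)
```

Saving the whole pipeline rather than the bare classifier means `joblib.load(model_path).predict(...)` applies the same preprocessing at inference time.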
### Model Evaluation
The trained model is evaluated using the test set. The evaluation metrics include:
- Classification Report: Precision, Recall, F1-score.
- AUC Score: Area Under the ROC Curve.
- Confusion Matrix: Visual representation of true vs. predicted values.
- ROC Curve: Receiver Operating Characteristic curve.
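
The four metrics listed above map onto functions in `sklearn.metrics`. The sketch below computes them on synthetic data with a small stand-in model; the dataset, the model settings, and the variable names are assumptions, and plotting is left out so that the example runs without a display.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    roc_auc_score,
    roc_curve,
)
from sklearn.model_selection import train_test_split

# Synthetic stand-in data and a small model, only to have something to evaluate.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

y_pred = model.predict(X_test)               # hard class labels
y_score = model.predict_proba(X_test)[:, 1]  # probability of the positive class

# Precision, recall, and F1-score per class.
print(classification_report(y_test, y_pred))

# AUC and the ROC curve are computed from scores, not from hard labels.
auc = roc_auc_score(y_test, y_score)
fpr, tpr, thresholds = roc_curve(y_test, y_score)

# Rows are true classes, columns are predicted classes.
matrix = confusion_matrix(y_test, y_pred)
print(f"AUC: {auc:.3f}")
print(matrix)
```

For the visual versions, scikit-learn provides `ConfusionMatrixDisplay.from_predictions` and `RocCurveDisplay.from_predictions`, both of which require matplotlib.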
### Usage
#### Training the Model
To train the model, run the following command:
```bash
python train.py
```
#### Evaluating the Model
To evaluate the model, run the following command:
```bash
python prediction.py
```
#### Running the Streamlit App
To run the Streamlit app, use the following command:
```bash
streamlit run app.py
```