YashChowdhary committed
Commit b61d076 · verified · 1 Parent(s): 11db810

Upload 5 files

Files changed (5):
  1. BEGINNER_GUIDE.md +360 -0
  2. app.py +742 -0
  3. requirements.txt +20 -0
  4. test.csv +0 -0
  5. train.csv +0 -0
BEGINNER_GUIDE.md ADDED
@@ -0,0 +1,360 @@
+ # 🚀 Complete Beginner's Guide: Deploying Auto Insurance Fraud Detection on Hugging Face
+
+ This guide walks you through every step of setting up and running the fraud detection project. No prior experience with Hugging Face is required.
+
+ ---
+
+ ## Table of Contents
+ 1. [What You'll Need](#what-youll-need)
+ 2. [Step 1: Create a Hugging Face Account](#step-1-create-a-hugging-face-account)
+ 3. [Step 2: Create a New Space](#step-2-create-a-new-space)
+ 4. [Step 3: Upload Your Files](#step-3-upload-your-files)
+ 5. [Step 4: Wait for Build](#step-4-wait-for-build)
+ 6. [Step 5: Use Your App](#step-5-use-your-app)
+ 7. [Troubleshooting Common Issues](#troubleshooting-common-issues)
+ 8. [Running Locally (Alternative)](#running-locally-alternative)
+ 9. [Understanding the Output](#understanding-the-output)
+
+ ---
+
+ ## What You'll Need
+
+ Before starting, make sure you have these 5 files ready in a folder on your computer:
+
+ | File | Description | Size (approx) |
+ |------|-------------|---------------|
+ | `app.py` | The main Python application code | ~25 KB |
+ | `requirements.txt` | List of Python packages needed | ~300 bytes |
+ | `train.csv` | Training dataset | ~1.9 MB |
+ | `test.csv` | Test dataset | ~470 KB |
+ | `README.md` | Documentation for the Space | ~4 KB |
+
+ You should also have:
+ - `report.docx` - The APA report (keep this separate, not uploaded to Hugging Face)
+ - This guide (`BEGINNER_GUIDE.md`) for reference
+
+ ---
+
+ ## Step 1: Create a Hugging Face Account
+
+ 1. **Go to Hugging Face**: Open your browser and visit [https://huggingface.co](https://huggingface.co)
+
+ 2. **Click "Sign Up"**: Look for the button in the top-right corner
+
+ 3. **Fill in your details**:
+    - Username: Choose something memorable (e.g., `yourname_student`)
+    - Email: Use your school or personal email
+    - Password: Make it secure
+
+ 4. **Verify your email**: Check your inbox and click the verification link
+
+ 5. **Complete your profile** (optional but recommended)
+
+ ---
+
+ ## Step 2: Create a New Space
+
+ Hugging Face "Spaces" are where you host web applications. Here's how to create one:
+
+ 1. **Go to Spaces**: After logging in, click on your profile picture → "New Space"
+
+    Or directly visit: [https://huggingface.co/new-space](https://huggingface.co/new-space)
+
+ 2. **Configure your Space**:
+
+ | Setting | What to Enter |
+ |---------|---------------|
+ | **Space name** | `fraud-detection` (or any name you like) |
+ | **License** | MIT (allows others to use your code) |
+ | **SDK** | Select **Gradio** |
+ | **SDK Version** | Leave as default (or select 4.19.2) |
+ | **Hardware** | **CPU basic** (free tier) |
+ | **Visibility** | Public (or Private if you prefer) |
+
+ 3. **Click "Create Space"**
+
+ You now have an empty Space! It will show an error because there's no code yet—that's normal.
+
+ ---
+
+ ## Step 3: Upload Your Files
+
+ You have two options for uploading files:
+
+ ### Option A: Web Interface (Easiest)
+
+ 1. **Go to your Space**: Click on your Space name (e.g., `yourusername/fraud-detection`)
+
+ 2. **Click "Files and versions"** tab
+
+ 3. **Click "+ Add file"** → **"Upload files"**
+
+ 4. **Upload these files one by one or all together**:
+    - `app.py`
+    - `requirements.txt`
+    - `train.csv`
+    - `test.csv`
+    - `README.md`
+
+ 5. **Commit the changes**: After each upload (or batch), you'll see a "Commit" button. Click it.
+
+ ⚠️ **Important**: Upload the README.md file. The one you created should replace any default README.
+
+ ### Option B: Git (For more advanced users)
+
+ If you're familiar with Git, you can clone and push:
+
+ ```bash
+ # Clone your space
+ git clone https://huggingface.co/spaces/YOUR_USERNAME/fraud-detection
+ cd fraud-detection
+
+ # Copy your files into this folder
+ cp /path/to/your/files/* .
+
+ # Add, commit, and push
+ git add .
+ git commit -m "Initial upload of fraud detection app"
+ git push
+ ```
+
+ ---
+
+ ## Step 4: Wait for Build
+
+ After uploading, Hugging Face automatically builds your app. Here's what happens:
+
+ 1. **Building** (1-3 minutes): The status shows "Building"
+    - It installs packages from `requirements.txt`
+    - It prepares the environment
+
+ 2. **Running** (3-5 minutes the first time): The status shows "Running"
+    - Your `app.py` code executes
+    - Models are trained
+    - The interface loads
+
+ 3. **App Ready**: You'll see your app interface!
+
+ ### What the logs show
+
+ You can click "Logs" to see what's happening:
+
+ ```
+ Loading data...
+ Applying SMOTE to handle class imbalance...
+ Training models (this may take a moment)...
+ Training XGBoost...
+ Training LightGBM...
+ Training Random Forest...
+ Training Logistic Regression...
+ Models trained successfully!
+ Running on local URL: http://0.0.0.0:7860
+ ```
+
+ When you see that last line, your app is ready!
+
+ ---
+
+ ## Step 5: Use Your App
+
+ Once your app is running, you'll see an interactive interface with 5 tabs:
+
+ ### Tab 1: 📊 Data Overview
+ - Shows dataset statistics
+ - Displays class distribution pie charts
+ - Explains the imbalance problem
+
+ ### Tab 2: 🔍 Model Evaluation
+ - **Select a model** from the dropdown (XGBoost, LightGBM, Random Forest, or Logistic Regression)
+ - **Select a visualization** to see:
+   - Precision-Recall Curve
+   - ROC Curve
+   - Confusion Matrix
+   - Feature Importance
+   - Threshold Analysis
+ - View performance metrics and classification report
+
+ ### Tab 3: 📈 Compare Models
+ - Side-by-side comparison of all 4 models
+ - Bar chart showing metrics
+ - Table with best model for each metric
+
+ ### Tab 4: ⚖️ Threshold Optimization
+ - Interactive plot showing precision/recall trade-off
+ - Table of optimal thresholds for each model
+ - Explains why 0.5 isn't always the best threshold
+
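+ The trade-off this tab visualizes can be sketched in a few lines of plain Python. The probability scores below are made up for illustration, not the app's real data:
+
+ ```python
+ # Toy true labels and fraud probabilities (hypothetical values for illustration)
+ y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
+ y_proba = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.55, 0.45, 0.60, 0.90]
+
+ def f1_at_threshold(y_true, y_proba, threshold):
+     """Compute F1 for hard predictions made at a given threshold."""
+     y_pred = [1 if p >= threshold else 0 for p in y_proba]
+     tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
+     fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
+     fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
+     if tp == 0:
+         return 0.0
+     precision = tp / (tp + fp)
+     recall = tp / (tp + fn)
+     return 2 * precision * recall / (precision + recall)
+
+ # Sweep thresholds and keep the best F1 - on this toy data the winner is 0.45, not 0.5
+ best = max((f1_at_threshold(y_true, y_proba, t / 100), t / 100)
+            for t in range(10, 90))
+ print(best)
+ ```
+
+ The app does the same sweep with `numpy` and the real test set; the point is only that the best threshold falls out of the sweep, not out of the 0.5 default.
+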
+ ### Tab 5: ℹ️ About
+ - Project documentation
+ - Technical details
+ - Metrics explanations
+
+ ---
+
+ ## Troubleshooting Common Issues
+
+ ### Issue 1: "Application Error" or Build Fails
+
+ **Possible causes and solutions**:
+
+ | Problem | Solution |
+ |---------|----------|
+ | Missing file | Check all 5 files are uploaded |
+ | Wrong filename | Files must be exactly: `app.py`, `requirements.txt`, `train.csv`, `test.csv`, `README.md` |
+ | Corrupted CSV | Re-download the CSV files and upload again |
+ | Package conflict | Check the logs for specific error messages |
+
+ ### Issue 2: "Out of Memory" Error
+
+ The free tier has limited memory. This shouldn't happen with our code, but if it does:
+ - Reduce `n_estimators` in the models (e.g., from 100 to 50)
+ - The app automatically uses efficient settings for the free tier
+
+ ### Issue 3: App Takes Too Long to Load
+
+ Normal behavior! The first load takes 3-5 minutes because:
+ - It needs to install packages
+ - It trains 4 machine learning models
+
+ Subsequent visits are faster because of caching.
+
+ ### Issue 4: Graphs Don't Update
+
+ - Try clicking the dropdown again
+ - Refresh the page (Ctrl+R or Cmd+R)
+ - Wait a few seconds—processing takes time
+
+ ### Issue 5: "No such file: train.csv"
+
+ The CSV files weren't uploaded correctly:
+ 1. Go to "Files and versions"
+ 2. Verify `train.csv` and `test.csv` are listed
+ 3. If not, upload them again
+ 4. Make sure filenames are lowercase
+
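+ If you also want to rule out a corrupted file before re-uploading, a quick stdlib check is enough. This helper is not part of `app.py`; it's a hypothetical snippet you can run locally:
+
+ ```python
+ import csv
+
+ def has_fraud_column(path):
+     """Return True if the CSV's header row contains a 'fraud' column."""
+     with open(path, newline="") as f:
+         header = next(csv.reader(f), [])
+     return "fraud" in header
+ ```
+
+ Running `has_fraud_column("train.csv")` from the folder with your files should return `True`; if it doesn't, the CSV you're about to upload is the wrong one.
+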
+ ---
+
+ ## Running Locally (Alternative)
+
+ If you prefer to run on your own computer instead of Hugging Face:
+
+ ### Step 1: Install Python
+ Make sure you have Python 3.8+ installed. Check with:
+ ```bash
+ python --version
+ ```
+
+ ### Step 2: Create a Project Folder
+ ```bash
+ mkdir fraud_detection
+ cd fraud_detection
+ ```
+
+ ### Step 3: Copy All Files
+ Place these files in your folder:
+ - `app.py`
+ - `requirements.txt`
+ - `train.csv`
+ - `test.csv`
+
+ ### Step 4: Create a Virtual Environment (Recommended)
+ ```bash
+ # Create virtual environment
+ python -m venv venv
+
+ # Activate it
+ # On Windows:
+ venv\Scripts\activate
+ # On Mac/Linux:
+ source venv/bin/activate
+ ```
+
+ ### Step 5: Install Dependencies
+ ```bash
+ pip install -r requirements.txt
+ ```
+
+ This may take 2-5 minutes to download and install all packages.
+
+ ### Step 6: Run the App
+ ```bash
+ python app.py
+ ```
+
+ ### Step 7: Open in Browser
+ You'll see output like:
+ ```
+ Running on local URL: http://127.0.0.1:7860
+ ```
+ Open that URL in your browser.
+
+ ---
+
+ ## Understanding the Output
+
+ ### What the Metrics Mean
+
+ | Metric | What It Tells You | Good Value |
+ |--------|------------------|------------|
+ | **Accuracy** | Overall correct predictions | >95% (but misleading for imbalanced data) |
+ | **Precision** | When we say "fraud", how often are we right? | >50% is good |
+ | **Recall** | Of all actual frauds, how many did we catch? | >70% is good |
+ | **F1 Score** | Balance of precision and recall | >0.5 is decent, >0.6 is good |
+ | **ROC AUC** | Overall discrimination ability | >0.9 is excellent |
+
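+ The accuracy caveat in the table is easy to demonstrate: a "model" that calls every claim legitimate looks accurate on imbalanced data while catching zero fraud. A pure-Python sketch with a made-up 3% fraud rate:
+
+ ```python
+ # 1000 claims, 3% fraud - mirrors the dataset's rough imbalance
+ y_true = [1] * 30 + [0] * 970
+ y_pred = [0] * 1000          # "predict legitimate for everything"
+
+ accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
+ caught = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
+ recall = caught / 30
+
+ print(accuracy)  # 0.97 - looks great
+ print(recall)    # 0.0  - catches no fraud at all
+ ```
+
+ This is why the app reports precision, recall, and F1 alongside accuracy.
+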
+ ### Reading the Confusion Matrix
+
+ ```
+                  Predicted
+                Legit   Fraud
+ Actual  Legit  [TN]    [FP]
+         Fraud  [FN]    [TP]
+ ```
+
+ - **TN (True Negative)**: Correctly identified legitimate claims ✓
+ - **FP (False Positive)**: Legitimate claims wrongly flagged as fraud ✗
+ - **FN (False Negative)**: Frauds we missed ✗
+ - **TP (True Positive)**: Correctly caught frauds ✓
+
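+ All the metrics in the table above fall straight out of these four counts. For example, with hypothetical counts TN=940, FP=30, FN=10, TP=20:
+
+ ```python
+ tn, fp, fn, tp = 940, 30, 10, 20    # hypothetical confusion-matrix counts
+
+ precision = tp / (tp + fp)           # 20/50: of claims we flag, 40% are real fraud
+ recall = tp / (tp + fn)              # 20/30: we catch about 67% of actual frauds
+ f1 = 2 * precision * recall / (precision + recall)
+ accuracy = (tp + tn) / (tp + tn + fp + fn)
+
+ print(round(precision, 3), round(recall, 3), round(f1, 3), round(accuracy, 3))
+ ```
+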
+ ### Interpreting Feature Importance
+
+ The feature importance plot shows which variables most influence the model's decisions:
+
+ - **High importance features** = strong predictors of fraud
+ - For example, `total_claim_amount` being important means higher claims correlate with fraud
+ - Use this to understand what patterns the model learned
+
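+ Under the hood the plot is just a ranking: pair each feature with its score and sort. A toy sketch with made-up names and scores (the app reads the real scores from each trained model and shows the top 15):
+
+ ```python
+ # Hypothetical (feature, importance) pairs for illustration only
+ importances = {
+     "total_claim_amount": 0.31,
+     "incident_severity": 0.22,
+     "policy_annual_premium": 0.08,
+     "age": 0.05,
+     "vehicle_claim": 0.18,
+ }
+
+ # Sort descending by score and keep the top entries, as the plot's bars do
+ top3 = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)[:3]
+ print(top3)
+ ```
+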
+ ---
+
+ ## File Checklist
+
+ Before deploying, verify you have all files:
+
+ - [ ] `app.py` - Main application (~650 lines of code)
+ - [ ] `requirements.txt` - Package list (11 packages)
+ - [ ] `train.csv` - 16,001 rows including header
+ - [ ] `test.csv` - 4,001 rows including header
+ - [ ] `README.md` - Documentation for the Space
+
+ Keep separately (don't upload to Hugging Face):
+ - [ ] `report.docx` - Your APA report for submission
+ - [ ] `BEGINNER_GUIDE.md` - This guide
+
+ ---
+
+ ## Tips for Success
+
+ 1. **Be patient**: First build takes a few minutes
+ 2. **Check logs**: They tell you exactly what's happening
+ 3. **Verify uploads**: Make sure file sizes match expectations
+ 4. **Use the right SDK**: Must be Gradio, not Streamlit
+ 5. **Test locally first**: If something doesn't work, debug locally before deploying
+
+ ---
+
+ ## Need Help?
+
+ - **Hugging Face Documentation**: [https://huggingface.co/docs/hub/spaces](https://huggingface.co/docs/hub/spaces)
+ - **Gradio Guides**: [https://gradio.app/guides](https://gradio.app/guides)
+ - **Common Issues**: Check the "Troubleshooting" section above
+
+ Good luck with your fraud detection project! 🎉
app.py ADDED
@@ -0,0 +1,742 @@
1
+ """
2
+ Auto Insurance Claims Fraud Detection
3
+ =====================================
4
+ A machine learning application that trains and compares 4 different models
5
+ for detecting fraudulent insurance claims.
6
+
7
+ Models: XGBoost, LightGBM, Random Forest, Logistic Regression
8
+ Author: Data Science Project
9
+ """
10
+
11
+ import gradio as gr
12
+ import pandas as pd
13
+ import numpy as np
14
+ import matplotlib.pyplot as plt
15
+ import seaborn as sns
16
+ from io import BytesIO
17
+ import base64
18
+ import warnings
19
+ warnings.filterwarnings('ignore')
20
+
21
+ # ML Libraries
22
+ from sklearn.model_selection import cross_val_score
23
+ from sklearn.metrics import (
24
+ precision_recall_curve, roc_curve, auc,
25
+ confusion_matrix, classification_report,
26
+ f1_score, precision_score, recall_score, accuracy_score
27
+ )
28
+ from sklearn.linear_model import LogisticRegression
29
+ from sklearn.ensemble import RandomForestClassifier
30
+ from xgboost import XGBClassifier
31
+ from lightgbm import LGBMClassifier
32
+ from imblearn.over_sampling import SMOTE
33
+
34
+ # Set style for all plots - using try/except for compatibility
35
+ try:
36
+ plt.style.use('seaborn-v0_8-whitegrid')
37
+ except:
38
+ try:
39
+ plt.style.use('seaborn-whitegrid')
40
+ except:
41
+ plt.style.use('ggplot') # Fallback style
42
+ sns.set_palette("husl")
43
+
44
+ # ============================================================================
45
+ # DATA LOADING AND PREPROCESSING
46
+ # ============================================================================
47
+
48
+ def load_and_prepare_data():
49
+ """
50
+ Load the train and test datasets.
51
+ The data is already preprocessed and one-hot encoded.
52
+ """
53
+ # Load datasets
54
+ train_df = pd.read_csv('train.csv')
55
+ test_df = pd.read_csv('test.csv')
56
+
57
+ # Separate features and target
58
+ # 'fraud' is our target variable (0 = legitimate, 1 = fraudulent)
59
+ X_train = train_df.drop('fraud', axis=1)
60
+ y_train = train_df['fraud']
61
+ X_test = test_df.drop('fraud', axis=1)
62
+ y_test = test_df['fraud']
63
+
64
+ return X_train, X_test, y_train, y_test, train_df, test_df
65
+
66
+
67
+ def apply_smote(X_train, y_train):
68
+ """
69
+ Apply SMOTE to handle class imbalance.
70
+ Fraud cases are rare (~3%), so we oversample the minority class.
71
+ """
72
+ smote = SMOTE(random_state=42)
73
+ X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
74
+ return X_resampled, y_resampled
75
+
76
+
77
+ # ============================================================================
78
+ # MODEL DEFINITIONS
79
+ # ============================================================================
80
+
81
+ def get_models():
82
+ """
83
+ Define the 4 models we'll compare.
84
+ Each model is tuned for imbalanced fraud detection.
85
+ """
86
+ models = {
87
+ 'XGBoost': XGBClassifier(
88
+ n_estimators=100,
89
+ max_depth=4,
90
+ learning_rate=0.1,
91
+ scale_pos_weight=10, # Helps with imbalanced data
92
+ random_state=42,
93
+ use_label_encoder=False,
94
+ eval_metric='logloss'
95
+ ),
96
+ 'LightGBM': LGBMClassifier(
97
+ n_estimators=100,
98
+ max_depth=4,
99
+ learning_rate=0.1,
100
+ class_weight='balanced', # Handles imbalance internally
101
+ random_state=42,
102
+ verbose=-1
103
+ ),
104
+ 'Random Forest': RandomForestClassifier(
105
+ n_estimators=100,
106
+ max_depth=6,
107
+ class_weight='balanced',
108
+ random_state=42,
109
+ n_jobs=-1
110
+ ),
111
+ 'Logistic Regression': LogisticRegression(
112
+ class_weight='balanced',
113
+ max_iter=1000,
114
+ random_state=42
115
+ )
116
+ }
117
+ return models
118
+
119
+
120
+ # ============================================================================
121
+ # MODEL TRAINING AND EVALUATION
122
+ # ============================================================================
123
+
124
+ def train_model(model, X_train, y_train):
125
+ """Train a single model and return the fitted model."""
126
+ model.fit(X_train, y_train)
127
+ return model
128
+
129
+
130
+ def evaluate_model(model, X_test, y_test):
131
+ """
132
+ Get predictions and probabilities from a trained model.
133
+ Returns both hard predictions and probability scores.
134
+ """
135
+ y_pred = model.predict(X_test)
136
+ y_proba = model.predict_proba(X_test)[:, 1] # Probability of fraud
137
+ return y_pred, y_proba
138
+
139
+
140
+ def get_metrics(y_test, y_pred, y_proba):
141
+ """
142
+ Calculate all relevant metrics for fraud detection.
143
+ For imbalanced data, we focus on Precision, Recall, and F1.
144
+ """
145
+ metrics = {
146
+ 'Accuracy': accuracy_score(y_test, y_pred),
147
+ 'Precision': precision_score(y_test, y_pred, zero_division=0),
148
+ 'Recall': recall_score(y_test, y_pred, zero_division=0),
149
+ 'F1 Score': f1_score(y_test, y_pred, zero_division=0),
150
+ 'ROC AUC': auc(*roc_curve(y_test, y_proba)[:2])
151
+ }
152
+ return metrics
153
+
154
+
155
+ def find_optimal_threshold(y_test, y_proba):
156
+ """
157
+ Find the optimal classification threshold using F1 score.
158
+ Default threshold is 0.5, but for imbalanced data,
159
+ a different threshold often works better.
160
+ """
161
+ thresholds = np.arange(0.1, 0.9, 0.01)
162
+ f1_scores = []
163
+
164
+ for thresh in thresholds:
165
+ y_pred_thresh = (y_proba >= thresh).astype(int)
166
+ f1 = f1_score(y_test, y_pred_thresh, zero_division=0)
167
+ f1_scores.append(f1)
168
+
169
+ # Find threshold with best F1 score
170
+ best_idx = np.argmax(f1_scores)
171
+ best_threshold = thresholds[best_idx]
172
+ best_f1 = f1_scores[best_idx]
173
+
174
+ return best_threshold, best_f1, thresholds, f1_scores
175
+
176
+
177
+ # ============================================================================
178
+ # VISUALIZATION FUNCTIONS
179
+ # ============================================================================
180
+
181
+ def plot_precision_recall_curve(y_test, y_proba, model_name):
182
+ """
183
+ Plot Precision-Recall curve.
184
+ This is the most important metric for fraud detection because
185
+ we care about catching frauds (recall) without too many false alarms (precision).
186
+ """
187
+ precision, recall, thresholds = precision_recall_curve(y_test, y_proba)
188
+ pr_auc = auc(recall, precision)
189
+
190
+ fig, ax = plt.subplots(figsize=(8, 6))
191
+ ax.plot(recall, precision, 'b-', linewidth=2, label=f'{model_name} (AUC = {pr_auc:.3f})')
192
+ ax.fill_between(recall, precision, alpha=0.2)
193
+
194
+ # Add baseline (random classifier)
195
+ baseline = y_test.mean()
196
+ ax.axhline(y=baseline, color='r', linestyle='--', label=f'Baseline = {baseline:.3f}')
197
+
198
+ ax.set_xlabel('Recall (Fraud Detection Rate)', fontsize=12)
199
+ ax.set_ylabel('Precision (True Fraud Rate)', fontsize=12)
200
+ ax.set_title(f'Precision-Recall Curve: {model_name}', fontsize=14, fontweight='bold')
201
+ ax.legend(loc='best')
202
+ ax.set_xlim([0, 1])
203
+ ax.set_ylim([0, 1])
204
+ ax.grid(True, alpha=0.3)
205
+
206
+ plt.tight_layout()
207
+ return fig
208
+
209
+
210
+ def plot_roc_curve(y_test, y_proba, model_name):
211
+ """
212
+ Plot ROC curve showing true positive rate vs false positive rate.
213
+ AUC closer to 1 means better discrimination between fraud and legitimate claims.
214
+ """
215
+ fpr, tpr, thresholds = roc_curve(y_test, y_proba)
216
+ roc_auc = auc(fpr, tpr)
217
+
218
+ fig, ax = plt.subplots(figsize=(8, 6))
219
+ ax.plot(fpr, tpr, 'b-', linewidth=2, label=f'{model_name} (AUC = {roc_auc:.3f})')
220
+ ax.fill_between(fpr, tpr, alpha=0.2)
221
+
222
+ # Random classifier line
223
+ ax.plot([0, 1], [0, 1], 'r--', label='Random Classifier')
224
+
225
+ ax.set_xlabel('False Positive Rate', fontsize=12)
226
+ ax.set_ylabel('True Positive Rate (Recall)', fontsize=12)
227
+ ax.set_title(f'ROC Curve: {model_name}', fontsize=14, fontweight='bold')
228
+ ax.legend(loc='lower right')
229
+ ax.set_xlim([0, 1])
230
+ ax.set_ylim([0, 1])
231
+ ax.grid(True, alpha=0.3)
232
+
233
+ plt.tight_layout()
234
+ return fig
235
+
236
+
237
+ def plot_confusion_matrix(y_test, y_pred, model_name):
238
+ """
239
+ Plot confusion matrix as a heatmap.
240
+ Shows: True Negatives, False Positives, False Negatives, True Positives.
241
+ """
242
+ cm = confusion_matrix(y_test, y_pred)
243
+
244
+ fig, ax = plt.subplots(figsize=(8, 6))
245
+
246
+ # Create heatmap with custom colors
247
+ sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=ax,
248
+ xticklabels=['Legitimate', 'Fraud'],
249
+ yticklabels=['Legitimate', 'Fraud'],
250
+ annot_kws={'size': 16})
251
+
252
+ ax.set_xlabel('Predicted Label', fontsize=12)
253
+ ax.set_ylabel('True Label', fontsize=12)
254
+ ax.set_title(f'Confusion Matrix: {model_name}', fontsize=14, fontweight='bold')
255
+
256
+ # Add text annotations explaining the quadrants
257
+ total = cm.sum()
258
+ tn, fp, fn, tp = cm.ravel()
259
+
260
+ text = f"TN: {tn} | FP: {fp}\nFN: {fn} | TP: {tp}"
261
+ ax.text(1.4, 0.5, text, transform=ax.transAxes, fontsize=10,
262
+ verticalalignment='center', bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))
263
+
264
+ plt.tight_layout()
265
+ return fig
266
+
267
+
268
+ def plot_feature_importance(model, feature_names, model_name):
269
+ """
270
+ Plot top 15 most important features.
271
+ Different models calculate importance differently:
272
+ - Tree models: based on split gain
273
+ - Logistic Regression: based on coefficient magnitude
274
+ """
275
+ fig, ax = plt.subplots(figsize=(10, 8))
276
+
277
+ # Get feature importances based on model type
278
+ if hasattr(model, 'feature_importances_'):
279
+ importances = model.feature_importances_
280
+ elif hasattr(model, 'coef_'):
281
+ importances = np.abs(model.coef_[0])
282
+ else:
283
+ # Fallback: return empty plot
284
+ ax.text(0.5, 0.5, 'Feature importance not available',
285
+ ha='center', va='center', fontsize=14)
286
+ return fig
287
+
288
+ # Create dataframe and sort by importance
289
+ importance_df = pd.DataFrame({
290
+ 'Feature': feature_names,
291
+ 'Importance': importances
292
+ }).sort_values('Importance', ascending=True).tail(15)
293
+
294
+ # Horizontal bar chart
295
+ colors = plt.cm.Blues(np.linspace(0.4, 0.8, len(importance_df)))
296
+ ax.barh(importance_df['Feature'], importance_df['Importance'], color=colors)
297
+
298
+ ax.set_xlabel('Importance Score', fontsize=12)
299
+ ax.set_title(f'Top 15 Feature Importances: {model_name}', fontsize=14, fontweight='bold')
300
+ ax.grid(True, alpha=0.3, axis='x')
301
+
302
+ plt.tight_layout()
303
+ return fig
304
+
305
+
306
+ def plot_threshold_analysis(y_test, y_proba, model_name):
307
+ """
308
+ Plot how different thresholds affect precision, recall, and F1.
309
+ Helps visualize the trade-off and find the optimal threshold.
310
+ """
311
+ thresholds = np.arange(0.05, 0.95, 0.01)
312
+ precisions = []
313
+ recalls = []
314
+ f1_scores = []
315
+
316
+ for thresh in thresholds:
317
+ y_pred_thresh = (y_proba >= thresh).astype(int)
318
+ precisions.append(precision_score(y_test, y_pred_thresh, zero_division=0))
319
+ recalls.append(recall_score(y_test, y_pred_thresh, zero_division=0))
320
+ f1_scores.append(f1_score(y_test, y_pred_thresh, zero_division=0))
321
+
322
+ # Find optimal threshold
323
+ best_idx = np.argmax(f1_scores)
324
+ best_threshold = thresholds[best_idx]
325
+
326
+ fig, ax = plt.subplots(figsize=(10, 6))
327
+
328
+ ax.plot(thresholds, precisions, 'b-', linewidth=2, label='Precision')
329
+ ax.plot(thresholds, recalls, 'g-', linewidth=2, label='Recall')
330
+ ax.plot(thresholds, f1_scores, 'r-', linewidth=2, label='F1 Score')
331
+
332
+ # Mark optimal threshold
333
+ ax.axvline(x=best_threshold, color='purple', linestyle='--',
334
+ label=f'Optimal Threshold = {best_threshold:.2f}')
335
+ ax.axvline(x=0.5, color='gray', linestyle=':', alpha=0.7, label='Default (0.5)')
336
+
337
+ ax.set_xlabel('Classification Threshold', fontsize=12)
338
+ ax.set_ylabel('Score', fontsize=12)
339
+ ax.set_title(f'Threshold Analysis: {model_name}', fontsize=14, fontweight='bold')
340
+ ax.legend(loc='best')
341
+ ax.set_xlim([0, 1])
342
+ ax.set_ylim([0, 1])
343
+ ax.grid(True, alpha=0.3)
344
+
345
+ plt.tight_layout()
346
+ return fig
347
+
348
+
349
+ def plot_class_distribution(train_df, test_df):
350
+ """
351
+ Plot the class distribution showing imbalance.
352
+ Fraud is rare (~3%) which is typical in real-world scenarios.
353
+ """
354
+ fig, axes = plt.subplots(1, 2, figsize=(12, 5))
355
+
356
+ # Training data distribution
357
+ train_counts = train_df['fraud'].value_counts()
358
+ colors = ['#2ecc71', '#e74c3c']
359
+ axes[0].pie(train_counts, labels=['Legitimate', 'Fraud'], autopct='%1.1f%%',
360
+ colors=colors, explode=(0, 0.1), shadow=True, startangle=90)
361
+ axes[0].set_title('Training Data Distribution', fontsize=14, fontweight='bold')
362
+
363
+ # Test data distribution
364
+ test_counts = test_df['fraud'].value_counts()
365
+ axes[1].pie(test_counts, labels=['Legitimate', 'Fraud'], autopct='%1.1f%%',
366
+ colors=colors, explode=(0, 0.1), shadow=True, startangle=90)
367
+ axes[1].set_title('Test Data Distribution', fontsize=14, fontweight='bold')
368
+
369
+ plt.suptitle('Class Imbalance in Fraud Detection Dataset', fontsize=16, fontweight='bold', y=1.02)
370
+ plt.tight_layout()
371
+ return fig
372
+
373
+
374
+ def plot_model_comparison(all_metrics):
375
+ """
376
+ Bar chart comparing all 4 models across different metrics.
377
+ """
378
+ fig, ax = plt.subplots(figsize=(12, 6))
379
+
380
+ models = list(all_metrics.keys())
381
+ metrics = ['Accuracy', 'Precision', 'Recall', 'F1 Score', 'ROC AUC']
382
+
383
+ x = np.arange(len(metrics))
384
+ width = 0.2
385
+
386
+ colors = ['#3498db', '#2ecc71', '#e74c3c', '#9b59b6']
387
+
388
+ for i, model in enumerate(models):
389
+ values = [all_metrics[model][m] for m in metrics]
390
+ ax.bar(x + i*width, values, width, label=model, color=colors[i])
391
+
392
+ ax.set_ylabel('Score', fontsize=12)
393
+ ax.set_title('Model Performance Comparison', fontsize=14, fontweight='bold')
394
+ ax.set_xticks(x + width * 1.5)
395
+ ax.set_xticklabels(metrics)
396
+ ax.legend(loc='upper left', bbox_to_anchor=(1, 1))
397
+ ax.set_ylim([0, 1])
398
+ ax.grid(True, alpha=0.3, axis='y')
399
+
400
+ # Add value labels on bars
401
+ for i, model in enumerate(models):
402
+ values = [all_metrics[model][m] for m in metrics]
403
+ for j, v in enumerate(values):
404
+ ax.text(x[j] + i*width, v + 0.02, f'{v:.2f}', ha='center', va='bottom', fontsize=8)
405
+
406
+ plt.tight_layout()
407
+ return fig
408
+
409
+
410
+ # ============================================================================
411
+ # GLOBAL VARIABLES (loaded once at startup)
412
+ # ============================================================================
413
+
414
+ print("Loading data...")
415
+ X_train, X_test, y_train, y_test, train_df, test_df = load_and_prepare_data()
416
+
417
+ print("Applying SMOTE to handle class imbalance...")
418
+ X_train_balanced, y_train_balanced = apply_smote(X_train, y_train)
419
+
420
+ print("Training models (this may take a moment)...")
421
+ models = get_models()
422
+ trained_models = {}
423
+ all_metrics = {}
424
+ all_predictions = {}
425
+ all_probabilities = {}
426
+
427
+ for name, model in models.items():
428
+ print(f" Training {name}...")
429
+     trained_models[name] = train_model(model, X_train_balanced, y_train_balanced)
+     y_pred, y_proba = evaluate_model(trained_models[name], X_test, y_test)
+     all_predictions[name] = y_pred
+     all_probabilities[name] = y_proba
+     all_metrics[name] = get_metrics(y_test, y_pred, y_proba)
+
+ print("Models trained successfully!")
+
+
+ # ============================================================================
+ # GRADIO INTERFACE FUNCTIONS
+ # ============================================================================
+
+ def get_data_overview():
+     """Return a summary of the dataset."""
+     summary = f"""
+ ## Dataset Overview
+
+ ### Training Data
+ - **Total Samples:** {len(train_df):,}
+ - **Fraud Cases:** {train_df['fraud'].sum():,} ({train_df['fraud'].mean()*100:.2f}%)
+ - **Legitimate Cases:** {(train_df['fraud']==0).sum():,} ({(1-train_df['fraud'].mean())*100:.2f}%)
+
+ ### Test Data
+ - **Total Samples:** {len(test_df):,}
+ - **Fraud Cases:** {test_df['fraud'].sum():,} ({test_df['fraud'].mean()*100:.2f}%)
+ - **Legitimate Cases:** {(test_df['fraud']==0).sum():,} ({(1-test_df['fraud'].mean())*100:.2f}%)
+
+ ### Features
+ - **Number of Features:** {X_train.shape[1]}
+ - **Feature Types:** All numeric (pre-processed and one-hot encoded)
+
+ ### Class Imbalance Handling
+ - Applied **SMOTE** (Synthetic Minority Over-sampling Technique)
+ - Training samples after SMOTE: {len(X_train_balanced):,}
+ """
+     return summary
+
+
+ def update_model_display(model_name):
+     """
+     Update all displays when a model is selected.
+     Returns metrics, classification report, and optimal threshold info.
+     """
+     metrics = all_metrics[model_name]
+     y_pred = all_predictions[model_name]
+     y_proba = all_probabilities[model_name]
+
+     # Get optimal threshold
+     best_thresh, best_f1, _, _ = find_optimal_threshold(y_test, y_proba)
+
+     # Create metrics display
+     metrics_text = f"""
+ ## {model_name} Performance Metrics
+
+ | Metric | Score |
+ |--------|-------|
+ | **Accuracy** | {metrics['Accuracy']:.4f} |
+ | **Precision** | {metrics['Precision']:.4f} |
+ | **Recall** | {metrics['Recall']:.4f} |
+ | **F1 Score** | {metrics['F1 Score']:.4f} |
+ | **ROC AUC** | {metrics['ROC AUC']:.4f} |
+
+ ### Threshold Optimization
+ - **Default Threshold:** 0.50
+ - **Optimal Threshold:** {best_thresh:.2f}
+ - **F1 at Optimal:** {best_f1:.4f}
+ """
+
+     # Classification report
+     report = classification_report(y_test, y_pred, target_names=['Legitimate', 'Fraud'])
+     report_text = f"```\n{report}\n```"
+
+     return metrics_text, report_text
+
+
505
+ def get_selected_plot(model_name, plot_type):
+     """
+     Generate the selected plot for the chosen model.
+     """
+     y_proba = all_probabilities[model_name]
+     y_pred = all_predictions[model_name]
+
+     if plot_type == "Precision-Recall Curve":
+         return plot_precision_recall_curve(y_test, y_proba, model_name)
+     elif plot_type == "ROC Curve":
+         return plot_roc_curve(y_test, y_proba, model_name)
+     elif plot_type == "Confusion Matrix":
+         return plot_confusion_matrix(y_test, y_pred, model_name)
+     elif plot_type == "Feature Importance":
+         return plot_feature_importance(trained_models[model_name], X_train.columns, model_name)
+     elif plot_type == "Threshold Analysis":
+         return plot_threshold_analysis(y_test, y_proba, model_name)
+     else:
+         return None
+
+
+ def get_comparison_results():
+     """Generate comparison table and plot."""
+     # Create comparison dataframe
+     comparison_df = pd.DataFrame(all_metrics).T
+     comparison_df = comparison_df.round(4)
+
+     # Find best model for each metric
+     best_models = comparison_df.idxmax()
+
+     summary = "## Model Comparison Summary\n\n"
+     summary += "| Metric | Best Model | Score |\n|--------|------------|-------|\n"
+     for metric in comparison_df.columns:
+         best = best_models[metric]
+         score = comparison_df.loc[best, metric]
+         summary += f"| {metric} | {best} | {score:.4f} |\n"
+
+     return comparison_df.to_markdown(), summary, plot_model_comparison(all_metrics)
+
+
+ def predict_single_claim(model_name, threshold, *feature_values):
+     """
+     Make prediction for a single claim using selected model and threshold.
+     """
+     model = trained_models[model_name]
+
+     # Create feature array
+     features = np.array(feature_values).reshape(1, -1)
+
+     # Get probability
+     proba = model.predict_proba(features)[0, 1]
+
+     # Apply threshold
+     prediction = 1 if proba >= threshold else 0
+
+     result = f"""
+ ## Prediction Result
+
+ **Model:** {model_name}
+ **Threshold:** {threshold:.2f}
+
+ ### Output
+ - **Fraud Probability:** {proba:.4f} ({proba*100:.2f}%)
+ - **Prediction:** {'🚨 FRAUDULENT' if prediction == 1 else '✅ LEGITIMATE'}
+
+ ### Interpretation
+ """
+     if prediction == 1:
+         result += "This claim has a high probability of being fraudulent and should be flagged for further investigation."
+     else:
+         result += "This claim appears to be legitimate based on the model's analysis."
+
+     return result
+
+
580
+ # ============================================================================
+ # GRADIO UI LAYOUT
+ # ============================================================================
+
+ # Create the Gradio interface
+ with gr.Blocks(title="Auto Insurance Fraud Detection", theme=gr.themes.Soft()) as demo:
+
+     gr.Markdown("""
+ # 🚗 Auto Insurance Claims Fraud Detection
+
+ This application demonstrates machine learning models for detecting fraudulent auto insurance claims.
+ The models are trained on historical claims data and can predict whether a new claim is likely to be fraudulent.
+
+ **Models Available:** XGBoost, LightGBM, Random Forest, Logistic Regression
+ """)
+
+     with gr.Tabs():
+         # Tab 1: Data Overview
+         with gr.TabItem("📊 Data Overview"):
+             gr.Markdown(get_data_overview())
+             with gr.Row():
+                 dist_plot = gr.Plot(value=plot_class_distribution(train_df, test_df),
+                                     label="Class Distribution")
+
+         # Tab 2: Model Evaluation
+         with gr.TabItem("🎯 Model Evaluation"):
+             with gr.Row():
+                 model_selector = gr.Dropdown(
+                     choices=list(models.keys()),
+                     value="XGBoost",
+                     label="Select Model"
+                 )
+                 plot_selector = gr.Dropdown(
+                     choices=["Precision-Recall Curve", "ROC Curve", "Confusion Matrix",
+                              "Feature Importance", "Threshold Analysis"],
+                     value="Precision-Recall Curve",
+                     label="Select Visualization"
+                 )
+
+             with gr.Row():
+                 with gr.Column(scale=1):
+                     metrics_display = gr.Markdown()
+                     report_display = gr.Markdown()
+                 with gr.Column(scale=2):
+                     plot_display = gr.Plot()
+
+             # Update displays when model or plot changes
+             def update_all(model_name, plot_type):
+                 metrics, report = update_model_display(model_name)
+                 plot = get_selected_plot(model_name, plot_type)
+                 return metrics, report, plot
+
+             model_selector.change(
+                 fn=update_all,
+                 inputs=[model_selector, plot_selector],
+                 outputs=[metrics_display, report_display, plot_display]
+             )
+             plot_selector.change(
+                 fn=update_all,
+                 inputs=[model_selector, plot_selector],
+                 outputs=[metrics_display, report_display, plot_display]
+             )
+
+             # Load initial values
+             demo.load(
+                 fn=update_all,
+                 inputs=[model_selector, plot_selector],
+                 outputs=[metrics_display, report_display, plot_display]
+             )
+
+         # Tab 3: Model Comparison
+         with gr.TabItem("📈 Compare Models"):
+             gr.Markdown("## All Models Performance Comparison")
+
+             comparison_table, comparison_summary, comparison_plot = get_comparison_results()
+
+             gr.Markdown(comparison_summary)
+             gr.Markdown(comparison_table)
+             gr.Plot(value=comparison_plot, label="Model Comparison Chart")
+
+         # Tab 4: Threshold Analysis
+         with gr.TabItem("⚖️ Threshold Optimization"):
+             gr.Markdown("""
+ ## Finding the Optimal Classification Threshold
+
+ In fraud detection, the default 0.5 threshold isn't always optimal.
+ We need to balance:
+ - **Precision:** Not flagging legitimate claims as fraud (customer experience)
+ - **Recall:** Catching actual frauds (financial loss prevention)
+
+ The optimal threshold maximizes F1 score, which balances both concerns.
+ """)
+
+             thresh_model = gr.Dropdown(
+                 choices=list(models.keys()),
+                 value="XGBoost",
+                 label="Select Model for Threshold Analysis"
+             )
+
+             thresh_plot = gr.Plot()
+
+             def update_threshold_plot(model_name):
+                 y_proba = all_probabilities[model_name]
+                 return plot_threshold_analysis(y_test, y_proba, model_name)
+
+             thresh_model.change(
+                 fn=update_threshold_plot,
+                 inputs=[thresh_model],
+                 outputs=[thresh_plot]
+             )
+
+             demo.load(
+                 fn=update_threshold_plot,
+                 inputs=[thresh_model],
+                 outputs=[thresh_plot]
+             )
+
+             # Show optimal thresholds for all models
+             thresh_summary = "### Optimal Thresholds by Model\n\n| Model | Optimal Threshold | F1 at Optimal |\n|-------|-------------------|---------------|\n"
+             for name in models.keys():
+                 opt_thresh, opt_f1, _, _ = find_optimal_threshold(y_test, all_probabilities[name])
+                 thresh_summary += f"| {name} | {opt_thresh:.2f} | {opt_f1:.4f} |\n"
+
+             gr.Markdown(thresh_summary)
+
+         # Tab 5: About
+         with gr.TabItem("ℹ️ About"):
+             gr.Markdown("""
+ ## About This Project
+
+ ### Business Context
+ Auto insurance fraud costs the industry billions of dollars annually.
+ This project builds machine learning models to automatically flag potentially
+ fraudulent claims for further investigation.
+
+ ### Technical Approach
+ 1. **Data Preparation:** The dataset contains 46 features describing claims and customers
+ 2. **Class Imbalance:** Only ~3% of claims are fraudulent. We use SMOTE to balance the training data
+ 3. **Model Training:** Four different algorithms are compared
+ 4. **Evaluation:** Focus on Precision-Recall metrics due to class imbalance
+ 5. **Threshold Optimization:** Find the best cutoff for business needs
+
+ ### Models Used
+ - **XGBoost:** Gradient boosting with regularization, excellent for tabular data
+ - **LightGBM:** Fast gradient boosting, memory efficient
+ - **Random Forest:** Ensemble of decision trees, robust and interpretable
+ - **Logistic Regression:** Linear baseline model, highly interpretable
+
+ ### Key Metrics Explained
+ - **Precision:** Of claims flagged as fraud, how many are actually fraudulent
+ - **Recall:** Of actual frauds, how many did we catch
+ - **F1 Score:** Harmonic mean of precision and recall
+ - **ROC AUC:** Overall discrimination ability
+
+ ### Why Precision-Recall over ROC?
+ For highly imbalanced datasets like fraud detection, Precision-Recall curves
+ give a more realistic picture of model performance than ROC curves.
+ """)
+
+
+ # Launch the app
+ if __name__ == "__main__":
+     demo.launch()
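
The hunk above repeatedly unpacks `find_optimal_threshold(y_test, y_proba)` into a 4-tuple, but the function itself is defined earlier in app.py, outside this diff. As a rough, non-authoritative sketch of what such a helper typically does — the threshold sweep and the exact return shape are assumptions inferred from the call sites:

```python
import numpy as np

def find_optimal_threshold(y_true, y_proba, step=0.01):
    """Sweep candidate thresholds and pick the one that maximizes F1.

    Returns (best_threshold, best_f1, precisions, recalls) to match the
    4-tuple unpacked in app.py; the real implementation may differ.
    """
    y_true = np.asarray(y_true)
    y_proba = np.asarray(y_proba)
    thresholds = np.arange(step, 1.0, step)
    precisions, recalls, f1_scores = [], [], []
    for t in thresholds:
        y_pred = (y_proba >= t).astype(int)
        tp = np.sum((y_pred == 1) & (y_true == 1))
        fp = np.sum((y_pred == 1) & (y_true == 0))
        fn = np.sum((y_pred == 0) & (y_true == 1))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1_scores.append(2 * precision * recall / (precision + recall)
                         if precision + recall else 0.0)
    best = int(np.argmax(f1_scores))
    return thresholds[best], f1_scores[best], precisions, recalls
```

On imbalanced fraud data a cutoff found this way usually lands below the default 0.5, which is exactly what the Threshold Optimization tab is there to visualize.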
requirements.txt ADDED
@@ -0,0 +1,20 @@
+ # Auto Insurance Fraud Detection - Dependencies
+ # For Hugging Face Spaces (CPU-only, Free Tier)
+
+ # Core ML Libraries
+ pandas==2.0.3
+ numpy==1.24.3
+ scikit-learn==1.3.0
+ xgboost==2.0.0
+ lightgbm==4.1.0
+ imbalanced-learn==0.11.0
+
+ # Visualization
+ matplotlib==3.7.2
+ seaborn==0.12.2
+
+ # Gradio for web interface
+ gradio==4.19.2
+
+ # Utilities
+ tabulate==0.9.0
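
One quirk worth knowing when sanity-checking this environment: two of the PyPI names above differ from their Python import names (`scikit-learn` imports as `sklearn`, `imbalanced-learn` as `imblearn`). A small stdlib-only script to confirm the pinned dependencies installed — the `check_installed` helper is just an illustrative name, not part of the project:

```python
import importlib

# PyPI name -> import name, per the pins in requirements.txt
PACKAGES = {
    "pandas": "pandas",
    "numpy": "numpy",
    "scikit-learn": "sklearn",
    "xgboost": "xgboost",
    "lightgbm": "lightgbm",
    "imbalanced-learn": "imblearn",
    "matplotlib": "matplotlib",
    "seaborn": "seaborn",
    "gradio": "gradio",
    "tabulate": "tabulate",
}

def check_installed(packages):
    """Return {pypi_name: version or None} for each pinned dependency."""
    report = {}
    for pypi_name, module_name in packages.items():
        try:
            module = importlib.import_module(module_name)
            report[pypi_name] = getattr(module, "__version__", "unknown")
        except ImportError:
            report[pypi_name] = None  # missing -> pip install -r requirements.txt
    return report

if __name__ == "__main__":
    for name, version in check_installed(PACKAGES).items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```

Running it inside the Space (or locally after `pip install -r requirements.txt`) should print a version for all ten packages.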
test.csv ADDED
The diff for this file is too large to render. See raw diff
 
train.csv ADDED
The diff for this file is too large to render. See raw diff