Upload 5 files

- BEGINNER_GUIDE.md (+360 lines)
- app.py (+742 lines)
- requirements.txt (+20 lines)
- test.csv
- train.csv

BEGINNER_GUIDE.md (ADDED)
# 🚀 Complete Beginner's Guide: Deploying Auto Insurance Fraud Detection on Hugging Face

This guide walks you through every step of setting up and running the fraud detection project. No prior experience with Hugging Face is required.

---

## Table of Contents
1. [What You'll Need](#what-youll-need)
2. [Step 1: Create a Hugging Face Account](#step-1-create-a-hugging-face-account)
3. [Step 2: Create a New Space](#step-2-create-a-new-space)
4. [Step 3: Upload Your Files](#step-3-upload-your-files)
5. [Step 4: Wait for Build](#step-4-wait-for-build)
6. [Step 5: Use Your App](#step-5-use-your-app)
7. [Troubleshooting Common Issues](#troubleshooting-common-issues)
8. [Running Locally (Alternative)](#running-locally-alternative)
9. [Understanding the Output](#understanding-the-output)

---

## What You'll Need

Before starting, make sure you have these 5 files ready in a folder on your computer:

| File | Description | Size (approx) |
|------|-------------|---------------|
| `app.py` | The main Python application code | ~25 KB |
| `requirements.txt` | List of Python packages needed | ~300 bytes |
| `train.csv` | Training dataset | ~1.9 MB |
| `test.csv` | Test dataset | ~470 KB |
| `README.md` | Documentation for the Space | ~4 KB |

You should also have:
- `report.docx` - The APA report (keep this separate, not uploaded to Hugging Face)
- This guide (`BEGINNER_GUIDE.md`) for reference

---

## Step 1: Create a Hugging Face Account

1. **Go to Hugging Face**: Open your browser and visit [https://huggingface.co](https://huggingface.co)

2. **Click "Sign Up"**: Look for the button in the top-right corner

3. **Fill in your details**:
   - Username: Choose something memorable (e.g., `yourname_student`)
   - Email: Use your school or personal email
   - Password: Make it secure

4. **Verify your email**: Check your inbox and click the verification link

5. **Complete your profile** (optional but recommended)

---

## Step 2: Create a New Space

Hugging Face "Spaces" are where you host web applications. Here's how to create one:

1. **Go to Spaces**: After logging in, click on your profile picture → "New Space"

   Or directly visit: [https://huggingface.co/new-space](https://huggingface.co/new-space)

2. **Configure your Space**:

   | Setting | What to Enter |
   |---------|---------------|
   | **Space name** | `fraud-detection` (or any name you like) |
   | **License** | MIT (allows others to use your code) |
   | **SDK** | Select **Gradio** |
   | **SDK Version** | Leave as default (or select 4.19.2) |
   | **Hardware** | **CPU basic** (free tier) |
   | **Visibility** | Public (or Private if you prefer) |

3. **Click "Create Space"**

You now have an empty Space! It will show an error because there's no code yet—that's normal.

---

## Step 3: Upload Your Files

You have two options for uploading files:

### Option A: Web Interface (Easiest)

1. **Go to your Space**: Click on your Space name (e.g., `yourusername/fraud-detection`)

2. **Click "Files and versions"** tab

3. **Click "+ Add file"** → **"Upload files"**

4. **Upload these files one by one or all together**:
   - `app.py`
   - `requirements.txt`
   - `train.csv`
   - `test.csv`
   - `README.md`

5. **Commit the changes**: After each upload (or batch), you'll see a "Commit" button. Click it.

⚠️ **Important**: Upload the README.md file. The one you created should replace any default README.

### Option B: Git (For more advanced users)

If you're familiar with Git, you can clone and push:

```bash
# Clone your space
git clone https://huggingface.co/spaces/YOUR_USERNAME/fraud-detection
cd fraud-detection

# Copy your files into this folder
cp /path/to/your/files/* .

# Add, commit, and push
git add .
git commit -m "Initial upload of fraud detection app"
git push
```

---

## Step 4: Wait for Build

After uploading, Hugging Face automatically builds your app. Here's what happens:

1. **Building** (1-3 minutes): The status shows "Building"
   - It installs packages from `requirements.txt`
   - It prepares the environment

2. **Running** (3-5 minutes the first time): The status shows "Running"
   - Your `app.py` code executes
   - Models are trained
   - The interface loads

3. **App Ready**: You'll see your app interface!

### What the logs show

You can click "Logs" to see what's happening:

```
Loading data...
Applying SMOTE to handle class imbalance...
Training models (this may take a moment)...
Training XGBoost...
Training LightGBM...
Training Random Forest...
Training Logistic Regression...
Models trained successfully!
Running on local URL: http://0.0.0.0:7860
```

When you see that last line, your app is ready!

---

## Step 5: Use Your App

Once your app is running, you'll see an interactive interface with 5 tabs:

### Tab 1: 📊 Data Overview
- Shows dataset statistics
- Displays class distribution pie charts
- Explains the imbalance problem
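The imbalance this tab highlights is easy to check by hand. A minimal sketch (the 3% fraud rate and the counts below are illustrative, not taken from the actual dataset):

```python
# Hypothetical label column: 0 = legitimate, 1 = fraud (made-up 3% fraud rate).
labels = [0] * 970 + [1] * 30

fraud_rate = sum(labels) / len(labels)   # fraction of fraudulent claims
majority_rate = 1 - fraud_rate

# A "predict everything legitimate" baseline already scores 97% accuracy here,
# which is why accuracy alone is misleading on imbalanced data.
print(f"fraud rate: {fraud_rate:.1%}, naive accuracy: {majority_rate:.1%}")
```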
### Tab 2: 🔍 Model Evaluation
- **Select a model** from the dropdown (XGBoost, LightGBM, Random Forest, or Logistic Regression)
- **Select a visualization** to see:
  - Precision-Recall Curve
  - ROC Curve
  - Confusion Matrix
  - Feature Importance
  - Threshold Analysis
- View performance metrics and classification report

### Tab 3: 📈 Compare Models
- Side-by-side comparison of all 4 models
- Bar chart showing metrics
- Table with best model for each metric

### Tab 4: ⚖️ Threshold Optimization
- Interactive plot showing precision/recall trade-off
- Table of optimal thresholds for each model
- Explains why 0.5 isn't always the best threshold
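The threshold sweep behind this tab can be sketched in a few lines. This is a toy example with made-up probabilities, not the app's actual code:

```python
# Hypothetical model outputs: true labels and predicted fraud probabilities.
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
y_proba = [0.05, 0.20, 0.35, 0.40, 0.55, 0.70, 0.30, 0.90]

def f1_at(threshold):
    """F1 score when claims with probability >= threshold are flagged as fraud."""
    y_pred = [1 if p >= threshold else 0 for p in y_proba]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Sweep thresholds from 0.10 to 0.89 and keep the one with the best F1.
best_threshold = max((t / 100 for t in range(10, 90)), key=f1_at)
```

On this toy data the best threshold lands well below 0.5, which is exactly the point the tab makes for imbalanced fraud data.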
### Tab 5: ℹ️ About
- Project documentation
- Technical details
- Metrics explanations

---

## Troubleshooting Common Issues

### Issue 1: "Application Error" or Build Fails

**Possible causes and solutions**:

| Problem | Solution |
|---------|----------|
| Missing file | Check all 5 files are uploaded |
| Wrong filename | Files must be exactly: `app.py`, `requirements.txt`, `train.csv`, `test.csv`, `README.md` |
| Corrupted CSV | Re-download the CSV files and upload again |
| Package conflict | Check the logs for specific error messages |

### Issue 2: "Out of Memory" Error

The free tier has limited memory. This shouldn't happen with our code, but if it does:
- Reduce `n_estimators` in the models (e.g., from 100 to 50)
- The app automatically uses efficient settings for the free tier

### Issue 3: App Takes Too Long to Load

Normal behavior! The first load takes 3-5 minutes because:
- It needs to install packages
- It trains 4 machine learning models
- Subsequent visits are faster (cache helps)

### Issue 4: Graphs Don't Update

- Try clicking the dropdown again
- Refresh the page (Ctrl+R or Cmd+R)
- Wait a few seconds—processing takes time

### Issue 5: "No such file: train.csv"

The CSV files weren't uploaded correctly:
1. Go to "Files and versions"
2. Verify `train.csv` and `test.csv` are listed
3. If not, upload them again
4. Make sure filenames are lowercase

---

## Running Locally (Alternative)

If you prefer to run on your own computer instead of Hugging Face:

### Step 1: Install Python
Make sure you have Python 3.8+ installed. Check with:
```bash
python --version
```

### Step 2: Create a Project Folder
```bash
mkdir fraud_detection
cd fraud_detection
```

### Step 3: Copy All Files
Place these files in your folder:
- `app.py`
- `requirements.txt`
- `train.csv`
- `test.csv`

### Step 4: Create a Virtual Environment (Recommended)
```bash
# Create virtual environment
python -m venv venv

# Activate it
# On Windows:
venv\Scripts\activate
# On Mac/Linux:
source venv/bin/activate
```

### Step 5: Install Dependencies
```bash
pip install -r requirements.txt
```

This may take 2-5 minutes to download and install all packages.

### Step 6: Run the App
```bash
python app.py
```

### Step 7: Open in Browser
You'll see output like:
```
Running on local URL: http://127.0.0.1:7860
```
Open that URL in your browser.

---

## Understanding the Output

### What the Metrics Mean

| Metric | What It Tells You | Good Value |
|--------|------------------|------------|
| **Accuracy** | Overall correct predictions | >95% (but misleading for imbalanced data) |
| **Precision** | When we say "fraud", how often are we right? | >50% is good |
| **Recall** | Of all actual frauds, how many did we catch? | >70% is good |
| **F1 Score** | Balance of precision and recall | >0.5 is decent, >0.6 is good |
| **ROC AUC** | Overall discrimination ability | >0.9 is excellent |
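To make the table concrete, here is how precision, recall, and F1 fall out of raw counts. The ten labels below are made up purely for illustration:

```python
# Hypothetical results on ten claims: 1 = fraud, 0 = legitimate.
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # caught frauds
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # false alarms
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # missed frauds

precision = tp / (tp + fp)   # when we say "fraud", how often are we right?
recall = tp / (tp + fn)      # of all actual frauds, how many did we catch?
f1 = 2 * precision * recall / (precision + recall)
```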
### Reading the Confusion Matrix

```
                  Predicted
               Legit    Fraud
Actual Legit   [TN]     [FP]
       Fraud   [FN]     [TP]
```

- **TN (True Negative)**: Correctly identified legitimate claims ✓
- **FP (False Positive)**: Legitimate claims wrongly flagged as fraud ✗
- **FN (False Negative)**: Frauds we missed ✗
- **TP (True Positive)**: Correctly caught frauds ✓
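The four counts can be tallied directly from paired labels. A small sketch with made-up data:

```python
# Hypothetical labels: 0 = legitimate, 1 = fraud.
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# cm[actual][predicted]: rows = true class, columns = predicted class.
cm = [[0, 0], [0, 0]]
for t, p in zip(y_true, y_pred):
    cm[t][p] += 1

tn, fp = cm[0]   # legitimate claims: correctly passed / wrongly flagged
fn, tp = cm[1]   # fraudulent claims: missed / correctly caught
```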
### Interpreting Feature Importance

The feature importance plot shows which variables most influence the model's decisions:

- **High importance features** = strong predictors of fraud
- For example, `total_claim_amount` being important means higher claims correlate with fraud
- Use this to understand what patterns the model learned
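Ranking importances is just a sort over (feature, score) pairs. The scores below are invented for illustration, not taken from the trained models:

```python
# Hypothetical importance scores, as a fitted tree model might report them.
importances = {
    "total_claim_amount": 0.31,
    "incident_severity": 0.24,
    "policy_annual_premium": 0.08,
    "age": 0.05,
}

# Sort descending so the strongest predictors come first.
ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)
top_feature = ranked[0][0]
```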
---

## File Checklist

Before deploying, verify you have all files:

- [ ] `app.py` - Main application (~650 lines of code)
- [ ] `requirements.txt` - Package list (11 packages)
- [ ] `train.csv` - 16,001 rows including header
- [ ] `test.csv` - 4,001 rows including header
- [ ] `README.md` - Documentation for the Space

Keep separately (don't upload to Hugging Face):
- [ ] `report.docx` - Your APA report for submission
- [ ] `BEGINNER_GUIDE.md` - This guide

---

## Tips for Success

1. **Be patient**: First build takes a few minutes
2. **Check logs**: They tell you exactly what's happening
3. **Verify uploads**: Make sure file sizes match expectations
4. **Use the right SDK**: Must be Gradio, not Streamlit
5. **Test locally first**: If something doesn't work, debug locally before deploying

---

## Need Help?

- **Hugging Face Documentation**: [https://huggingface.co/docs/hub/spaces](https://huggingface.co/docs/hub/spaces)
- **Gradio Guides**: [https://gradio.app/guides](https://gradio.app/guides)
- **Common Issues**: Check the "Troubleshooting" section above

Good luck with your fraud detection project! 🎉

app.py (ADDED, +742 lines)
| 1 |
+
"""
|
| 2 |
+
Auto Insurance Claims Fraud Detection
|
| 3 |
+
=====================================
|
| 4 |
+
A machine learning application that trains and compares 4 different models
|
| 5 |
+
for detecting fraudulent insurance claims.
|
| 6 |
+
|
| 7 |
+
Models: XGBoost, LightGBM, Random Forest, Logistic Regression
|
| 8 |
+
Author: Data Science Project
|
| 9 |
+
"""
|
| 10 |
+
|
| 11 |
+
import gradio as gr
|
| 12 |
+
import pandas as pd
|
| 13 |
+
import numpy as np
|
| 14 |
+
import matplotlib.pyplot as plt
|
| 15 |
+
import seaborn as sns
|
| 16 |
+
from io import BytesIO
|
| 17 |
+
import base64
|
| 18 |
+
import warnings
|
| 19 |
+
warnings.filterwarnings('ignore')
|
| 20 |
+
|
| 21 |
+
# ML Libraries
|
| 22 |
+
from sklearn.model_selection import cross_val_score
|
| 23 |
+
from sklearn.metrics import (
|
| 24 |
+
precision_recall_curve, roc_curve, auc,
|
| 25 |
+
confusion_matrix, classification_report,
|
| 26 |
+
f1_score, precision_score, recall_score, accuracy_score
|
| 27 |
+
)
|
| 28 |
+
from sklearn.linear_model import LogisticRegression
|
| 29 |
+
from sklearn.ensemble import RandomForestClassifier
|
| 30 |
+
from xgboost import XGBClassifier
|
| 31 |
+
from lightgbm import LGBMClassifier
|
| 32 |
+
from imblearn.over_sampling import SMOTE
|
| 33 |
+
|
| 34 |
+
# Set style for all plots - using try/except for compatibility
|
| 35 |
+
try:
|
| 36 |
+
plt.style.use('seaborn-v0_8-whitegrid')
|
| 37 |
+
except:
|
| 38 |
+
try:
|
| 39 |
+
plt.style.use('seaborn-whitegrid')
|
| 40 |
+
except:
|
| 41 |
+
plt.style.use('ggplot') # Fallback style
|
| 42 |
+
sns.set_palette("husl")
|
| 43 |
+
|
| 44 |
+
# ============================================================================
|
| 45 |
+
# DATA LOADING AND PREPROCESSING
|
| 46 |
+
# ============================================================================
|
| 47 |
+
|
| 48 |
+
def load_and_prepare_data():
|
| 49 |
+
"""
|
| 50 |
+
Load the train and test datasets.
|
| 51 |
+
The data is already preprocessed and one-hot encoded.
|
| 52 |
+
"""
|
| 53 |
+
# Load datasets
|
| 54 |
+
train_df = pd.read_csv('train.csv')
|
| 55 |
+
test_df = pd.read_csv('test.csv')
|
| 56 |
+
|
| 57 |
+
# Separate features and target
|
| 58 |
+
# 'fraud' is our target variable (0 = legitimate, 1 = fraudulent)
|
| 59 |
+
X_train = train_df.drop('fraud', axis=1)
|
| 60 |
+
y_train = train_df['fraud']
|
| 61 |
+
X_test = test_df.drop('fraud', axis=1)
|
| 62 |
+
y_test = test_df['fraud']
|
| 63 |
+
|
| 64 |
+
return X_train, X_test, y_train, y_test, train_df, test_df
|
| 65 |
+
|
| 66 |
+
|
| 67 |
+
def apply_smote(X_train, y_train):
|
| 68 |
+
"""
|
| 69 |
+
Apply SMOTE to handle class imbalance.
|
| 70 |
+
Fraud cases are rare (~3%), so we oversample the minority class.
|
| 71 |
+
"""
|
| 72 |
+
smote = SMOTE(random_state=42)
|
| 73 |
+
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
|
| 74 |
+
return X_resampled, y_resampled
|
| 75 |
+
|
| 76 |
+
|
| 77 |
+
# ============================================================================
|
| 78 |
+
# MODEL DEFINITIONS
|
| 79 |
+
# ============================================================================
|
| 80 |
+
|
| 81 |
+
def get_models():
|
| 82 |
+
"""
|
| 83 |
+
Define the 4 models we'll compare.
|
| 84 |
+
Each model is tuned for imbalanced fraud detection.
|
| 85 |
+
"""
|
| 86 |
+
models = {
|
| 87 |
+
'XGBoost': XGBClassifier(
|
| 88 |
+
n_estimators=100,
|
| 89 |
+
max_depth=4,
|
| 90 |
+
learning_rate=0.1,
|
| 91 |
+
scale_pos_weight=10, # Helps with imbalanced data
|
| 92 |
+
random_state=42,
|
| 93 |
+
use_label_encoder=False,
|
| 94 |
+
eval_metric='logloss'
|
| 95 |
+
),
|
| 96 |
+
'LightGBM': LGBMClassifier(
|
| 97 |
+
n_estimators=100,
|
| 98 |
+
max_depth=4,
|
| 99 |
+
learning_rate=0.1,
|
| 100 |
+
class_weight='balanced', # Handles imbalance internally
|
| 101 |
+
random_state=42,
|
| 102 |
+
verbose=-1
|
| 103 |
+
),
|
| 104 |
+
'Random Forest': RandomForestClassifier(
|
| 105 |
+
n_estimators=100,
|
| 106 |
+
max_depth=6,
|
| 107 |
+
class_weight='balanced',
|
| 108 |
+
random_state=42,
|
| 109 |
+
n_jobs=-1
|
| 110 |
+
),
|
| 111 |
+
'Logistic Regression': LogisticRegression(
|
| 112 |
+
class_weight='balanced',
|
| 113 |
+
max_iter=1000,
|
| 114 |
+
random_state=42
|
| 115 |
+
)
|
| 116 |
+
}
|
| 117 |
+
return models
|
| 118 |
+
|
| 119 |
+
|
| 120 |
+
# ============================================================================
|
| 121 |
+
# MODEL TRAINING AND EVALUATION
|
| 122 |
+
# ============================================================================
|
| 123 |
+
|
| 124 |
+
def train_model(model, X_train, y_train):
|
| 125 |
+
"""Train a single model and return the fitted model."""
|
| 126 |
+
model.fit(X_train, y_train)
|
| 127 |
+
return model
|
| 128 |
+
|
| 129 |
+
|
| 130 |
+
def evaluate_model(model, X_test, y_test):
|
| 131 |
+
"""
|
| 132 |
+
Get predictions and probabilities from a trained model.
|
| 133 |
+
Returns both hard predictions and probability scores.
|
| 134 |
+
"""
|
| 135 |
+
y_pred = model.predict(X_test)
|
| 136 |
+
y_proba = model.predict_proba(X_test)[:, 1] # Probability of fraud
|
| 137 |
+
return y_pred, y_proba
|
| 138 |
+
|
| 139 |
+
|
| 140 |
+
def get_metrics(y_test, y_pred, y_proba):
|
| 141 |
+
"""
|
| 142 |
+
Calculate all relevant metrics for fraud detection.
|
| 143 |
+
For imbalanced data, we focus on Precision, Recall, and F1.
|
| 144 |
+
"""
|
| 145 |
+
metrics = {
|
| 146 |
+
'Accuracy': accuracy_score(y_test, y_pred),
|
| 147 |
+
'Precision': precision_score(y_test, y_pred, zero_division=0),
|
| 148 |
+
'Recall': recall_score(y_test, y_pred, zero_division=0),
|
| 149 |
+
'F1 Score': f1_score(y_test, y_pred, zero_division=0),
|
| 150 |
+
'ROC AUC': auc(*roc_curve(y_test, y_proba)[:2])
|
| 151 |
+
}
|
| 152 |
+
return metrics
|
| 153 |
+
|
| 154 |
+
|
| 155 |
+
def find_optimal_threshold(y_test, y_proba):
|
| 156 |
+
"""
|
| 157 |
+
Find the optimal classification threshold using F1 score.
|
| 158 |
+
Default threshold is 0.5, but for imbalanced data,
|
| 159 |
+
a different threshold often works better.
|
| 160 |
+
"""
|
| 161 |
+
thresholds = np.arange(0.1, 0.9, 0.01)
|
| 162 |
+
f1_scores = []
|
| 163 |
+
|
| 164 |
+
for thresh in thresholds:
|
| 165 |
+
y_pred_thresh = (y_proba >= thresh).astype(int)
|
| 166 |
+
f1 = f1_score(y_test, y_pred_thresh, zero_division=0)
|
| 167 |
+
f1_scores.append(f1)
|
| 168 |
+
|
| 169 |
+
# Find threshold with best F1 score
|
| 170 |
+
best_idx = np.argmax(f1_scores)
|
| 171 |
+
best_threshold = thresholds[best_idx]
|
| 172 |
+
best_f1 = f1_scores[best_idx]
|
| 173 |
+
|
| 174 |
+
return best_threshold, best_f1, thresholds, f1_scores
|
| 175 |
+
|
| 176 |
+
|
| 177 |
+
# ============================================================================
|
| 178 |
+
# VISUALIZATION FUNCTIONS
|
| 179 |
+
# ============================================================================
|
| 180 |
+
|
| 181 |
+
def plot_precision_recall_curve(y_test, y_proba, model_name):
|
| 182 |
+
"""
|
| 183 |
+
Plot Precision-Recall curve.
|
| 184 |
+
This is the most important metric for fraud detection because
|
| 185 |
+
we care about catching frauds (recall) without too many false alarms (precision).
|
| 186 |
+
"""
|
| 187 |
+
precision, recall, thresholds = precision_recall_curve(y_test, y_proba)
|
| 188 |
+
pr_auc = auc(recall, precision)
|
| 189 |
+
|
| 190 |
+
fig, ax = plt.subplots(figsize=(8, 6))
|
| 191 |
+
ax.plot(recall, precision, 'b-', linewidth=2, label=f'{model_name} (AUC = {pr_auc:.3f})')
|
| 192 |
+
ax.fill_between(recall, precision, alpha=0.2)
|
| 193 |
+
|
| 194 |
+
# Add baseline (random classifier)
|
| 195 |
+
baseline = y_test.mean()
|
| 196 |
+
ax.axhline(y=baseline, color='r', linestyle='--', label=f'Baseline = {baseline:.3f}')
|
| 197 |
+
|
| 198 |
+
ax.set_xlabel('Recall (Fraud Detection Rate)', fontsize=12)
|
| 199 |
+
ax.set_ylabel('Precision (True Fraud Rate)', fontsize=12)
|
| 200 |
+
ax.set_title(f'Precision-Recall Curve: {model_name}', fontsize=14, fontweight='bold')
|
| 201 |
+
ax.legend(loc='best')
|
| 202 |
+
ax.set_xlim([0, 1])
|
| 203 |
+
ax.set_ylim([0, 1])
|
| 204 |
+
ax.grid(True, alpha=0.3)
|
| 205 |
+
|
| 206 |
+
plt.tight_layout()
|
| 207 |
+
return fig
|
| 208 |
+
|
| 209 |
+
|
| 210 |
+
def plot_roc_curve(y_test, y_proba, model_name):
|
| 211 |
+
"""
|
| 212 |
+
Plot ROC curve showing true positive rate vs false positive rate.
|
| 213 |
+
AUC closer to 1 means better discrimination between fraud and legitimate claims.
|
| 214 |
+
"""
|
| 215 |
+
fpr, tpr, thresholds = roc_curve(y_test, y_proba)
|
| 216 |
+
roc_auc = auc(fpr, tpr)
|
| 217 |
+
|
| 218 |
+
fig, ax = plt.subplots(figsize=(8, 6))
|
| 219 |
+
ax.plot(fpr, tpr, 'b-', linewidth=2, label=f'{model_name} (AUC = {roc_auc:.3f})')
|
| 220 |
+
ax.fill_between(fpr, tpr, alpha=0.2)
|
| 221 |
+
|
| 222 |
+
# Random classifier line
|
| 223 |
+
ax.plot([0, 1], [0, 1], 'r--', label='Random Classifier')
|
| 224 |
+
|
| 225 |
+
ax.set_xlabel('False Positive Rate', fontsize=12)
|
| 226 |
+
ax.set_ylabel('True Positive Rate (Recall)', fontsize=12)
|
| 227 |
+
ax.set_title(f'ROC Curve: {model_name}', fontsize=14, fontweight='bold')
|
| 228 |
+
ax.legend(loc='lower right')
|
| 229 |
+
ax.set_xlim([0, 1])
|
| 230 |
+
ax.set_ylim([0, 1])
|
| 231 |
+
ax.grid(True, alpha=0.3)
|
| 232 |
+
|
| 233 |
+
plt.tight_layout()
|
| 234 |
+
return fig
|
| 235 |
+
|
| 236 |
+
|
| 237 |
+
def plot_confusion_matrix(y_test, y_pred, model_name):
|
| 238 |
+
"""
|
| 239 |
+
Plot confusion matrix as a heatmap.
|
| 240 |
+
Shows: True Negatives, False Positives, False Negatives, True Positives.
|
| 241 |
+
"""
|
| 242 |
+
cm = confusion_matrix(y_test, y_pred)
|
| 243 |
+
|
| 244 |
+
fig, ax = plt.subplots(figsize=(8, 6))
|
| 245 |
+
|
| 246 |
+
# Create heatmap with custom colors
|
| 247 |
+
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=ax,
|
| 248 |
+
xticklabels=['Legitimate', 'Fraud'],
|
| 249 |
+
yticklabels=['Legitimate', 'Fraud'],
|
| 250 |
+
annot_kws={'size': 16})
|
| 251 |
+
|
| 252 |
+
ax.set_xlabel('Predicted Label', fontsize=12)
|
| 253 |
+
ax.set_ylabel('True Label', fontsize=12)
|
| 254 |
+
ax.set_title(f'Confusion Matrix: {model_name}', fontsize=14, fontweight='bold')
|
| 255 |
+
|
| 256 |
+
# Add text annotations explaining the quadrants
|
| 257 |
+
total = cm.sum()
|
| 258 |
+
tn, fp, fn, tp = cm.ravel()
|
| 259 |
+
|
| 260 |
+
text = f"TN: {tn} | FP: {fp}\nFN: {fn} | TP: {tp}"
|
| 261 |
+
ax.text(1.4, 0.5, text, transform=ax.transAxes, fontsize=10,
|
| 262 |
+
verticalalignment='center', bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))
|
| 263 |
+
|
| 264 |
+
plt.tight_layout()
|
| 265 |
+
return fig
|
| 266 |
+
|
| 267 |
+
|
| 268 |
+
def plot_feature_importance(model, feature_names, model_name):
    """
    Plot top 15 most important features.
    Different models calculate importance differently:
    - Tree models: based on split gain
    - Logistic Regression: based on coefficient magnitude
    """
    fig, ax = plt.subplots(figsize=(10, 8))

    # Get feature importances based on model type
    if hasattr(model, 'feature_importances_'):
        importances = model.feature_importances_
    elif hasattr(model, 'coef_'):
        importances = np.abs(model.coef_[0])
    else:
        # Fallback: return empty plot
        ax.text(0.5, 0.5, 'Feature importance not available',
                ha='center', va='center', fontsize=14)
        return fig

    # Create dataframe and sort by importance
    importance_df = pd.DataFrame({
        'Feature': feature_names,
        'Importance': importances
    }).sort_values('Importance', ascending=True).tail(15)

    # Horizontal bar chart
    colors = plt.cm.Blues(np.linspace(0.4, 0.8, len(importance_df)))
    ax.barh(importance_df['Feature'], importance_df['Importance'], color=colors)

    ax.set_xlabel('Importance Score', fontsize=12)
    ax.set_title(f'Top 15 Feature Importances: {model_name}', fontsize=14, fontweight='bold')
    ax.grid(True, alpha=0.3, axis='x')

    plt.tight_layout()
    return fig

def plot_threshold_analysis(y_test, y_proba, model_name):
    """
    Plot how different thresholds affect precision, recall, and F1.
    Helps visualize the trade-off and find the optimal threshold.
    """
    thresholds = np.arange(0.05, 0.95, 0.01)
    precisions = []
    recalls = []
    f1_scores = []

    for thresh in thresholds:
        y_pred_thresh = (y_proba >= thresh).astype(int)
        precisions.append(precision_score(y_test, y_pred_thresh, zero_division=0))
        recalls.append(recall_score(y_test, y_pred_thresh, zero_division=0))
        f1_scores.append(f1_score(y_test, y_pred_thresh, zero_division=0))

    # Find optimal threshold
    best_idx = np.argmax(f1_scores)
    best_threshold = thresholds[best_idx]

    fig, ax = plt.subplots(figsize=(10, 6))

    ax.plot(thresholds, precisions, 'b-', linewidth=2, label='Precision')
    ax.plot(thresholds, recalls, 'g-', linewidth=2, label='Recall')
    ax.plot(thresholds, f1_scores, 'r-', linewidth=2, label='F1 Score')

    # Mark optimal threshold
    ax.axvline(x=best_threshold, color='purple', linestyle='--',
               label=f'Optimal Threshold = {best_threshold:.2f}')
    ax.axvline(x=0.5, color='gray', linestyle=':', alpha=0.7, label='Default (0.5)')

    ax.set_xlabel('Classification Threshold', fontsize=12)
    ax.set_ylabel('Score', fontsize=12)
    ax.set_title(f'Threshold Analysis: {model_name}', fontsize=14, fontweight='bold')
    ax.legend(loc='best')
    ax.set_xlim([0, 1])
    ax.set_ylim([0, 1])
    ax.grid(True, alpha=0.3)

    plt.tight_layout()
    return fig

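The sweep in `plot_threshold_analysis` can be illustrated in isolation. A minimal pure-Python sketch on tiny hand-made labels and scores (illustrative data only, not the app's dataset), computing F1 at each candidate threshold and keeping the first maximizer, as `np.argmax` does:

```python
# Pure-Python sketch of the threshold sweep (toy data, not the app's).
def f1_at(y_true, y_proba, thresh):
    y_pred = [1 if p >= thresh else 0 for p in y_proba]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
y_proba = [0.1, 0.2, 0.15, 0.3, 0.45, 0.55, 0.6, 0.8, 0.35, 0.05]

thresholds = [i / 100 for i in range(5, 95)]
scores = [(f1_at(y_true, y_proba, t), t) for t in thresholds]
best_f1, best_thresh = max(scores, key=lambda s: s[0])  # first max, like argmax
print(f"best threshold: {best_thresh:.2f}, F1: {best_f1:.3f}")
# best threshold: 0.56, F1: 0.800
```

Here the positive with score 0.35 cannot be separated from the 0.45 and 0.55 negatives, so the sweep settles on a threshold that sacrifices that one fraud for perfect precision — the same kind of trade-off the app's curves visualize.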
def plot_class_distribution(train_df, test_df):
    """
    Plot the class distribution showing imbalance.
    Fraud is rare (~3%), which is typical in real-world scenarios.
    """
    fig, axes = plt.subplots(1, 2, figsize=(12, 5))

    # Training data distribution
    train_counts = train_df['fraud'].value_counts()
    colors = ['#2ecc71', '#e74c3c']
    axes[0].pie(train_counts, labels=['Legitimate', 'Fraud'], autopct='%1.1f%%',
                colors=colors, explode=(0, 0.1), shadow=True, startangle=90)
    axes[0].set_title('Training Data Distribution', fontsize=14, fontweight='bold')

    # Test data distribution
    test_counts = test_df['fraud'].value_counts()
    axes[1].pie(test_counts, labels=['Legitimate', 'Fraud'], autopct='%1.1f%%',
                colors=colors, explode=(0, 0.1), shadow=True, startangle=90)
    axes[1].set_title('Test Data Distribution', fontsize=14, fontweight='bold')

    plt.suptitle('Class Imbalance in Fraud Detection Dataset', fontsize=16, fontweight='bold', y=1.02)
    plt.tight_layout()
    return fig

def plot_model_comparison(all_metrics):
    """
    Bar chart comparing all 4 models across different metrics.
    """
    fig, ax = plt.subplots(figsize=(12, 6))

    models = list(all_metrics.keys())
    metrics = ['Accuracy', 'Precision', 'Recall', 'F1 Score', 'ROC AUC']

    x = np.arange(len(metrics))
    width = 0.2

    colors = ['#3498db', '#2ecc71', '#e74c3c', '#9b59b6']

    for i, model in enumerate(models):
        values = [all_metrics[model][m] for m in metrics]
        ax.bar(x + i * width, values, width, label=model, color=colors[i])

    ax.set_ylabel('Score', fontsize=12)
    ax.set_title('Model Performance Comparison', fontsize=14, fontweight='bold')
    ax.set_xticks(x + width * 1.5)
    ax.set_xticklabels(metrics)
    ax.legend(loc='upper left', bbox_to_anchor=(1, 1))
    ax.set_ylim([0, 1])
    ax.grid(True, alpha=0.3, axis='y')

    # Add value labels on bars
    for i, model in enumerate(models):
        values = [all_metrics[model][m] for m in metrics]
        for j, v in enumerate(values):
            ax.text(x[j] + i * width, v + 0.02, f'{v:.2f}', ha='center', va='bottom', fontsize=8)

    plt.tight_layout()
    return fig

# ============================================================================
# GLOBAL VARIABLES (loaded once at startup)
# ============================================================================

print("Loading data...")
X_train, X_test, y_train, y_test, train_df, test_df = load_and_prepare_data()

print("Applying SMOTE to handle class imbalance...")
X_train_balanced, y_train_balanced = apply_smote(X_train, y_train)

print("Training models (this may take a moment)...")
models = get_models()
trained_models = {}
all_metrics = {}
all_predictions = {}
all_probabilities = {}

for name, model in models.items():
    print(f"  Training {name}...")
    trained_models[name] = train_model(model, X_train_balanced, y_train_balanced)
    y_pred, y_proba = evaluate_model(trained_models[name], X_test, y_test)
    all_predictions[name] = y_pred
    all_probabilities[name] = y_proba
    all_metrics[name] = get_metrics(y_test, y_pred, y_proba)

print("Models trained successfully!")

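The `apply_smote` call above wraps `imbalanced-learn`'s SMOTE (defined earlier in the file). Conceptually, SMOTE synthesizes new minority-class samples by interpolating between an existing minority point and one of its same-class nearest neighbors. A toy pure-Python sketch of that interpolation idea (not the library's implementation — it picks a random same-class partner instead of a k-nearest neighbor to keep the sketch short):

```python
import random

random.seed(42)

# Toy minority-class (fraud) points in a 2-D feature space.
minority = [(1.0, 1.0), (2.0, 1.5), (1.5, 2.0)]

def smote_like_sample(points):
    """Interpolate between two minority points -- the core SMOTE idea.
    (Real SMOTE picks the partner among the k nearest same-class
    neighbors; random choice keeps this illustration minimal.)"""
    a = random.choice(points)
    b = random.choice(points)
    gap = random.random()  # position along the segment from a to b
    return tuple(ai + gap * (bi - ai) for ai, bi in zip(a, b))

synthetic = [smote_like_sample(minority) for _ in range(5)]
print(synthetic)
```

Because each synthetic point is a convex combination of existing minority points, it always lies inside the minority class's convex hull — new samples resemble real frauds rather than random noise.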
# ============================================================================
# GRADIO INTERFACE FUNCTIONS
# ============================================================================

def get_data_overview():
    """Return a summary of the dataset."""
    summary = f"""
## Dataset Overview

### Training Data
- **Total Samples:** {len(train_df):,}
- **Fraud Cases:** {train_df['fraud'].sum():,} ({train_df['fraud'].mean()*100:.2f}%)
- **Legitimate Cases:** {(train_df['fraud']==0).sum():,} ({(1-train_df['fraud'].mean())*100:.2f}%)

### Test Data
- **Total Samples:** {len(test_df):,}
- **Fraud Cases:** {test_df['fraud'].sum():,} ({test_df['fraud'].mean()*100:.2f}%)
- **Legitimate Cases:** {(test_df['fraud']==0).sum():,} ({(1-test_df['fraud'].mean())*100:.2f}%)

### Features
- **Number of Features:** {X_train.shape[1]}
- **Feature Types:** All numeric (pre-processed and one-hot encoded)

### Class Imbalance Handling
- Applied **SMOTE** (Synthetic Minority Over-sampling Technique)
- Training samples after SMOTE: {len(X_train_balanced):,}
"""
    return summary

def update_model_display(model_name):
    """
    Update all displays when a model is selected.
    Returns metrics, classification report, and optimal threshold info.
    """
    metrics = all_metrics[model_name]
    y_pred = all_predictions[model_name]
    y_proba = all_probabilities[model_name]

    # Get optimal threshold
    best_thresh, best_f1, _, _ = find_optimal_threshold(y_test, y_proba)

    # Create metrics display
    metrics_text = f"""
## {model_name} Performance Metrics

| Metric | Score |
|--------|-------|
| **Accuracy** | {metrics['Accuracy']:.4f} |
| **Precision** | {metrics['Precision']:.4f} |
| **Recall** | {metrics['Recall']:.4f} |
| **F1 Score** | {metrics['F1 Score']:.4f} |
| **ROC AUC** | {metrics['ROC AUC']:.4f} |

### Threshold Optimization
- **Default Threshold:** 0.50
- **Optimal Threshold:** {best_thresh:.2f}
- **F1 at Optimal:** {best_f1:.4f}
"""

    # Classification report
    report = classification_report(y_test, y_pred, target_names=['Legitimate', 'Fraud'])
    report_text = f"```\n{report}\n```"

    return metrics_text, report_text

def get_selected_plot(model_name, plot_type):
    """
    Generate the selected plot for the chosen model.
    """
    y_proba = all_probabilities[model_name]
    y_pred = all_predictions[model_name]

    if plot_type == "Precision-Recall Curve":
        return plot_precision_recall_curve(y_test, y_proba, model_name)
    elif plot_type == "ROC Curve":
        return plot_roc_curve(y_test, y_proba, model_name)
    elif plot_type == "Confusion Matrix":
        return plot_confusion_matrix(y_test, y_pred, model_name)
    elif plot_type == "Feature Importance":
        return plot_feature_importance(trained_models[model_name], X_train.columns, model_name)
    elif plot_type == "Threshold Analysis":
        return plot_threshold_analysis(y_test, y_proba, model_name)
    else:
        return None

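The if/elif chain in `get_selected_plot` could also be written as a dictionary dispatch, which scales better as visualizations are added. A minimal sketch of the pattern, with toy string handlers standing in for the plotting functions:

```python
# Dict-based dispatch: maps a dropdown label to a handler and returns
# None for unknown labels (mirroring the chain's final else branch).
handlers = {
    "Precision-Recall Curve": lambda: "pr-curve",
    "ROC Curve": lambda: "roc-curve",
    "Confusion Matrix": lambda: "confusion-matrix",
}

def dispatch(plot_type):
    handler = handlers.get(plot_type)
    return handler() if handler is not None else None

print(dispatch("ROC Curve"))     # roc-curve
print(dispatch("Unknown Plot"))  # None
```

In the app itself, the handlers would be `functools.partial` wrappers over the plotting functions; the chain above is kept explicit for readability.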
def get_comparison_results():
    """Generate comparison table and plot."""
    # Create comparison dataframe
    comparison_df = pd.DataFrame(all_metrics).T
    comparison_df = comparison_df.round(4)

    # Find best model for each metric
    best_models = comparison_df.idxmax()

    summary = "## Model Comparison Summary\n\n"
    summary += "| Metric | Best Model | Score |\n|--------|------------|-------|\n"
    for metric in comparison_df.columns:
        best = best_models[metric]
        score = comparison_df.loc[best, metric]
        summary += f"| {metric} | {best} | {score:.4f} |\n"

    return comparison_df.to_markdown(), summary, plot_model_comparison(all_metrics)

def predict_single_claim(model_name, threshold, *feature_values):
    """
    Make a prediction for a single claim using the selected model and threshold.
    """
    model = trained_models[model_name]

    # Create feature array
    features = np.array(feature_values).reshape(1, -1)

    # Get probability
    proba = model.predict_proba(features)[0, 1]

    # Apply threshold
    prediction = 1 if proba >= threshold else 0

    result = f"""
## Prediction Result

**Model:** {model_name}
**Threshold:** {threshold:.2f}

### Output
- **Fraud Probability:** {proba:.4f} ({proba*100:.2f}%)
- **Prediction:** {'🚨 FRAUDULENT' if prediction == 1 else '✅ LEGITIMATE'}

### Interpretation
"""
    if prediction == 1:
        result += "This claim has a high probability of being fraudulent and should be flagged for further investigation."
    else:
        result += "This claim appears to be legitimate based on the model's analysis."

    return result

# ============================================================================
# GRADIO UI LAYOUT
# ============================================================================

# Create the Gradio interface
with gr.Blocks(title="Auto Insurance Fraud Detection", theme=gr.themes.Soft()) as demo:

    gr.Markdown("""
    # 🚗 Auto Insurance Claims Fraud Detection

    This application demonstrates machine learning models for detecting fraudulent auto insurance claims.
    The models are trained on historical claims data and can predict whether a new claim is likely to be fraudulent.

    **Models Available:** XGBoost, LightGBM, Random Forest, Logistic Regression
    """)

    with gr.Tabs():
        # Tab 1: Data Overview
        with gr.TabItem("📊 Data Overview"):
            gr.Markdown(get_data_overview())
            with gr.Row():
                dist_plot = gr.Plot(value=plot_class_distribution(train_df, test_df),
                                    label="Class Distribution")

        # Tab 2: Model Evaluation
        with gr.TabItem("🎯 Model Evaluation"):
            with gr.Row():
                model_selector = gr.Dropdown(
                    choices=list(models.keys()),
                    value="XGBoost",
                    label="Select Model"
                )
                plot_selector = gr.Dropdown(
                    choices=["Precision-Recall Curve", "ROC Curve", "Confusion Matrix",
                             "Feature Importance", "Threshold Analysis"],
                    value="Precision-Recall Curve",
                    label="Select Visualization"
                )

            with gr.Row():
                with gr.Column(scale=1):
                    metrics_display = gr.Markdown()
                    report_display = gr.Markdown()
                with gr.Column(scale=2):
                    plot_display = gr.Plot()

            # Update displays when model or plot changes
            def update_all(model_name, plot_type):
                metrics, report = update_model_display(model_name)
                plot = get_selected_plot(model_name, plot_type)
                return metrics, report, plot

            model_selector.change(
                fn=update_all,
                inputs=[model_selector, plot_selector],
                outputs=[metrics_display, report_display, plot_display]
            )
            plot_selector.change(
                fn=update_all,
                inputs=[model_selector, plot_selector],
                outputs=[metrics_display, report_display, plot_display]
            )

            # Load initial values
            demo.load(
                fn=update_all,
                inputs=[model_selector, plot_selector],
                outputs=[metrics_display, report_display, plot_display]
            )

        # Tab 3: Model Comparison
        with gr.TabItem("📈 Compare Models"):
            gr.Markdown("## All Models Performance Comparison")

            comparison_table, comparison_summary, comparison_plot = get_comparison_results()

            gr.Markdown(comparison_summary)
            gr.Markdown(comparison_table)
            gr.Plot(value=comparison_plot, label="Model Comparison Chart")

        # Tab 4: Threshold Analysis
        with gr.TabItem("⚖️ Threshold Optimization"):
            gr.Markdown("""
            ## Finding the Optimal Classification Threshold

            In fraud detection, the default 0.5 threshold isn't always optimal.
            We need to balance:
            - **Precision:** Not flagging legitimate claims as fraud (customer experience)
            - **Recall:** Catching actual frauds (financial loss prevention)

            The optimal threshold maximizes the F1 score, which balances both concerns.
            """)

            thresh_model = gr.Dropdown(
                choices=list(models.keys()),
                value="XGBoost",
                label="Select Model for Threshold Analysis"
            )

            thresh_plot = gr.Plot()

            def update_threshold_plot(model_name):
                y_proba = all_probabilities[model_name]
                return plot_threshold_analysis(y_test, y_proba, model_name)

            thresh_model.change(
                fn=update_threshold_plot,
                inputs=[thresh_model],
                outputs=[thresh_plot]
            )

            demo.load(
                fn=update_threshold_plot,
                inputs=[thresh_model],
                outputs=[thresh_plot]
            )

            # Show optimal thresholds for all models
            thresh_summary = "### Optimal Thresholds by Model\n\n| Model | Optimal Threshold | F1 at Optimal |\n|-------|-------------------|---------------|\n"
            for name in models.keys():
                opt_thresh, opt_f1, _, _ = find_optimal_threshold(y_test, all_probabilities[name])
                thresh_summary += f"| {name} | {opt_thresh:.2f} | {opt_f1:.4f} |\n"

            gr.Markdown(thresh_summary)

        # Tab 5: About
        with gr.TabItem("ℹ️ About"):
            gr.Markdown("""
            ## About This Project

            ### Business Context
            Auto insurance fraud costs the industry billions of dollars annually.
            This project builds machine learning models to automatically flag potentially
            fraudulent claims for further investigation.

            ### Technical Approach
            1. **Data Preparation:** The dataset contains 46 features describing claims and customers
            2. **Class Imbalance:** Only ~3% of claims are fraudulent. We use SMOTE to balance the training data
            3. **Model Training:** Four different algorithms are compared
            4. **Evaluation:** Focus on Precision-Recall metrics due to class imbalance
            5. **Threshold Optimization:** Find the best cutoff for business needs

            ### Models Used
            - **XGBoost:** Gradient boosting with regularization, excellent for tabular data
            - **LightGBM:** Fast gradient boosting, memory efficient
            - **Random Forest:** Ensemble of decision trees, robust and interpretable
            - **Logistic Regression:** Linear baseline model, highly interpretable

            ### Key Metrics Explained
            - **Precision:** Of claims flagged as fraud, how many are actually fraudulent
            - **Recall:** Of actual frauds, how many did we catch
            - **F1 Score:** Harmonic mean of precision and recall
            - **ROC AUC:** Overall discrimination ability

            ### Why Precision-Recall over ROC?
            For highly imbalanced datasets like fraud detection, Precision-Recall curves
            give a more realistic picture of model performance than ROC curves.
            """)


# Launch the app
if __name__ == "__main__":
    demo.launch()
requirements.txt
ADDED

# Auto Insurance Fraud Detection - Dependencies
# For Hugging Face Spaces (CPU-only, Free Tier)

# Core ML Libraries
pandas==2.0.3
numpy==1.24.3
scikit-learn==1.3.0
xgboost==2.0.0
lightgbm==4.1.0
imbalanced-learn==0.11.0

# Visualization
matplotlib==3.7.2
seaborn==0.12.2

# Gradio for web interface
gradio==4.19.2

# Utilities
tabulate==0.9.0
test.csv
ADDED
The diff for this file is too large to render. See raw diff.

train.csv
ADDED
The diff for this file is too large to render. See raw diff.