Upload 5 files

- BEGINNER_GUIDE.md (+360 lines)
- app.py (+742 lines)
- requirements.txt (+20 lines)
- test.csv
- train.csv

BEGINNER_GUIDE.md (ADDED)
# 🚀 Complete Beginner's Guide: Deploying Auto Insurance Fraud Detection on Hugging Face

This guide walks you through every step of setting up and running the fraud detection project. No prior experience with Hugging Face is required.

---

## Table of Contents
1. [What You'll Need](#what-youll-need)
2. [Step 1: Create a Hugging Face Account](#step-1-create-a-hugging-face-account)
3. [Step 2: Create a New Space](#step-2-create-a-new-space)
4. [Step 3: Upload Your Files](#step-3-upload-your-files)
5. [Step 4: Wait for Build](#step-4-wait-for-build)
6. [Step 5: Use Your App](#step-5-use-your-app)
7. [Troubleshooting Common Issues](#troubleshooting-common-issues)
8. [Running Locally (Alternative)](#running-locally-alternative)
9. [Understanding the Output](#understanding-the-output)

---

## What You'll Need

Before starting, make sure you have these 5 files ready in a folder on your computer:

| File | Description | Size (approx) |
|------|-------------|---------------|
| `app.py` | The main Python application code | ~25 KB |
| `requirements.txt` | List of Python packages needed | ~300 bytes |
| `train.csv` | Training dataset | ~1.9 MB |
| `test.csv` | Test dataset | ~470 KB |
| `README.md` | Documentation for the Space | ~4 KB |

You should also have:
- `report.docx` - The APA report (keep this separate, not uploaded to Hugging Face)
- This guide (`BEGINNER_GUIDE.md`) for reference

---

## Step 1: Create a Hugging Face Account

1. **Go to Hugging Face**: Open your browser and visit [https://huggingface.co](https://huggingface.co)

2. **Click "Sign Up"**: Look for the button in the top-right corner

3. **Fill in your details**:
   - Username: Choose something memorable (e.g., `yourname_student`)
   - Email: Use your school or personal email
   - Password: Make it secure

4. **Verify your email**: Check your inbox and click the verification link

5. **Complete your profile** (optional but recommended)

---

## Step 2: Create a New Space

Hugging Face "Spaces" are where you host web applications. Here's how to create one:

1. **Go to Spaces**: After logging in, click on your profile picture → "New Space"

   Or directly visit: [https://huggingface.co/new-space](https://huggingface.co/new-space)

2. **Configure your Space**:

   | Setting | What to Enter |
   |---------|---------------|
   | **Space name** | `fraud-detection` (or any name you like) |
   | **License** | MIT (allows others to use your code) |
   | **SDK** | Select **Gradio** |
   | **SDK Version** | Leave as default (or select 4.19.2) |
   | **Hardware** | **CPU basic** (free tier) |
   | **Visibility** | Public (or Private if you prefer) |

3. **Click "Create Space"**

You now have an empty Space! It will show an error because there's no code yet—that's normal.

---

## Step 3: Upload Your Files

You have two options for uploading files:

### Option A: Web Interface (Easiest)

1. **Go to your Space**: Click on your Space name (e.g., `yourusername/fraud-detection`)

2. **Click "Files and versions"** tab

3. **Click "+ Add file"** → **"Upload files"**

4. **Upload these files one by one or all together**:
   - `app.py`
   - `requirements.txt`
   - `train.csv`
   - `test.csv`
   - `README.md`

5. **Commit the changes**: After each upload (or batch), you'll see a "Commit" button. Click it.

⚠️ **Important**: Upload the README.md file. The one you created should replace any default README.

### Option B: Git (For more advanced users)

If you're familiar with Git, you can clone and push:

```bash
# Clone your space
git clone https://huggingface.co/spaces/YOUR_USERNAME/fraud-detection
cd fraud-detection

# Copy your files into this folder
cp /path/to/your/files/* .

# Add, commit, and push
git add .
git commit -m "Initial upload of fraud detection app"
git push
```

---

## Step 4: Wait for Build

After uploading, Hugging Face automatically builds your app. Here's what happens:

1. **Building** (1-3 minutes): The status shows "Building"
   - It installs packages from `requirements.txt`
   - It prepares the environment

2. **Running** (3-5 minutes the first time): The status shows "Running"
   - Your `app.py` code executes
   - Models are trained
   - The interface loads

3. **App Ready**: You'll see your app interface!

### What the logs show

You can click "Logs" to see what's happening:

```
Loading data...
Applying SMOTE to handle class imbalance...
Training models (this may take a moment)...
Training XGBoost...
Training LightGBM...
Training Random Forest...
Training Logistic Regression...
Models trained successfully!
Running on local URL: http://0.0.0.0:7860
```

When you see that last line, your app is ready!

---

## Step 5: Use Your App

Once your app is running, you'll see an interactive interface with 5 tabs:

### Tab 1: 📊 Data Overview
- Shows dataset statistics
- Displays class distribution pie charts
- Explains the imbalance problem
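The imbalance this tab highlights is easy to check by hand. A minimal sketch (the 3% fraud rate and the counts below are illustrative, not taken from the actual dataset):

```python
# Hypothetical label column: 0 = legitimate, 1 = fraud (made-up 3% fraud rate).
labels = [0] * 970 + [1] * 30

fraud_rate = sum(labels) / len(labels)   # fraction of fraudulent claims
majority_rate = 1 - fraud_rate

# A "predict everything legitimate" baseline already scores 97% accuracy here,
# which is why accuracy alone is misleading on imbalanced data.
print(f"fraud rate: {fraud_rate:.1%}, naive accuracy: {majority_rate:.1%}")
```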
### Tab 2: 🔍 Model Evaluation
- **Select a model** from the dropdown (XGBoost, LightGBM, Random Forest, or Logistic Regression)
- **Select a visualization** to see:
  - Precision-Recall Curve
  - ROC Curve
  - Confusion Matrix
  - Feature Importance
  - Threshold Analysis
- View performance metrics and classification report

### Tab 3: 📈 Compare Models
- Side-by-side comparison of all 4 models
- Bar chart showing metrics
- Table with best model for each metric

### Tab 4: ⚖️ Threshold Optimization
- Interactive plot showing precision/recall trade-off
- Table of optimal thresholds for each model
- Explains why 0.5 isn't always the best threshold
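The threshold sweep behind this tab can be sketched in a few lines. This is a toy example with made-up probabilities, not the app's actual code:

```python
# Hypothetical model outputs: true labels and predicted fraud probabilities.
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
y_proba = [0.05, 0.20, 0.35, 0.40, 0.55, 0.70, 0.30, 0.90]

def f1_at(threshold):
    """F1 score when claims with probability >= threshold are flagged as fraud."""
    y_pred = [1 if p >= threshold else 0 for p in y_proba]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Sweep thresholds from 0.10 to 0.89 and keep the one with the best F1.
best_threshold = max((t / 100 for t in range(10, 90)), key=f1_at)
```

On this toy data the best threshold lands well below 0.5, which is exactly the point the tab makes for imbalanced fraud data.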
### Tab 5: ℹ️ About
- Project documentation
- Technical details
- Metrics explanations

---

## Troubleshooting Common Issues

### Issue 1: "Application Error" or Build Fails

**Possible causes and solutions**:

| Problem | Solution |
|---------|----------|
| Missing file | Check all 5 files are uploaded |
| Wrong filename | Files must be exactly: `app.py`, `requirements.txt`, `train.csv`, `test.csv`, `README.md` |
| Corrupted CSV | Re-download the CSV files and upload again |
| Package conflict | Check the logs for specific error messages |

### Issue 2: "Out of Memory" Error

The free tier has limited memory. This shouldn't happen with our code, but if it does:
- Reduce `n_estimators` in the models (e.g., from 100 to 50)
- The app automatically uses efficient settings for the free tier

### Issue 3: App Takes Too Long to Load

Normal behavior! The first load takes 3-5 minutes because:
- It needs to install packages
- It trains 4 machine learning models
- Subsequent visits are faster (cache helps)

### Issue 4: Graphs Don't Update

- Try clicking the dropdown again
- Refresh the page (Ctrl+R or Cmd+R)
- Wait a few seconds—processing takes time

### Issue 5: "No such file: train.csv"

The CSV files weren't uploaded correctly:
1. Go to "Files and versions"
2. Verify `train.csv` and `test.csv` are listed
3. If not, upload them again
4. Make sure filenames are lowercase

---

## Running Locally (Alternative)

If you prefer to run on your own computer instead of Hugging Face:

### Step 1: Install Python
Make sure you have Python 3.8+ installed. Check with:
```bash
python --version
```

### Step 2: Create a Project Folder
```bash
mkdir fraud_detection
cd fraud_detection
```

### Step 3: Copy All Files
Place these files in your folder:
- `app.py`
- `requirements.txt`
- `train.csv`
- `test.csv`

### Step 4: Create a Virtual Environment (Recommended)
```bash
# Create virtual environment
python -m venv venv

# Activate it
# On Windows:
venv\Scripts\activate
# On Mac/Linux:
source venv/bin/activate
```

### Step 5: Install Dependencies
```bash
pip install -r requirements.txt
```

This may take 2-5 minutes to download and install all packages.

### Step 6: Run the App
```bash
python app.py
```

### Step 7: Open in Browser
You'll see output like:
```
Running on local URL: http://127.0.0.1:7860
```
Open that URL in your browser.

---

## Understanding the Output

### What the Metrics Mean

| Metric | What It Tells You | Good Value |
|--------|------------------|------------|
| **Accuracy** | Overall correct predictions | >95% (but misleading for imbalanced data) |
| **Precision** | When we say "fraud", how often are we right? | >50% is good |
| **Recall** | Of all actual frauds, how many did we catch? | >70% is good |
| **F1 Score** | Balance of precision and recall | >0.5 is decent, >0.6 is good |
| **ROC AUC** | Overall discrimination ability | >0.9 is excellent |
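To make the table concrete, here is how precision, recall, and F1 fall out of raw counts. The ten labels below are made up purely for illustration:

```python
# Hypothetical results on ten claims: 1 = fraud, 0 = legitimate.
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # caught frauds
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # false alarms
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # missed frauds

precision = tp / (tp + fp)   # when we say "fraud", how often are we right?
recall = tp / (tp + fn)      # of all actual frauds, how many did we catch?
f1 = 2 * precision * recall / (precision + recall)
```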
### Reading the Confusion Matrix

```
                  Predicted
               Legit    Fraud
Actual Legit   [TN]     [FP]
       Fraud   [FN]     [TP]
```

- **TN (True Negative)**: Correctly identified legitimate claims ✓
- **FP (False Positive)**: Legitimate claims wrongly flagged as fraud ✗
- **FN (False Negative)**: Frauds we missed ✗
- **TP (True Positive)**: Correctly caught frauds ✓
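The four counts can be tallied directly from paired labels. A small sketch with made-up data:

```python
# Hypothetical labels: 0 = legitimate, 1 = fraud.
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# cm[actual][predicted]: rows = true class, columns = predicted class.
cm = [[0, 0], [0, 0]]
for t, p in zip(y_true, y_pred):
    cm[t][p] += 1

tn, fp = cm[0]   # legitimate claims: correctly passed / wrongly flagged
fn, tp = cm[1]   # fraudulent claims: missed / correctly caught
```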
### Interpreting Feature Importance

The feature importance plot shows which variables most influence the model's decisions:

- **High importance features** = strong predictors of fraud
- For example, `total_claim_amount` being important means higher claims correlate with fraud
- Use this to understand what patterns the model learned
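Ranking importances is just a sort over (feature, score) pairs. The scores below are invented for illustration, not taken from the trained models:

```python
# Hypothetical importance scores, as a fitted tree model might report them.
importances = {
    "total_claim_amount": 0.31,
    "incident_severity": 0.24,
    "policy_annual_premium": 0.08,
    "age": 0.05,
}

# Sort descending so the strongest predictors come first.
ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)
top_feature = ranked[0][0]
```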
---

## File Checklist

Before deploying, verify you have all files:

- [ ] `app.py` - Main application (~650 lines of code)
- [ ] `requirements.txt` - Package list (11 packages)
- [ ] `train.csv` - 16,001 rows including header
- [ ] `test.csv` - 4,001 rows including header
- [ ] `README.md` - Documentation for the Space

Keep separately (don't upload to Hugging Face):
- [ ] `report.docx` - Your APA report for submission
- [ ] `BEGINNER_GUIDE.md` - This guide

---

## Tips for Success

1. **Be patient**: First build takes a few minutes
2. **Check logs**: They tell you exactly what's happening
3. **Verify uploads**: Make sure file sizes match expectations
4. **Use the right SDK**: Must be Gradio, not Streamlit
5. **Test locally first**: If something doesn't work, debug locally before deploying

---

## Need Help?

- **Hugging Face Documentation**: [https://huggingface.co/docs/hub/spaces](https://huggingface.co/docs/hub/spaces)
- **Gradio Guides**: [https://gradio.app/guides](https://gradio.app/guides)
- **Common Issues**: Check the "Troubleshooting" section above

Good luck with your fraud detection project! 🎉

app.py (ADDED, +742 lines)
| 1 |
+
"""
|
| 2 |
+
Auto Insurance Claims Fraud Detection
|
| 3 |
+
=====================================
|
| 4 |
+
A machine learning application that trains and compares 4 different models
|
| 5 |
+
for detecting fraudulent insurance claims.
|
| 6 |
+
|
| 7 |
+
Models: XGBoost, LightGBM, Random Forest, Logistic Regression
|
| 8 |
+
Author: Data Science Project
|
| 9 |
+
"""
|
| 10 |
+
|
| 11 |
+
import gradio as gr
|
| 12 |
+
import pandas as pd
|
| 13 |
+
import numpy as np
|
| 14 |
+
import matplotlib.pyplot as plt
|
| 15 |
+
import seaborn as sns
|
| 16 |
+
from io import BytesIO
|
| 17 |
+
import base64
|
| 18 |
+
import warnings
|
| 19 |
+
warnings.filterwarnings('ignore')
|
| 20 |
+
|
| 21 |
+
# ML Libraries
|
| 22 |
+
from sklearn.model_selection import cross_val_score
|
| 23 |
+
from sklearn.metrics import (
|
| 24 |
+
precision_recall_curve, roc_curve, auc,
|
| 25 |
+
confusion_matrix, classification_report,
|
| 26 |
+
f1_score, precision_score, recall_score, accuracy_score
|
| 27 |
+
)
|
| 28 |
+
from sklearn.linear_model import LogisticRegression
|
| 29 |
+
from sklearn.ensemble import RandomForestClassifier
|
| 30 |
+
from xgboost import XGBClassifier
|
| 31 |
+
from lightgbm import LGBMClassifier
|
| 32 |
+
from imblearn.over_sampling import SMOTE
|
| 33 |
+
|
| 34 |
+
# Set style for all plots - using try/except for compatibility
|
| 35 |
+
try:
|
| 36 |
+
plt.style.use('seaborn-v0_8-whitegrid')
|
| 37 |
+
except:
|
| 38 |
+
try:
|
| 39 |
+
plt.style.use('seaborn-whitegrid')
|
| 40 |
+
except:
|
| 41 |
+
plt.style.use('ggplot') # Fallback style
|
| 42 |
+
sns.set_palette("husl")
|
| 43 |
+
|
| 44 |
+
# ============================================================================
|
| 45 |
+
# DATA LOADING AND PREPROCESSING
|
| 46 |
+
# ============================================================================
|
| 47 |
+
|
| 48 |
+
def load_and_prepare_data():
|
| 49 |
+
"""
|
| 50 |
+
Load the train and test datasets.
|
| 51 |
+
The data is already preprocessed and one-hot encoded.
|
| 52 |
+
"""
|
| 53 |
+
# Load datasets
|
| 54 |
+
train_df = pd.read_csv('train.csv')
|
| 55 |
+
test_df = pd.read_csv('test.csv')
|
| 56 |
+
|
| 57 |
+
# Separate features and target
|
| 58 |
+
# 'fraud' is our target variable (0 = legitimate, 1 = fraudulent)
|
| 59 |
+
X_train = train_df.drop('fraud', axis=1)
|
| 60 |
+
y_train = train_df['fraud']
|
| 61 |
+
X_test = test_df.drop('fraud', axis=1)
|
| 62 |
+
y_test = test_df['fraud']
|
| 63 |
+
|
| 64 |
+
return X_train, X_test, y_train, y_test, train_df, test_df
|
| 65 |
+
|
| 66 |
+
|
| 67 |
+
def apply_smote(X_train, y_train):
|
| 68 |
+
"""
|
| 69 |
+
Apply SMOTE to handle class imbalance.
|
| 70 |
+
Fraud cases are rare (~3%), so we oversample the minority class.
|
| 71 |
+
"""
|
| 72 |
+
smote = SMOTE(random_state=42)
|
| 73 |
+
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
|
| 74 |
+
return X_resampled, y_resampled
|
| 75 |
+
|
| 76 |
+
|
| 77 |
+
# ============================================================================
|
| 78 |
+
# MODEL DEFINITIONS
|
| 79 |
+
# ============================================================================
|
| 80 |
+
|
| 81 |
+
def get_models():
|
| 82 |
+
"""
|
| 83 |
+
Define the 4 models we'll compare.
|
| 84 |
+
Each model is tuned for imbalanced fraud detection.
|
| 85 |
+
"""
|
| 86 |
+
models = {
|
| 87 |
+
'XGBoost': XGBClassifier(
|
| 88 |
+
n_estimators=100,
|
| 89 |
+
max_depth=4,
|
| 90 |
+
learning_rate=0.1,
|
| 91 |
+
scale_pos_weight=10, # Helps with imbalanced data
|
| 92 |
+
random_state=42,
|
| 93 |
+
use_label_encoder=False,
|
| 94 |
+
eval_metric='logloss'
|
| 95 |
+
),
|
| 96 |
+
'LightGBM': LGBMClassifier(
|
| 97 |
+
n_estimators=100,
|
| 98 |
+
max_depth=4,
|
| 99 |
+
learning_rate=0.1,
|
| 100 |
+
class_weight='balanced', # Handles imbalance internally
|
| 101 |
+
random_state=42,
|
| 102 |
+
verbose=-1
|
| 103 |
+
),
|
| 104 |
+
'Random Forest': RandomForestClassifier(
|
| 105 |
+
n_estimators=100,
|
| 106 |
+
max_depth=6,
|
| 107 |
+
class_weight='balanced',
|
| 108 |
+
random_state=42,
|
| 109 |
+
n_jobs=-1
|
| 110 |
+
),
|
| 111 |
+
'Logistic Regression': LogisticRegression(
|
| 112 |
+
class_weight='balanced',
|
| 113 |
+
max_iter=1000,
|
| 114 |
+
random_state=42
|
| 115 |
+
)
|
| 116 |
+
}
|
| 117 |
+
return models
|
| 118 |
+
|
| 119 |
+
|
| 120 |
+
# ============================================================================
|
| 121 |
+
# MODEL TRAINING AND EVALUATION
|
| 122 |
+
# ============================================================================
|
| 123 |
+
|
| 124 |
+
def train_model(model, X_train, y_train):
|
| 125 |
+
"""Train a single model and return the fitted model."""
|
| 126 |
+
model.fit(X_train, y_train)
|
| 127 |
+
return model
|
| 128 |
+
|
| 129 |
+
|
| 130 |
+
def evaluate_model(model, X_test, y_test):
|
| 131 |
+
"""
|
| 132 |
+
Get predictions and probabilities from a trained model.
|
| 133 |
+
Returns both hard predictions and probability scores.
|
| 134 |
+
"""
|
| 135 |
+
y_pred = model.predict(X_test)
|
| 136 |
+
y_proba = model.predict_proba(X_test)[:, 1] # Probability of fraud
|
| 137 |
+
return y_pred, y_proba
|
| 138 |
+
|
| 139 |
+
|
| 140 |
+
def get_metrics(y_test, y_pred, y_proba):
|
| 141 |
+
"""
|
| 142 |
+
Calculate all relevant metrics for fraud detection.
|
| 143 |
+
For imbalanced data, we focus on Precision, Recall, and F1.
|
| 144 |
+
"""
|
| 145 |
+
metrics = {
|
| 146 |
+
'Accuracy': accuracy_score(y_test, y_pred),
|
| 147 |
+
'Precision': precision_score(y_test, y_pred, zero_division=0),
|
| 148 |
+
'Recall': recall_score(y_test, y_pred, zero_division=0),
|
| 149 |
+
'F1 Score': f1_score(y_test, y_pred, zero_division=0),
|
| 150 |
+
'ROC AUC': auc(*roc_curve(y_test, y_proba)[:2])
|
| 151 |
+
}
|
| 152 |
+
return metrics
|
| 153 |
+
|
| 154 |
+
|
| 155 |
+
def find_optimal_threshold(y_test, y_proba):
|
| 156 |
+
"""
|
| 157 |
+
Find the optimal classification threshold using F1 score.
|
| 158 |
+
Default threshold is 0.5, but for imbalanced data,
|
| 159 |
+
a different threshold often works better.
|
| 160 |
+
"""
|
| 161 |
+
thresholds = np.arange(0.1, 0.9, 0.01)
|
| 162 |
+
f1_scores = []
|
| 163 |
+
|
| 164 |
+
for thresh in thresholds:
|
| 165 |
+
y_pred_thresh = (y_proba >= thresh).astype(int)
|
| 166 |
+
f1 = f1_score(y_test, y_pred_thresh, zero_division=0)
|
| 167 |
+
f1_scores.append(f1)
|
| 168 |
+
|
| 169 |
+
# Find threshold with best F1 score
|
| 170 |
+
best_idx = np.argmax(f1_scores)
|
| 171 |
+
best_threshold = thresholds[best_idx]
|
| 172 |
+
best_f1 = f1_scores[best_idx]
|
| 173 |
+
|
| 174 |
+
return best_threshold, best_f1, thresholds, f1_scores
|
| 175 |
+
|
| 176 |
+
|
| 177 |
+
# ============================================================================
|
| 178 |
+
# VISUALIZATION FUNCTIONS
|
| 179 |
+
# ============================================================================
|
| 180 |
+
|
| 181 |
+
def plot_precision_recall_curve(y_test, y_proba, model_name):
|
| 182 |
+
"""
|
| 183 |
+
Plot Precision-Recall curve.
|
| 184 |
+
This is the most important metric for fraud detection because
|
| 185 |
+
we care about catching frauds (recall) without too many false alarms (precision).
|
| 186 |
+
"""
|
| 187 |
+
precision, recall, thresholds = precision_recall_curve(y_test, y_proba)
|
| 188 |
+
pr_auc = auc(recall, precision)
|
| 189 |
+
|
| 190 |
+
fig, ax = plt.subplots(figsize=(8, 6))
|
| 191 |
+
ax.plot(recall, precision, 'b-', linewidth=2, label=f'{model_name} (AUC = {pr_auc:.3f})')
|
| 192 |
+
ax.fill_between(recall, precision, alpha=0.2)
|
| 193 |
+
|
| 194 |
+
# Add baseline (random classifier)
|
| 195 |
+
baseline = y_test.mean()
|
| 196 |
+
ax.axhline(y=baseline, color='r', linestyle='--', label=f'Baseline = {baseline:.3f}')
|
| 197 |
+
|
| 198 |
+
ax.set_xlabel('Recall (Fraud Detection Rate)', fontsize=12)
|
| 199 |
+
ax.set_ylabel('Precision (True Fraud Rate)', fontsize=12)
|
| 200 |
+
ax.set_title(f'Precision-Recall Curve: {model_name}', fontsize=14, fontweight='bold')
|
| 201 |
+
ax.legend(loc='best')
|
| 202 |
+
ax.set_xlim([0, 1])
|
| 203 |
+
ax.set_ylim([0, 1])
|
| 204 |
+
ax.grid(True, alpha=0.3)
|
| 205 |
+
|
| 206 |
+
plt.tight_layout()
|
| 207 |
+
return fig
|
| 208 |
+
|
| 209 |
+
|
| 210 |
+
def plot_roc_curve(y_test, y_proba, model_name):
|
| 211 |
+
"""
|
| 212 |
+
Plot ROC curve showing true positive rate vs false positive rate.
|
| 213 |
+
AUC closer to 1 means better discrimination between fraud and legitimate claims.
|
| 214 |
+
"""
|
| 215 |
+
fpr, tpr, thresholds = roc_curve(y_test, y_proba)
|
| 216 |
+
roc_auc = auc(fpr, tpr)
|
| 217 |
+
|
| 218 |
+
fig, ax = plt.subplots(figsize=(8, 6))
|
| 219 |
+
ax.plot(fpr, tpr, 'b-', linewidth=2, label=f'{model_name} (AUC = {roc_auc:.3f})')
|
| 220 |
+
ax.fill_between(fpr, tpr, alpha=0.2)
|
| 221 |
+
|
| 222 |
+
# Random classifier line
|
| 223 |
+
ax.plot([0, 1], [0, 1], 'r--', label='Random Classifier')
|
| 224 |
+
|
| 225 |
+
ax.set_xlabel('False Positive Rate', fontsize=12)
|
| 226 |
+
ax.set_ylabel('True Positive Rate (Recall)', fontsize=12)
|
| 227 |
+
ax.set_title(f'ROC Curve: {model_name}', fontsize=14, fontweight='bold')
|
| 228 |
+
ax.legend(loc='lower right')
|
| 229 |
+
ax.set_xlim([0, 1])
|
| 230 |
+
ax.set_ylim([0, 1])
|
| 231 |
+
ax.grid(True, alpha=0.3)
|
| 232 |
+
|
| 233 |
+
plt.tight_layout()
|
| 234 |
+
return fig
|
| 235 |
+
|
| 236 |
+
|
| 237 |
+
def plot_confusion_matrix(y_test, y_pred, model_name):
|
| 238 |
+
"""
|
| 239 |
+
Plot confusion matrix as a heatmap.
|
| 240 |
+
Shows: True Negatives, False Positives, False Negatives, True Positives.
|
| 241 |
+
"""
|
| 242 |
+
cm = confusion_matrix(y_test, y_pred)
|
| 243 |
+
|
| 244 |
+
fig, ax = plt.subplots(figsize=(8, 6))
|
| 245 |
+
|
| 246 |
+
# Create heatmap with custom colors
|
| 247 |
+
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=ax,
|
| 248 |
+
xticklabels=['Legitimate', 'Fraud'],
|
| 249 |
+
yticklabels=['Legitimate', 'Fraud'],
|
| 250 |
+
annot_kws={'size': 16})
|
| 251 |
+
|
| 252 |
+
ax.set_xlabel('Predicted Label', fontsize=12)
|
| 253 |
+
ax.set_ylabel('True Label', fontsize=12)
|
| 254 |
+
ax.set_title(f'Confusion Matrix: {model_name}', fontsize=14, fontweight='bold')
|
| 255 |
+
|
| 256 |
+
# Add text annotations explaining the quadrants
|
| 257 |
+
total = cm.sum()
|
| 258 |
+
tn, fp, fn, tp = cm.ravel()
|
| 259 |
+
|
| 260 |
+
text = f"TN: {tn} | FP: {fp}\nFN: {fn} | TP: {tp}"
|
| 261 |
+
ax.text(1.4, 0.5, text, transform=ax.transAxes, fontsize=10,
|
| 262 |
+
verticalalignment='center', bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))
|
| 263 |
+
|
| 264 |
+
plt.tight_layout()
|
| 265 |
+
return fig
|
| 266 |
+
|
| 267 |
+
|
| 268 |
+
def plot_feature_importance(model, feature_names, model_name):
    """
    Plot top 15 most important features.
    Different models calculate importance differently:
    - Tree models: based on split gain
    - Logistic Regression: based on coefficient magnitude
    """
    fig, ax = plt.subplots(figsize=(10, 8))

    # Get feature importances based on model type
    if hasattr(model, 'feature_importances_'):
        importances = model.feature_importances_
    elif hasattr(model, 'coef_'):
        importances = np.abs(model.coef_[0])
    else:
        # Fallback: return empty plot
        ax.text(0.5, 0.5, 'Feature importance not available',
                ha='center', va='center', fontsize=14)
        return fig

    # Create dataframe and sort by importance
    importance_df = pd.DataFrame({
        'Feature': feature_names,
        'Importance': importances
    }).sort_values('Importance', ascending=True).tail(15)

    # Horizontal bar chart
    colors = plt.cm.Blues(np.linspace(0.4, 0.8, len(importance_df)))
    ax.barh(importance_df['Feature'], importance_df['Importance'], color=colors)

    ax.set_xlabel('Importance Score', fontsize=12)
    ax.set_title(f'Top 15 Feature Importances: {model_name}', fontsize=14, fontweight='bold')
    ax.grid(True, alpha=0.3, axis='x')

    plt.tight_layout()
    return fig

def plot_threshold_analysis(y_test, y_proba, model_name):
    """
    Plot how different thresholds affect precision, recall, and F1.
    Helps visualize the trade-off and find the optimal threshold.
    """
    thresholds = np.arange(0.05, 0.95, 0.01)
    precisions = []
    recalls = []
    f1_scores = []

    for thresh in thresholds:
        y_pred_thresh = (y_proba >= thresh).astype(int)
        precisions.append(precision_score(y_test, y_pred_thresh, zero_division=0))
        recalls.append(recall_score(y_test, y_pred_thresh, zero_division=0))
        f1_scores.append(f1_score(y_test, y_pred_thresh, zero_division=0))

    # Find optimal threshold
    best_idx = np.argmax(f1_scores)
    best_threshold = thresholds[best_idx]

    fig, ax = plt.subplots(figsize=(10, 6))

    ax.plot(thresholds, precisions, 'b-', linewidth=2, label='Precision')
    ax.plot(thresholds, recalls, 'g-', linewidth=2, label='Recall')
    ax.plot(thresholds, f1_scores, 'r-', linewidth=2, label='F1 Score')

    # Mark optimal threshold
    ax.axvline(x=best_threshold, color='purple', linestyle='--',
               label=f'Optimal Threshold = {best_threshold:.2f}')
    ax.axvline(x=0.5, color='gray', linestyle=':', alpha=0.7, label='Default (0.5)')

    ax.set_xlabel('Classification Threshold', fontsize=12)
    ax.set_ylabel('Score', fontsize=12)
    ax.set_title(f'Threshold Analysis: {model_name}', fontsize=14, fontweight='bold')
    ax.legend(loc='best')
    ax.set_xlim([0, 1])
    ax.set_ylim([0, 1])
    ax.grid(True, alpha=0.3)

    plt.tight_layout()
    return fig

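The sweep in `plot_threshold_analysis` can be illustrated in isolation. A minimal pure-Python sketch on tiny hand-made labels and scores (illustrative data only, not the app's dataset), computing F1 at each candidate threshold and keeping the first maximizer, as `np.argmax` does:

```python
# Pure-Python sketch of the threshold sweep (toy data, not the app's).
def f1_at(y_true, y_proba, thresh):
    y_pred = [1 if p >= thresh else 0 for p in y_proba]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
y_proba = [0.1, 0.2, 0.15, 0.3, 0.45, 0.55, 0.6, 0.8, 0.35, 0.05]

thresholds = [i / 100 for i in range(5, 95)]
scores = [(f1_at(y_true, y_proba, t), t) for t in thresholds]
best_f1, best_thresh = max(scores, key=lambda s: s[0])  # first max, like argmax
print(f"best threshold: {best_thresh:.2f}, F1: {best_f1:.3f}")
# best threshold: 0.56, F1: 0.800
```

Here the positive with score 0.35 cannot be separated from the 0.45 and 0.55 negatives, so the sweep settles on a threshold that sacrifices that one fraud for perfect precision — the same kind of trade-off the app's curves visualize.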
def plot_class_distribution(train_df, test_df):
    """
    Plot the class distribution showing imbalance.
    Fraud is rare (~3%), which is typical in real-world scenarios.
    """
    fig, axes = plt.subplots(1, 2, figsize=(12, 5))

    # Training data distribution
    train_counts = train_df['fraud'].value_counts()
    colors = ['#2ecc71', '#e74c3c']
    axes[0].pie(train_counts, labels=['Legitimate', 'Fraud'], autopct='%1.1f%%',
                colors=colors, explode=(0, 0.1), shadow=True, startangle=90)
    axes[0].set_title('Training Data Distribution', fontsize=14, fontweight='bold')

    # Test data distribution
    test_counts = test_df['fraud'].value_counts()
    axes[1].pie(test_counts, labels=['Legitimate', 'Fraud'], autopct='%1.1f%%',
                colors=colors, explode=(0, 0.1), shadow=True, startangle=90)
    axes[1].set_title('Test Data Distribution', fontsize=14, fontweight='bold')

    plt.suptitle('Class Imbalance in Fraud Detection Dataset', fontsize=16, fontweight='bold', y=1.02)
    plt.tight_layout()
    return fig

def plot_model_comparison(all_metrics):
    """
    Bar chart comparing all 4 models across different metrics.
    """
    fig, ax = plt.subplots(figsize=(12, 6))

    models = list(all_metrics.keys())
    metrics = ['Accuracy', 'Precision', 'Recall', 'F1 Score', 'ROC AUC']

    x = np.arange(len(metrics))
    width = 0.2

    colors = ['#3498db', '#2ecc71', '#e74c3c', '#9b59b6']

    for i, model in enumerate(models):
        values = [all_metrics[model][m] for m in metrics]
        ax.bar(x + i * width, values, width, label=model, color=colors[i])

    ax.set_ylabel('Score', fontsize=12)
    ax.set_title('Model Performance Comparison', fontsize=14, fontweight='bold')
    ax.set_xticks(x + width * 1.5)
    ax.set_xticklabels(metrics)
    ax.legend(loc='upper left', bbox_to_anchor=(1, 1))
    ax.set_ylim([0, 1])
    ax.grid(True, alpha=0.3, axis='y')

    # Add value labels on bars
    for i, model in enumerate(models):
        values = [all_metrics[model][m] for m in metrics]
        for j, v in enumerate(values):
            ax.text(x[j] + i * width, v + 0.02, f'{v:.2f}', ha='center', va='bottom', fontsize=8)

    plt.tight_layout()
    return fig

# ============================================================================
# GLOBAL VARIABLES (loaded once at startup)
# ============================================================================

print("Loading data...")
X_train, X_test, y_train, y_test, train_df, test_df = load_and_prepare_data()

print("Applying SMOTE to handle class imbalance...")
X_train_balanced, y_train_balanced = apply_smote(X_train, y_train)

print("Training models (this may take a moment)...")
models = get_models()
trained_models = {}
all_metrics = {}
all_predictions = {}
all_probabilities = {}

for name, model in models.items():
    print(f"  Training {name}...")
    trained_models[name] = train_model(model, X_train_balanced, y_train_balanced)
    y_pred, y_proba = evaluate_model(trained_models[name], X_test, y_test)
    all_predictions[name] = y_pred
    all_probabilities[name] = y_proba
    all_metrics[name] = get_metrics(y_test, y_pred, y_proba)

print("Models trained successfully!")

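The `apply_smote` call above wraps `imbalanced-learn`'s SMOTE (defined earlier in the file). Conceptually, SMOTE synthesizes new minority-class samples by interpolating between an existing minority point and one of its same-class nearest neighbors. A toy pure-Python sketch of that interpolation idea (not the library's implementation — it picks a random same-class partner instead of a k-nearest neighbor to keep the sketch short):

```python
import random

random.seed(42)

# Toy minority-class (fraud) points in a 2-D feature space.
minority = [(1.0, 1.0), (2.0, 1.5), (1.5, 2.0)]

def smote_like_sample(points):
    """Interpolate between two minority points -- the core SMOTE idea.
    (Real SMOTE picks the partner among the k nearest same-class
    neighbors; random choice keeps this illustration minimal.)"""
    a = random.choice(points)
    b = random.choice(points)
    gap = random.random()  # position along the segment from a to b
    return tuple(ai + gap * (bi - ai) for ai, bi in zip(a, b))

synthetic = [smote_like_sample(minority) for _ in range(5)]
print(synthetic)
```

Because each synthetic point is a convex combination of existing minority points, it always lies inside the minority class's convex hull — new samples resemble real frauds rather than random noise.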
# ============================================================================
# GRADIO INTERFACE FUNCTIONS
# ============================================================================

def get_data_overview():
    """Return a summary of the dataset."""
    summary = f"""
## Dataset Overview

### Training Data
- **Total Samples:** {len(train_df):,}
- **Fraud Cases:** {train_df['fraud'].sum():,} ({train_df['fraud'].mean()*100:.2f}%)
- **Legitimate Cases:** {(train_df['fraud']==0).sum():,} ({(1-train_df['fraud'].mean())*100:.2f}%)

### Test Data
- **Total Samples:** {len(test_df):,}
- **Fraud Cases:** {test_df['fraud'].sum():,} ({test_df['fraud'].mean()*100:.2f}%)
- **Legitimate Cases:** {(test_df['fraud']==0).sum():,} ({(1-test_df['fraud'].mean())*100:.2f}%)

### Features
- **Number of Features:** {X_train.shape[1]}
- **Feature Types:** All numeric (pre-processed and one-hot encoded)

### Class Imbalance Handling
- Applied **SMOTE** (Synthetic Minority Over-sampling Technique)
- Training samples after SMOTE: {len(X_train_balanced):,}
"""
    return summary

def update_model_display(model_name):
    """
    Update all displays when a model is selected.
    Returns metrics, classification report, and optimal threshold info.
    """
    metrics = all_metrics[model_name]
    y_pred = all_predictions[model_name]
    y_proba = all_probabilities[model_name]

    # Get optimal threshold
    best_thresh, best_f1, _, _ = find_optimal_threshold(y_test, y_proba)

    # Create metrics display
    metrics_text = f"""
## {model_name} Performance Metrics

| Metric | Score |
|--------|-------|
| **Accuracy** | {metrics['Accuracy']:.4f} |
| **Precision** | {metrics['Precision']:.4f} |
| **Recall** | {metrics['Recall']:.4f} |
| **F1 Score** | {metrics['F1 Score']:.4f} |
| **ROC AUC** | {metrics['ROC AUC']:.4f} |

### Threshold Optimization
- **Default Threshold:** 0.50
- **Optimal Threshold:** {best_thresh:.2f}
- **F1 at Optimal:** {best_f1:.4f}
"""

    # Classification report
    report = classification_report(y_test, y_pred, target_names=['Legitimate', 'Fraud'])
    report_text = f"```\n{report}\n```"

    return metrics_text, report_text

def get_selected_plot(model_name, plot_type):
    """
    Generate the selected plot for the chosen model.
    """
    y_proba = all_probabilities[model_name]
    y_pred = all_predictions[model_name]

    if plot_type == "Precision-Recall Curve":
        return plot_precision_recall_curve(y_test, y_proba, model_name)
    elif plot_type == "ROC Curve":
        return plot_roc_curve(y_test, y_proba, model_name)
    elif plot_type == "Confusion Matrix":
        return plot_confusion_matrix(y_test, y_pred, model_name)
    elif plot_type == "Feature Importance":
        return plot_feature_importance(trained_models[model_name], X_train.columns, model_name)
    elif plot_type == "Threshold Analysis":
        return plot_threshold_analysis(y_test, y_proba, model_name)
    else:
        return None

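The if/elif chain in `get_selected_plot` could also be written as a dictionary dispatch, which scales better as visualizations are added. A minimal sketch of the pattern, with toy string handlers standing in for the plotting functions:

```python
# Dict-based dispatch: maps a dropdown label to a handler and returns
# None for unknown labels (mirroring the chain's final else branch).
handlers = {
    "Precision-Recall Curve": lambda: "pr-curve",
    "ROC Curve": lambda: "roc-curve",
    "Confusion Matrix": lambda: "confusion-matrix",
}

def dispatch(plot_type):
    handler = handlers.get(plot_type)
    return handler() if handler is not None else None

print(dispatch("ROC Curve"))     # roc-curve
print(dispatch("Unknown Plot"))  # None
```

In the app itself, the handlers would be `functools.partial` wrappers over the plotting functions; the chain above is kept explicit for readability.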
def get_comparison_results():
    """Generate comparison table and plot."""
    # Create comparison dataframe
    comparison_df = pd.DataFrame(all_metrics).T
    comparison_df = comparison_df.round(4)

    # Find best model for each metric
    best_models = comparison_df.idxmax()

    summary = "## Model Comparison Summary\n\n"
    summary += "| Metric | Best Model | Score |\n|--------|------------|-------|\n"
    for metric in comparison_df.columns:
        best = best_models[metric]
        score = comparison_df.loc[best, metric]
        summary += f"| {metric} | {best} | {score:.4f} |\n"

    return comparison_df.to_markdown(), summary, plot_model_comparison(all_metrics)

def predict_single_claim(model_name, threshold, *feature_values):
    """
    Make a prediction for a single claim using the selected model and threshold.
    """
    model = trained_models[model_name]

    # Create feature array
    features = np.array(feature_values).reshape(1, -1)

    # Get probability
    proba = model.predict_proba(features)[0, 1]

    # Apply threshold
    prediction = 1 if proba >= threshold else 0

    result = f"""
## Prediction Result

**Model:** {model_name}
**Threshold:** {threshold:.2f}

### Output
- **Fraud Probability:** {proba:.4f} ({proba*100:.2f}%)
- **Prediction:** {'🚨 FRAUDULENT' if prediction == 1 else '✅ LEGITIMATE'}

### Interpretation
"""
    if prediction == 1:
        result += "This claim has a high probability of being fraudulent and should be flagged for further investigation."
    else:
        result += "This claim appears to be legitimate based on the model's analysis."

    return result

# ============================================================================
# GRADIO UI LAYOUT
# ============================================================================

# Create the Gradio interface
with gr.Blocks(title="Auto Insurance Fraud Detection", theme=gr.themes.Soft()) as demo:

    gr.Markdown("""
    # 🚗 Auto Insurance Claims Fraud Detection

    This application demonstrates machine learning models for detecting fraudulent auto insurance claims.
    The models are trained on historical claims data and can predict whether a new claim is likely to be fraudulent.

    **Models Available:** XGBoost, LightGBM, Random Forest, Logistic Regression
    """)

    with gr.Tabs():
        # Tab 1: Data Overview
        with gr.TabItem("📊 Data Overview"):
            gr.Markdown(get_data_overview())
            with gr.Row():
                dist_plot = gr.Plot(value=plot_class_distribution(train_df, test_df),
                                    label="Class Distribution")

        # Tab 2: Model Evaluation
        with gr.TabItem("🎯 Model Evaluation"):
            with gr.Row():
                model_selector = gr.Dropdown(
                    choices=list(models.keys()),
                    value="XGBoost",
                    label="Select Model"
                )
                plot_selector = gr.Dropdown(
                    choices=["Precision-Recall Curve", "ROC Curve", "Confusion Matrix",
                             "Feature Importance", "Threshold Analysis"],
                    value="Precision-Recall Curve",
                    label="Select Visualization"
                )

            with gr.Row():
                with gr.Column(scale=1):
                    metrics_display = gr.Markdown()
                    report_display = gr.Markdown()
                with gr.Column(scale=2):
                    plot_display = gr.Plot()

            # Update displays when model or plot changes
            def update_all(model_name, plot_type):
                metrics, report = update_model_display(model_name)
                plot = get_selected_plot(model_name, plot_type)
                return metrics, report, plot

            model_selector.change(
                fn=update_all,
                inputs=[model_selector, plot_selector],
                outputs=[metrics_display, report_display, plot_display]
            )
            plot_selector.change(
                fn=update_all,
                inputs=[model_selector, plot_selector],
                outputs=[metrics_display, report_display, plot_display]
            )

            # Load initial values
            demo.load(
                fn=update_all,
                inputs=[model_selector, plot_selector],
                outputs=[metrics_display, report_display, plot_display]
            )

        # Tab 3: Model Comparison
        with gr.TabItem("📈 Compare Models"):
            gr.Markdown("## All Models Performance Comparison")

            comparison_table, comparison_summary, comparison_plot = get_comparison_results()

            gr.Markdown(comparison_summary)
            gr.Markdown(comparison_table)
            gr.Plot(value=comparison_plot, label="Model Comparison Chart")

        # Tab 4: Threshold Analysis
        with gr.TabItem("⚖️ Threshold Optimization"):
            gr.Markdown("""
            ## Finding the Optimal Classification Threshold

            In fraud detection, the default 0.5 threshold isn't always optimal.
            We need to balance:
            - **Precision:** Not flagging legitimate claims as fraud (customer experience)
            - **Recall:** Catching actual frauds (financial loss prevention)

            The optimal threshold maximizes the F1 score, which balances both concerns.
            """)

            thresh_model = gr.Dropdown(
                choices=list(models.keys()),
                value="XGBoost",
                label="Select Model for Threshold Analysis"
            )

            thresh_plot = gr.Plot()

            def update_threshold_plot(model_name):
                y_proba = all_probabilities[model_name]
                return plot_threshold_analysis(y_test, y_proba, model_name)

            thresh_model.change(
                fn=update_threshold_plot,
                inputs=[thresh_model],
                outputs=[thresh_plot]
            )

            demo.load(
                fn=update_threshold_plot,
                inputs=[thresh_model],
                outputs=[thresh_plot]
            )

            # Show optimal thresholds for all models
            thresh_summary = "### Optimal Thresholds by Model\n\n| Model | Optimal Threshold | F1 at Optimal |\n|-------|-------------------|---------------|\n"
            for name in models.keys():
                opt_thresh, opt_f1, _, _ = find_optimal_threshold(y_test, all_probabilities[name])
                thresh_summary += f"| {name} | {opt_thresh:.2f} | {opt_f1:.4f} |\n"

            gr.Markdown(thresh_summary)

        # Tab 5: About
        with gr.TabItem("ℹ️ About"):
            gr.Markdown("""
            ## About This Project

            ### Business Context
            Auto insurance fraud costs the industry billions of dollars annually.
            This project builds machine learning models to automatically flag potentially
            fraudulent claims for further investigation.

            ### Technical Approach
            1. **Data Preparation:** The dataset contains 46 features describing claims and customers
            2. **Class Imbalance:** Only ~3% of claims are fraudulent. We use SMOTE to balance the training data
            3. **Model Training:** Four different algorithms are compared
            4. **Evaluation:** Focus on Precision-Recall metrics due to class imbalance
            5. **Threshold Optimization:** Find the best cutoff for business needs

            ### Models Used
            - **XGBoost:** Gradient boosting with regularization, excellent for tabular data
            - **LightGBM:** Fast gradient boosting, memory efficient
            - **Random Forest:** Ensemble of decision trees, robust and interpretable
            - **Logistic Regression:** Linear baseline model, highly interpretable

            ### Key Metrics Explained
            - **Precision:** Of claims flagged as fraud, how many are actually fraudulent
            - **Recall:** Of actual frauds, how many did we catch
            - **F1 Score:** Harmonic mean of precision and recall
            - **ROC AUC:** Overall discrimination ability

            ### Why Precision-Recall over ROC?
            For highly imbalanced datasets like fraud detection, Precision-Recall curves
            give a more realistic picture of model performance than ROC curves.
            """)


# Launch the app
if __name__ == "__main__":
    demo.launch()
requirements.txt
ADDED

# Auto Insurance Fraud Detection - Dependencies
# For Hugging Face Spaces (CPU-only, Free Tier)

# Core ML Libraries
pandas==2.0.3
numpy==1.24.3
scikit-learn==1.3.0
xgboost==2.0.0
lightgbm==4.1.0
imbalanced-learn==0.11.0

# Visualization
matplotlib==3.7.2
seaborn==0.12.2

# Gradio for web interface
gradio==4.19.2

# Utilities
tabulate==0.9.0
test.csv
ADDED
The diff for this file is too large to render. See raw diff.

train.csv
ADDED
The diff for this file is too large to render. See raw diff.