---
title: Credit Card Fraud Detection
emoji: 🛡️
colorFrom: blue
colorTo: red
sdk: gradio
sdk_version: 4.44.0
app_file: app.py
pinned: false
license: mit
---

# 🛡️ Credit Card Fraud Detection System

A machine learning pipeline that scores credit card transactions for fraud in real time using an ensemble of LightGBM, XGBoost, and Random Forest models.

## 🚀 Quick Start

1. **Upload a CSV file** with transaction data
2. **Adjust the threshold** (recommended: 0.01-0.05 for imbalanced data)
3. **Click "Detect Fraud"** to analyze transactions
4. **View results** across 15 interactive visualizations

## 📊 Features

- **Real-time fraud scoring** with calibrated probabilities
- **15 interactive visualizations** including:
  - Fraud probability distributions
  - Risk level breakdowns
  - Time series analysis
  - Model performance metrics (when ground truth available)
  - ROC and Precision-Recall curves
  - Feature correlation heatmaps
  - Threshold sensitivity analysis

## 📁 Required CSV Format

Your CSV should include at minimum:
- `unix_time` or timestamp column
- `amt` or amount column
- `city_pop` (city population)
- `dist_home_merch` (distance from home to merchant)
- `category` (transaction category)

**Note:** If you're missing velocity features (like `txn_count_last_1h`), the system will fill them with sensible defaults.
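
As a rough illustration of the format check and default filling described above, the sketch below validates a CSV header and adds a missing velocity column. It is not the app's actual code: the alternative names `timestamp` and `amount`, the helper names, the default value `0`, and the sample row are all assumptions made for the example.

```python
import csv
import io

# Required columns from the list above; each tuple holds accepted names.
# The alternative names "timestamp" and "amount" are assumptions.
REQUIRED = [
    ("unix_time", "timestamp"),
    ("amt", "amount"),
    ("city_pop",),
    ("dist_home_merch",),
    ("category",),
]

# Hypothetical default; the app's real fill values are not documented here.
VELOCITY_DEFAULTS = {"txn_count_last_1h": 0}


def missing_required(header):
    """Return the required column groups that have no match in the header."""
    present = set(header)
    return [group for group in REQUIRED if not present.intersection(group)]


def fill_velocity_defaults(rows):
    """Add each missing velocity column to every row with its default value."""
    # Values already present in a row take precedence over the defaults.
    return [{**VELOCITY_DEFAULTS, **row} for row in rows]


# Made-up sample row, for illustration only.
sample = io.StringIO(
    "unix_time,amt,city_pop,dist_home_merch,category\n"
    "1371816865,2.86,333497,24.56,personal_care\n"
)
reader = csv.DictReader(sample)
rows = fill_velocity_defaults(list(reader))
problems = missing_required(reader.fieldnames)
```

Here `problems` is an empty list when every required column is present, and each row gains a `txn_count_last_1h` key if the file did not supply one.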

## 🎯 Model Details

- **25 engineered features** including time-based, velocity, and aggregated features
- **Ensemble approach**: LightGBM + XGBoost + Random Forest
- **Calibrated probabilities** for reliable threshold tuning
- **Handles imbalanced data** (typical fraud rate: 0.2%)
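
To make the time-based and velocity features more concrete, here is a minimal sketch of how two such features could be derived from `unix_time`. The function names, the UTC assumption, and the inclusive one-hour window are illustrative choices, not a description of the project's actual feature engineering.

```python
from bisect import bisect_left
from datetime import datetime, timezone


def hour_of_day(unix_time):
    """Hour (0-23) in UTC for a Unix timestamp given in seconds."""
    return datetime.fromtimestamp(unix_time, tz=timezone.utc).hour


def txn_count_last_1h(sorted_times):
    """Count, for each transaction, the earlier ones in the previous 3600 s.

    `sorted_times` must be ascending Unix timestamps for a single card.
    A transaction exactly 3600 s earlier is counted (inclusive window).
    """
    counts = []
    for i, t in enumerate(sorted_times):
        # Index of the first earlier transaction at or after t - 3600.
        start = bisect_left(sorted_times, t - 3600, 0, i)
        counts.append(i - start)
    return counts


times = [0, 600, 3000, 4000, 10000]
print(txn_count_last_1h(times))  # [0, 1, 2, 2, 0]
```

In practice the velocity count would be computed per card, so the input would first be grouped by a card identifier column, which this README does not name.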

## 📈 Expected Performance

- **ROC-AUC**: ~0.81 (good discrimination)
- **PR-AUC**: 0.01-0.10 (typical for imbalanced data)
- **Precision**: 0.01-0.20 (depends on threshold)
- **Recall**: 0.50-0.95 (depends on threshold)
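
The threshold dependence of precision and recall can be seen in a small worked example. The labels and scores below are invented for illustration and are not output from the model.

```python
def precision_recall(y_true, scores, threshold):
    """Precision and recall when flagging every score >= threshold as fraud."""
    flagged = [s >= threshold for s in scores]
    tp = sum(1 for y, f in zip(y_true, flagged) if y == 1 and f)
    fp = sum(1 for y, f in zip(y_true, flagged) if y == 0 and f)
    fn = sum(1 for y, f in zip(y_true, flagged) if y == 1 and not f)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Invented data: six legitimate transactions and two fraudulent ones.
y_true = [0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.001, 0.002, 0.004, 0.02, 0.03, 0.008, 0.04, 0.6]

print(precision_recall(y_true, scores, 0.5))   # (1.0, 0.5)
print(precision_recall(y_true, scores, 0.01))  # (0.5, 1.0)
```

Lowering the threshold from 0.5 to 0.01 catches the second fraud case (recall rises from 0.5 to 1.0) at the cost of two false alarms (precision falls from 1.0 to 0.5).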

## 💡 Usage Tips

1. **Threshold Selection**: For imbalanced fraud data, use **0.01-0.05** instead of 0.5. The default 0.5 is too high and will miss most fraud.

2. **File Size**: Processing is limited to 10,000 rows to keep the app responsive.

3. **Ground Truth**: If your CSV includes a fraud label column (`is_fraud`, `fraud`, `target`, etc.), the app will automatically calculate model performance metrics.
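
The row limit and label detection from the tips above might look roughly like the following. This is a sketch under assumptions: the candidate list covers only the three names given here (the "etc." is left open), matching is assumed to be case-insensitive, and the limit is assumed to truncate rather than reject the file.

```python
MAX_ROWS = 10_000

# Only the label names listed above; any others the app accepts are unknown.
LABEL_CANDIDATES = ("is_fraud", "fraud", "target")


def find_label_column(header):
    """Return the first header name that matches a label candidate, or None."""
    # Case-insensitive matching is an assumption of this sketch.
    lowered = {name.lower(): name for name in header}
    for candidate in LABEL_CANDIDATES:
        if candidate in lowered:
            return lowered[candidate]
    return None


def cap_rows(rows, limit=MAX_ROWS):
    """Keep at most `limit` rows (assumes the app truncates larger files)."""
    return rows[:limit]


label = find_label_column(["unix_time", "amt", "Is_Fraud"])
```

When `find_label_column` returns `None`, only scoring is possible and the performance metrics would be skipped.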

## 🔧 Technical Details

- **Framework**: Gradio Blocks API
- **Visualizations**: Plotly (interactive charts)
- **Model**: Calibrated ensemble (LightGBM + XGBoost + Random Forest), stored as `fraud_lgbm_calibrated.pkl`
- **Features**: 25 engineered features with automatic feature engineering

## 📝 Model File

**Important**: This Space requires the model file `fraud_lgbm_calibrated.pkl` to be present.

If deploying this Space:
1. Train the model using `train_improved_model.py` (if you have the training script)
2. Upload the resulting model file to the Space repository, using Git LFS if the file is large
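
A startup check for the model file could be sketched as follows. It assumes the `.pkl` file is a standard pickle and that it sits next to `app.py`; the `load_model` helper is hypothetical, and the app may load the file differently (for example with `joblib`).

```python
import pickle
from pathlib import Path

# Assumed location: the Space repository root, next to app.py.
MODEL_PATH = Path("fraud_lgbm_calibrated.pkl")


def load_model(path=MODEL_PATH):
    """Load the pickled model, failing early with a clear message if absent."""
    if not path.is_file():
        raise FileNotFoundError(
            f"Model file not found: {path}. "
            "Train it or upload it to the Space repository."
        )
    # Only unpickle files you trust: pickle can execute arbitrary code.
    with path.open("rb") as fh:
        return pickle.load(fh)
```

Calling `load_model()` once at startup surfaces a missing file immediately instead of at the first prediction request.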

## 🔗 Related Resources

- Full project documentation: See `README.md` in the repository
- Model training: `train_improved_model.py`
- Sample data generator: `generate_sample_dataset.py`
- Power BI integration: `powerbi_export.py`

## 📄 License

Released under the MIT license, as declared in the Space metadata. This project is intended for educational and portfolio purposes. Ensure you have proper data usage rights before processing real transaction data.

---

Built with Python, LightGBM, XGBoost, Gradio, and Plotly.