---
license: apache-2.0
language:
- en
- fr
metrics:
- accuracy
- f1
- recall
- precision
- matthews_correlation
pipeline_tag: tabular-classification
tags:
- finance
---
# 💳 Credit Card Fraud Detection with Random Forest

## 📚 Project Description

This project detects fraudulent credit card transactions with supervised machine learning. Fraud makes up only about 0.17% of the transactions, so the dataset is highly imbalanced and plain accuracy says little about model quality. We trained a **Random Forest Classifier** with the configuration listed under *Model Used*.

---

## 📁 Dataset Overview

- **Source**: [Kaggle - Credit Card Fraud Detection](https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud)
- **Description**: Transactions made by European cardholders over two days in September 2013.
- **Total Samples**: 284,807 transactions  
- **Fraudulent Cases**: 492 (~0.172%)  
- **Features**:
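  - `Time`: Seconds elapsed between this transaction and the first transaction in the dataset  
  - `Amount`: Transaction amount  
  - `V1` to `V28`: Principal components from a PCA transformation (the original features are withheld for confidentiality)  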
  - `Class`: Target (0 = Legitimate, 1 = Fraudulent)
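
Before training, it can help to confirm the class imbalance directly. The sketch below is illustrative only: `class_balance` is a hypothetical helper and the toy frame stands in for the real CSV, which you would load with `pd.read_csv` from wherever you saved the Kaggle file.

```python
import pandas as pd


def class_balance(df: pd.DataFrame, target: str = "Class") -> pd.DataFrame:
    """Return per-class counts and percentage shares for the target column."""
    counts = df[target].value_counts().sort_index()
    percent = (counts / len(df) * 100).round(3)
    return pd.DataFrame({"count": counts, "percent": percent})


# Toy frame standing in for the real data (hypothetical values).
toy = pd.DataFrame({"Amount": [10.0, 25.5, 3.2, 990.0], "Class": [0, 0, 0, 1]})
balance = class_balance(toy)
print(balance)

# Fraud share implied by the figures quoted above: 492 of 284,807 transactions.
fraud_rate = 492 / 284_807 * 100
print(f"{fraud_rate:.4f}%")
```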

---

## 🧠 Model Used

### `RandomForestClassifier` Configuration:

```python
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(
    n_estimators=500,
    max_depth=20,
    min_samples_split=2,
    min_samples_leaf=1,
    max_features='sqrt',
    bootstrap=True,
    random_state=42,
    n_jobs=-1
)
```
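
This card does not state how the data was split for training and evaluation. As a minimal sketch, assuming a stratified 80/20 split, the configuration above could be fitted as follows. The synthetic data from `make_classification` is a stand-in for the real dataset, and the tree count is reduced so the example runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the real data
# (assumption: 30 features, roughly 2% positives).
X, y = make_classification(
    n_samples=2000,
    n_features=30,
    n_informative=10,
    weights=[0.98, 0.02],
    random_state=42,
)

# Assumed split: 80/20, stratified so both parts keep the same fraud ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Same hyperparameters as above, except fewer trees to keep the sketch fast.
sketch_rfc = RandomForestClassifier(
    n_estimators=50,
    max_depth=20,
    max_features="sqrt",
    bootstrap=True,
    random_state=42,
    n_jobs=-1,
)
sketch_rfc.fit(X_train, y_train)
print(sketch_rfc.predict(X_test[:5]))
```

Stratifying matters with so few positives: a purely random split can leave the test set with too few fraud cases to measure recall meaningfully.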

---

## 📊 Model Evaluation Metrics

| Metric                                     | Value  |
|--------------------------------------------|--------|
| **Accuracy**                               | 0.9996 |
| **Precision**                              | 0.9747 |
| **Recall (Sensitivity)**                   | 0.7857 |
| **F1 Score**                               | 0.8701 |
| **Matthews Correlation Coefficient (MCC)** | 0.8749 |

📌 **Interpretation**:
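- **Accuracy** is uninformative on its own: a model that labels every transaction as legitimate would already score about 0.998 on the full dataset, because only ~0.172% of transactions are fraudulent.
- **Precision** is high: about 97% of the transactions flagged as fraud are actually fraudulent.
- **Recall** is moderate: about 21% of fraudulent transactions are missed.
- **F1 score** is the harmonic mean of precision and recall.
- **MCC** takes all four confusion-matrix cells into account, so it stays informative under class imbalance.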
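
The metrics in the table can all be computed with `sklearn.metrics`. The sketch below uses toy labels, not this model's predictions, and `evaluate` is a hypothetical helper. It also checks two figures that follow from the numbers quoted in this card: the accuracy of an "always legitimate" baseline and the F1 score implied by the reported precision and recall.

```python
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    matthews_corrcoef,
    precision_score,
    recall_score,
)


def evaluate(y_true, y_pred):
    """Compute the five metrics reported in the table above."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        "mcc": matthews_corrcoef(y_true, y_pred),
    }


# Toy labels for illustration only (1 = fraud).
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]
scores = evaluate(y_true, y_pred)
print(scores)

# Accuracy of a baseline that predicts "legitimate" for every transaction.
majority_baseline = 1 - 492 / 284_807

# F1 implied by the reported precision and recall (harmonic mean).
f1_from_table = 2 * 0.9747 * 0.7857 / (0.9747 + 0.7857)
print(round(majority_baseline, 4), round(f1_from_table, 4))
```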

---

## ⏱️ Performance Timing

| Phase              | Time (seconds) |
|--------------------|----------------|
| Training           | 375.41         |
| Prediction         | 0.94           |

---

## 📦 Exported Artifacts

- `random_forest_model_fraud_classification.pkl`: Trained Random Forest model
- `features.json`: Feature list used during training

---

## 🚀 Usage Guide

### 1️⃣ Install Dependencies

```bash
pip install pandas scikit-learn joblib
```

---

### 2️⃣ Load Model and Features

```python
import joblib
import json
import pandas as pd

# Load the trained model
model = joblib.load("random_forest_model_fraud_classification.pkl")

# Load the feature list
with open("features.json", "r") as f:
    features = json.load(f)
```

---

### 3️⃣ Prepare Input Data

```python
# Load your new transaction data
df = pd.read_csv("your_new_transactions.csv")

# Keep only the features used in training, in the order stored in features.json
df = df[features]
```
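
Selecting columns with `df[features]` raises a `KeyError` if any expected column is absent. A small guard gives a clearer message; `select_features` is a hypothetical helper, and the toy frame and column names are for illustration only.

```python
import pandas as pd


def select_features(df: pd.DataFrame, features: list) -> pd.DataFrame:
    """Return df restricted to `features`, in that order, or fail clearly."""
    missing = [col for col in features if col not in df.columns]
    if missing:
        raise ValueError(f"Input is missing required columns: {missing}")
    return df[features]


# Toy input with an extra column and a different column order.
toy = pd.DataFrame({"Amount": [12.5], "extra": ["x"], "V1": [-1.3], "Time": [0.0]})
wanted = ["Time", "V1", "Amount"]  # hypothetical subset of the feature list

clean = select_features(toy, wanted)
print(list(clean.columns))
```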

---

### 4️⃣ Make Predictions

```python
# Predict classes
predictions = model.predict(df)

# Predict fraud probability
probabilities = model.predict_proba(df)[:, 1]

print(predictions)
print(probabilities)
```

---

## 📌 Notes

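- Because of the **high class imbalance**, monitor precision and recall rather than accuracy.
- `predict` applies a fixed 0.5 cutoff to the fraud probability. To trade precision against recall, apply your own threshold to the output of `predict_proba`: lower it to catch more fraud, raise it to reduce false alarms.
- The reported metrics describe data from September 2013 only. Fraud patterns change over time, so retrain the model periodically with new data.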
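
Threshold adjustment can be sketched with `precision_recall_curve`. The helper `threshold_for_recall` below is hypothetical, and the labels and probabilities are toy values rather than output from this model. It picks the highest threshold that still reaches a target recall, which keeps precision as high as that recall allows.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve


def threshold_for_recall(y_true, probabilities, min_recall):
    """Return the highest decision threshold whose recall is >= min_recall."""
    precision, recall, thresholds = precision_recall_curve(y_true, probabilities)
    # recall has one more entry than thresholds; the final point (recall = 0)
    # has no threshold, so drop it to align the arrays.
    reachable = recall[:-1] >= min_recall
    if not reachable.any():
        raise ValueError("No threshold reaches the requested recall")
    return thresholds[reachable].max()


# Toy labels and fraud probabilities for illustration only.
y_true = np.array([0, 0, 0, 1, 0, 1, 1])
probabilities = np.array([0.05, 0.10, 0.20, 0.35, 0.40, 0.70, 0.90])

threshold = threshold_for_recall(y_true, probabilities, min_recall=1.0)
custom_predictions = (probabilities >= threshold).astype(int)
print(threshold, custom_predictions)
```

Choose the threshold on a validation set, not on the data used for the final evaluation, so the reported metrics stay unbiased.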

---

## 🙏 Acknowledgements

- Dataset provided by ULB & Worldline  
- Original research: *Dal Pozzolo et al.*  
- [Credit Card Fraud Detection - Kaggle](https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud)

---

## 📃 License

Apache License 2.0: you are free to use, modify, and distribute this project under the terms of the license.