---
license: apache-2.0
language:
- en
- fr
metrics:
- accuracy
- f1
- recall
- precision
- matthews_correlation
pipeline_tag: tabular-classification
tags:
- finance
---
# Credit Card Fraud Detection with Random Forest
## Project Description
This project detects fraudulent credit card transactions with a supervised machine learning model. The dataset is highly imbalanced (about 0.17% of transactions are fraudulent), which makes this a realistic rare-event detection problem. We trained a **Random Forest Classifier** with the configuration listed under "Model Used".
---
## Dataset Overview
- **Source**: [Kaggle - Credit Card Fraud Detection](https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud)
- **Description**: Transactions made by European cardholders in September 2013.
- **Total Samples**: 284,807 transactions
- **Fraudulent Cases**: 492 (~0.172%)
- **Features**:
  - `Time`: Seconds elapsed since the first transaction in the dataset
  - `Amount`: Transaction amount
  - `V1` to `V28`: Principal components (PCA-transformed)
  - `Class`: Target (0 = Legitimate, 1 = Fraudulent)
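
The imbalance can be made concrete with the two counts quoted above. The short sketch below computes the fraud rate and the accuracy of a trivial model that labels every transaction as legitimate; the variable names are illustrative and not part of the project code.

```python
# Class balance implied by the dataset figures quoted above.
total_transactions = 284_807
fraud_cases = 492

fraud_rate = fraud_cases / total_transactions

# Accuracy of a trivial model that labels every transaction as legitimate.
baseline_accuracy = 1 - fraud_rate

print(f"Fraud rate: {fraud_rate:.4%}")
print(f"Majority-class baseline accuracy: {baseline_accuracy:.4f}")
```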
---
## Model Used
### `RandomForestClassifier` Configuration:
```python
from sklearn.ensemble import RandomForestClassifier
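
rfc = RandomForestClassifier(
    n_estimators=500,      # number of trees in the forest
    max_depth=20,          # cap on the depth of each tree
    min_samples_split=2,
    min_samples_leaf=1,
    max_features='sqrt',   # features considered at each split
    bootstrap=True,
    random_state=42,       # fixed seed for reproducibility
    n_jobs=-1              # use all available CPU cores
)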
```
---
## Model Evaluation Metrics
| Metric | Value |
|----------------------------------|-----------|
| **Accuracy** | 0.9996 |
| **Precision** | 0.9747 |
| **Recall (Sensitivity)** | 0.7857 |
| **F1 Score** | 0.8701 |
| **Matthews Correlation Coefficient (MCC)** | 0.8749 |
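
**Interpretation**:
- **High accuracy** is expected due to class imbalance: labeling every transaction as legitimate would already score about 0.998, so accuracy alone says little here.
- **Precision** is high: most predicted frauds are true frauds.
- **Recall** is moderate: roughly 21% of actual frauds are missed.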
- **F1 score** balances precision and recall.
- **MCC** gives a reliable measure even with class imbalance.
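
To show how these metrics relate to one another, the sketch below derives them from confusion-matrix counts. The counts are an assumption, not something the project reports: they were chosen because they reproduce the table values to four decimals, and they would correspond to a held-out set of 56,962 transactions containing 98 frauds.

```python
import math

# Hypothetical confusion-matrix counts (not given in the source). They assume
# a held-out set of 56,962 transactions containing 98 frauds.
tp, fp, fn, tn = 77, 2, 21, 56_862

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)

for name, value in [
    ("Accuracy", accuracy),
    ("Precision", precision),
    ("Recall", recall),
    ("F1", f1),
    ("MCC", mcc),
]:
    print(f"{name}: {value:.4f}")
```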
---
## Performance Timing
| Phase | Time (seconds) |
|--------------------|----------------|
| Training | 375.41 |
| Prediction | 0.94 |
---
## Exported Artifacts
- `random_forest_model_fraud_classification.pkl`: Trained Random Forest model
- `features.json`: Feature list used during training
---
## Usage Guide
### 1. Install Dependencies
```bash
pip install pandas scikit-learn joblib
```
---
### 2. Load Model and Features
```python
import joblib
import json
import pandas as pd

# Load the trained model
model = joblib.load("random_forest_model_fraud_classification.pkl")

# Load the feature list used during training
with open("features.json", "r") as f:
    features = json.load(f)
```
---
### 3. Prepare Input Data
```python
# Load your new transaction data
df = pd.read_csv("your_new_transactions.csv")
# Filter to keep only relevant features
df = df[features]
```
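
Selecting `df[features]` raises a `KeyError` if the input file lacks any training column. A check like the following can report the missing names first. It is a minimal sketch: `find_missing_features` is a hypothetical helper, and the example feature list assumes `features.json` holds the dataset's `Time`, `V1` to `V28`, and `Amount` columns. With pandas, it could be called as `find_missing_features(df.columns, features)`.

```python
def find_missing_features(available_columns, required_features):
    """Return the required feature names absent from the input, in training order."""
    available = set(available_columns)
    return [name for name in required_features if name not in available]


# Hypothetical example: the assumed feature list and an input that lacks `V28`.
required = ["Time"] + [f"V{i}" for i in range(1, 29)] + ["Amount"]
incoming = [name for name in required if name != "V28"] + ["Class"]

missing = find_missing_features(incoming, required)
if missing:
    print(f"Missing columns: {missing}")
```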
---
### 4. Make Predictions
```python
# Predict classes
predictions = model.predict(df)
# Predict fraud probability
probabilities = model.predict_proba(df)[:, 1]
print(predictions)
print(probabilities)
```
---
## Notes
- Due to the **high class imbalance**, precision and recall should always be monitored.
- Adjust the decision threshold to optimize for recall or precision depending on your business needs.
- The model performs well on the evaluation data, but fraud patterns change over time, so it should be retrained periodically with new data.
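
Threshold adjustment can be sketched as follows. `model.predict` corresponds roughly to a 0.5 cutoff on the fraud probability, and lowering the cutoff flags more transactions, trading precision for recall. `apply_threshold` and the example probabilities are hypothetical; on the NumPy array returned by `model.predict_proba(df)[:, 1]`, the equivalent would be `(probabilities >= threshold).astype(int)`.

```python
def apply_threshold(probabilities, threshold=0.5):
    """Label a transaction as fraud (1) when its fraud probability meets the threshold."""
    return [1 if p >= threshold else 0 for p in probabilities]


# Hypothetical fraud probabilities, standing in for model.predict_proba(df)[:, 1].
example_probabilities = [0.02, 0.35, 0.48, 0.91]

default_labels = apply_threshold(example_probabilities)        # cutoff of 0.5
recall_oriented = apply_threshold(example_probabilities, 0.3)  # flags more cases

print(default_labels)
print(recall_oriented)
```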
---
## Acknowledgements
- Dataset provided by ULB & Worldline
- Original research: *Dal Pozzolo et al.*
- [Credit Card Fraud Detection - Kaggle](https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud)
---
## License
Released under the Apache License 2.0: you are free to use, modify, and distribute this project under its terms.