---
license: apache-2.0
language:
- en
- fr
metrics:
- accuracy
- f1
- recall
- precision
- matthews_correlation
pipeline_tag: tabular-classification
tags:
- finance
---
# 💳 Credit Card Fraud Detection with Random Forest

## 📚 Project Description

This project detects fraudulent credit card transactions using a supervised machine learning approach. With fraud accounting for only ~0.172% of transactions, the dataset is highly imbalanced, which makes this a realistic anomaly detection problem. We trained a **Random Forest Classifier** configured for performance and robustness on this imbalanced data.

---

## 📁 Dataset Overview

- **Source**: [Kaggle - Credit Card Fraud Detection](https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud)
- **Description**: Transactions made by European cardholders in September 2013.
- **Total Samples**: 284,807 transactions  
- **Fraudulent Cases**: 492 (~0.172%)  
- **Features**:
  - `Time`: Seconds elapsed between each transaction and the first transaction in the dataset  
  - `Amount`: Transaction amount  
  - `V1` to `V28`: Principal components (PCA-transformed)  
  - `Class`: Target (0 = Legitimate, 1 = Fraudulent)
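
The imbalance is worth verifying before training. A minimal sketch (the synthetic frame below stands in for the real Kaggle file, which would be loaded with `pd.read_csv("creditcard.csv")`):

```python
import pandas as pd

# Stand-in frame with the same target column as the Kaggle dataset;
# replace with pd.read_csv("creditcard.csv") for the real data.
df = pd.DataFrame({"Class": [0] * 998 + [1] * 2})

counts = df["Class"].value_counts()
fraud_ratio = counts.get(1, 0) / len(df)
print(f"Fraud ratio: {fraud_ratio:.4%}")  # → Fraud ratio: 0.2000% (≈0.172% in the real data)
```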

---

## 🧠 Model Used

### `RandomForestClassifier` Configuration:

```python
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(
    n_estimators=500,
    max_depth=20,
    min_samples_split=2,
    min_samples_leaf=1,
    max_features='sqrt',
    bootstrap=True,
    random_state=42,
    n_jobs=-1
)
```
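
A minimal end-to-end training sketch using this configuration, under stated assumptions: synthetic features stand in for `Time`, `Amount`, and `V1`–`V28`, and `n_estimators` is reduced from 500 so the demo runs quickly:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 30 input features and the imbalanced target.
rng = np.random.default_rng(42)
X = rng.normal(size=(2000, 30))
y = (rng.random(2000) < 0.02).astype(int)  # ~2% "fraud" labels

# A stratified split preserves the class ratio in both sets,
# which matters with so few positives.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

rfc = RandomForestClassifier(
    n_estimators=50,        # 500 in the full configuration above
    max_depth=20,
    max_features="sqrt",
    random_state=42,
    n_jobs=-1,
)
rfc.fit(X_train, y_train)
print(f"Test accuracy: {rfc.score(X_test, y_test):.4f}")
```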

---

## 📊 Model Evaluation Metrics

| Metric                           | Value     |
|----------------------------------|-----------|
| **Accuracy**                     | 0.9996    |
| **Precision**                    | 0.9747    |
| **Recall (Sensitivity)**         | 0.7857    |
| **F1 Score**                     | 0.8701    |
| **Matthews Correlation Coefficient (MCC)** | 0.8749 |

📌 **Interpretation**:
- **High accuracy** is expected due to class imbalance.
- **Precision** is high: most predicted frauds are true frauds.
- **Recall** is moderate: some frauds are missed.
- **F1 score** balances precision and recall.
- **MCC** gives a reliable measure even with class imbalance.
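
All five metrics are available in `sklearn.metrics`. A sketch with illustrative labels (in practice, pass `y_test` and the model's predictions):

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, matthews_corrcoef)

# Illustrative labels: 4 true frauds, 3 caught, none falsely flagged.
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 0, 0, 1, 1, 0, 0, 1, 0]

print("Accuracy :", accuracy_score(y_true, y_pred))      # 0.9
print("Precision:", precision_score(y_true, y_pred))     # 1.0
print("Recall   :", recall_score(y_true, y_pred))        # 0.75
print("F1       :", f1_score(y_true, y_pred))
print("MCC      :", matthews_corrcoef(y_true, y_pred))
```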

---

## ⏱️ Performance Timing

| Phase              | Time (seconds) |
|--------------------|----------------|
| Training           | 375.41         |
| Prediction         | 0.94           |

---

## 📦 Exported Artifacts

- `random_forest_model_fraud_classification.pkl`: Trained Random Forest model
- `features.json`: Feature list used during training

---

## 🚀 Usage Guide

### 1️⃣ Install Dependencies

```bash
pip install pandas scikit-learn joblib
```

---

### 2️⃣ Load Model and Features

```python
import joblib
import json
import pandas as pd

# Load the trained model
model = joblib.load("random_forest_model_fraud_classification.pkl")

# Load the feature list
with open("features.json", "r") as f:
    features = json.load(f)
```

---

### 3️⃣ Prepare Input Data

```python
# Load your new transaction data
df = pd.read_csv("your_new_transactions.csv")

# Keep only the features used during training, in the order the model expects
df = df[features]
```

---

### 4️⃣ Make Predictions

```python
# Predict classes
predictions = model.predict(df)

# Predict fraud probability
probabilities = model.predict_proba(df)[:, 1]

print(predictions)
print(probabilities)
```

---

## 📌 Notes

- Due to the **high class imbalance**, precision and recall should always be monitored.
- Adjust the decision threshold to optimize for recall or precision depending on your business needs.
- The model generalizes well but should be retrained periodically with new data.
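
For example, lowering the cutoff applied to `predict_proba` trades precision for recall (the probabilities below are illustrative, not model output):

```python
import numpy as np

# Hypothetical fraud probabilities, as returned by model.predict_proba(df)[:, 1]
probabilities = np.array([0.05, 0.40, 0.65, 0.92])

# Default 0.5 cutoff vs. a lower cutoff that favours recall
default_preds = (probabilities >= 0.5).astype(int)
recall_preds = (probabilities >= 0.3).astype(int)

print(default_preds)  # [0 0 1 1]
print(recall_preds)   # [0 1 1 1] — one extra transaction flagged for review
```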

---

## πŸ™ Acknowledgements

- Dataset provided by ULB & Worldline  
- Original research: *Dal Pozzolo et al.*  
- [Credit Card Fraud Detection - Kaggle](https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud)

---

## 📃 License

Apache License 2.0. You are free to use, modify, and distribute this project under the terms of the Apache 2.0 License.