---
license: apache-2.0
language:
- en
- fr
metrics:
- accuracy
- f1
- recall
- precision
- matthews_correlation
pipeline_tag: tabular-classification
tags:
- finance
---
# 💳 Credit Card Fraud Detection with Random Forest
## 📚 Project Description
This project detects fraudulent credit card transactions with a supervised machine learning approach. The dataset is highly imbalanced (about 0.17% of transactions are fraudulent), so the task resembles a real-world anomaly detection problem. We trained a **Random Forest Classifier** and evaluated it with metrics that remain informative under class imbalance.
---
## 📁 Dataset Overview
- **Source**: [Kaggle - Credit Card Fraud Detection](https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud)
- **Description**: Transactions made by European cardholders in September 2013.
- **Total Samples**: 284,807 transactions
- **Fraudulent Cases**: 492 (~0.172%)
- **Features**:
  - `Time`: Seconds elapsed since the first transaction in the dataset
  - `Amount`: Transaction amount
  - `V1` to `V28`: Principal components (PCA-transformed)
  - `Class`: Target (0 = Legitimate, 1 = Fraudulent)
---
## 🧠 Model Used
### `RandomForestClassifier` Configuration:
```python
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(
    n_estimators=500,
    max_depth=20,
    min_samples_split=2,
    min_samples_leaf=1,
    max_features='sqrt',
    bootstrap=True,
    random_state=42,
    n_jobs=-1
)
```
---
## 📊 Model Evaluation Metrics
| Metric | Value |
|----------------------------------|-----------|
| **Accuracy** | 0.9996 |
| **Precision** | 0.9747 |
| **Recall (Sensitivity)** | 0.7857 |
| **F1 Score** | 0.8701 |
| **Matthews Correlation Coefficient (MCC)** | 0.8749 |

📌 **Interpretation**:
- **Accuracy** is high largely because of class imbalance: a model that labels every transaction as legitimate would already reach about 99.83% accuracy on this dataset, so accuracy alone says little.
- **Precision** is high: most predicted frauds are true frauds.
- **Recall** is moderate: roughly 21% of frauds are missed.
- **F1 score** balances precision and recall.
- **MCC** gives a reliable measure even with class imbalance.
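The relationships among these metrics can be checked by hand. The sketch below recomputes them from a confusion matrix. The four counts are an assumption chosen for illustration because they round to the reported values; the actual confusion matrix and test split are not published here. `classification_metrics` is a hypothetical helper name.

```python
import math


def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall, F1 and MCC from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mcc": mcc,
    }


# Hypothetical counts (not reported by the authors), picked to match the table
metrics = classification_metrics(tp=77, fp=2, fn=21, tn=56862)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```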
---
## ⏱️ Performance Timing
| Phase | Time (seconds) |
|--------------------|----------------|
| Training | 375.41 |
| Prediction | 0.94 |
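Timings like these can be collected with `time.perf_counter`. The sketch below is only an assumption about how such measurements might be taken, since the card does not say how they were obtained. `timed` is a hypothetical helper, and the arithmetic inside each block stands in for the real `fit` and `predict` calls.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Record the wall-clock duration of the enclosed block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


timings = {}
with timed("Training", timings):
    sum(i * i for i in range(100_000))  # stand-in for rfc.fit(X_train, y_train)
with timed("Prediction", timings):
    sum(range(10_000))  # stand-in for rfc.predict(X_test)

for phase, seconds in timings.items():
    print(f"{phase}: {seconds:.2f} s")
```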
---
## 📦 Exported Artifacts
- `random_forest_model_fraud_classification.pkl`: Trained Random Forest model
- `features.json`: Feature list used during training
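Because the usage guide indexes the DataFrame with the loaded list (`df[features]`), `features.json` is presumably a JSON array of column names. The sketch below writes and reads a file of that shape. The column list and its order (`Time`, `V1` to `V28`, `Amount`) are an assumption based on the dataset description, not the contents of the published file.

```python
import json
import os
import tempfile

# Assumed feature order: Time, V1..V28, Amount (the real file may differ)
features = ["Time"] + [f"V{i}" for i in range(1, 29)] + ["Amount"]

# Write to a temporary directory so a real features.json is not overwritten
path = os.path.join(tempfile.mkdtemp(), "features.json")
with open(path, "w") as f:
    json.dump(features, f, indent=2)

# Read it back the same way the usage guide does
with open(path, "r") as f:
    loaded = json.load(f)

print(len(loaded), loaded[:3], loaded[-1])
```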
---
## 🚀 Usage Guide
### 1️⃣ Install Dependencies
```bash
pip install pandas scikit-learn joblib
```
---
### 2️⃣ Load Model and Features
```python
import joblib
import json
import pandas as pd
# Load the trained model
model = joblib.load("random_forest_model_fraud_classification.pkl")
# Load the feature list
with open("features.json", "r") as f:
    features = json.load(f)
```
---
### 3️⃣ Prepare Input Data
```python
# Load your new transaction data
df = pd.read_csv("your_new_transactions.csv")
# Filter to keep only relevant features
df = df[features]
```
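`df[features]` raises a `KeyError` if any expected column is absent, so it can help to check the columns before filtering. The helper below is a hypothetical sketch (`check_columns` is not part of this project). It works on plain lists of names, so with pandas you would pass `list(df.columns)` as the first argument. The column names in the example are made up.

```python
def check_columns(available, required):
    """Split required column names into those present and those missing."""
    available_set = set(available)
    present = [name for name in required if name in available_set]
    missing = [name for name in required if name not in available_set]
    return present, missing


# Illustrative column names only; with pandas, use list(df.columns)
csv_columns = ["Time", "V1", "V2", "Amount", "Class"]
required = ["Time", "V1", "V2", "V3", "Amount"]

present, missing = check_columns(csv_columns, required)
if missing:
    print(f"Missing columns: {missing}")
else:
    print("All required columns are present")
```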
---
### 4️⃣ Make Predictions
```python
# Predict classes
predictions = model.predict(df)
# Predict fraud probability
probabilities = model.predict_proba(df)[:, 1]
print(predictions)
print(probabilities)
```
---
## 📌 Notes
- Due to the **high class imbalance**, precision and recall should always be monitored.
- Adjust the decision threshold to optimize for recall or precision depending on your business needs.
- The metrics above were measured on the September 2013 dataset only. Fraud patterns change over time, so retrain and re-evaluate the model periodically with new data.
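Adjusting the threshold means comparing the fraud probability from `predict_proba` against a cutoff of your choice instead of relying on `predict`, which for a binary problem corresponds to a cutoff of about 0.5. The sketch below shows how precision and recall move as the cutoff changes. The scores and labels are made-up values for illustration, and `precision_recall_at` is a hypothetical helper; in practice you would pass the `probabilities` array from the usage guide together with known labels.

```python
def precision_recall_at(probabilities, labels, threshold):
    """Compute precision and recall when scores >= threshold are flagged as fraud."""
    predicted = [1 if p >= threshold else 0 for p in probabilities]
    pairs = list(zip(labels, predicted))
    tp = sum(1 for y, y_hat in pairs if y == 1 and y_hat == 1)
    fp = sum(1 for y, y_hat in pairs if y == 0 and y_hat == 1)
    fn = sum(1 for y, y_hat in pairs if y == 1 and y_hat == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up scores and ground-truth labels, for illustration only
probabilities = [0.05, 0.10, 0.35, 0.40, 0.60, 0.80, 0.95]
labels = [0, 0, 1, 0, 1, 1, 1]

# A lower cutoff catches more fraud (higher recall) at the cost of more false alarms
for threshold in (0.3, 0.5, 0.7):
    precision, recall = precision_recall_at(probabilities, labels, threshold)
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```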
---
## 🙏 Acknowledgements
- Dataset provided by ULB & Worldline
- Original research: *Dal Pozzolo et al.*
- [Credit Card Fraud Detection - Kaggle](https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud)
---
## 📃 License
Apache License 2.0: you are free to use, modify, and distribute this project under the terms of the Apache 2.0 License.