	---
	license: apache-2.0
	language:
	- en
	- fr
	metrics:
	- accuracy
	- f1
	- recall
	- precision
	- matthews_correlation
	pipeline_tag: tabular-classification
	tags:
	- finance
	---
	# 💳 Credit Card Fraud Detection with Random Forest

	## 📚 Project Description

	This project detects fraudulent credit card transactions with a supervised machine learning model. The dataset is highly imbalanced (about 0.17% of transactions are fraudulent), which makes it a realistic anomaly detection problem. We trained a Random Forest Classifier and evaluated it with metrics suited to imbalanced data.

	---

	## 📁 Dataset Overview

	- Source: [Kaggle - Credit Card Fraud Detection](https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud)
	- Description: Transactions made by European cardholders in September 2013.
	- Total Samples: 284,807 transactions
	- Fraudulent Cases: 492 (~0.172%)
	- Features:
	  - `Time`: Seconds elapsed since the first transaction in the dataset
	  - `Amount`: Transaction amount
	  - `V1` to `V28`: Principal components (PCA-transformed)
	  - `Class`: Target (0 = Legitimate, 1 = Fraudulent)
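
To make the imbalance concrete, the short calculation below uses only the counts listed above. The "always predict legitimate" baseline is an illustrative assumption, not a model trained in this project.

```python
# Counts taken from the dataset overview above.
total_transactions = 284_807
fraud_cases = 492

fraud_rate = fraud_cases / total_transactions

# Hypothetical baseline: a classifier that labels every transaction as
# legitimate is right on all non-fraud rows and wrong on all fraud rows.
baseline_accuracy = 1 - fraud_rate

print(f"Fraud rate: {fraud_rate:.2%}")
print(f"Always-legitimate baseline accuracy: {baseline_accuracy:.2%}")
```

This is why accuracy alone says little here: a model that never flags fraud would still look nearly perfect.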

	---

	## 🧠 Model Used

	### `RandomForestClassifier` Configuration:

	```python
	from sklearn.ensemble import RandomForestClassifier

	rfc = RandomForestClassifier(
	    n_estimators=500,
	    max_depth=20,
	    min_samples_split=2,
	    min_samples_leaf=1,
	    max_features='sqrt',
	    bootstrap=True,
	    random_state=42,
	    n_jobs=-1
	)
	```
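
The original training script is not part of this README, so the following is only a minimal sketch of how a model with this configuration might be fitted. The synthetic data from `make_classification`, the 80/20 stratified split, and the reduced `n_estimators=50` are all assumptions chosen to keep the demo fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the real dataset (assumption):
# 30 features and roughly 2% positives so a small sample still has some.
X, y = make_classification(
    n_samples=3000,
    n_features=30,
    n_informative=10,
    weights=[0.98, 0.02],
    random_state=42,
)

# Stratify so both splits keep the same class ratio (assumed 80/20 split).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Same hyperparameters as above, except fewer trees to keep the demo fast.
rfc = RandomForestClassifier(
    n_estimators=50,
    max_depth=20,
    min_samples_split=2,
    min_samples_leaf=1,
    max_features="sqrt",
    bootstrap=True,
    random_state=42,
    n_jobs=-1,
)
rfc.fit(X_train, y_train)
print("Held-out accuracy:", rfc.score(X_test, y_test))
```

Stratifying the split matters with rare positives: without it, a small test set could end up with very few fraud cases.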

	---

	## 📊 Model Evaluation Metrics

	| Metric | Value |
	|----------------------------------------|--------|
	| Accuracy | 0.9996 |
	| Precision | 0.9747 |
	| Recall (Sensitivity) | 0.7857 |
	| F1 Score | 0.8701 |
	| Matthews Correlation Coefficient (MCC) | 0.8749 |

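	📌 Interpretation:
	- Accuracy is high largely because of class imbalance: labeling every transaction as legitimate would already score about 99.83% on this dataset.
	- Precision is high: about 97% of the transactions flagged as fraud are actual fraud.
	- Recall is moderate: about 21% of actual frauds are missed.
	- F1 score is the harmonic mean of precision and recall, so it reflects both.
	- MCC uses all four cells of the confusion matrix, so it stays informative under class imbalance.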
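
As a quick consistency check, F1 can be recomputed from the precision and recall reported in the table. The second half of the sketch shows the corresponding scikit-learn metric functions on toy labels that are made up purely for illustration and are not project results.

```python
from sklearn.metrics import (
    f1_score,
    matthews_corrcoef,
    precision_score,
    recall_score,
)

# Values copied from the metrics table above.
precision = 0.9747
recall = 0.7857

# F1 is the harmonic mean of precision and recall.
f1_from_table = 2 * precision * recall / (precision + recall)
print(f"F1 recomputed from the table: {f1_from_table:.4f}")

# Toy labels (made up for illustration): 1 = fraud, 0 = legitimate.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("f1:       ", f1_score(y_true, y_pred))
print("mcc:      ", matthews_corrcoef(y_true, y_pred))
```

The recomputed F1 rounds to 0.8701, which agrees with the table.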

	---

	## ⏱️ Performance Timing

	| Phase | Time (seconds) |
	|------------|----------------|
	| Training | 375.41 |
	| Prediction | 0.94 |
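
The README does not say how these timings were taken. One common way to measure them is with `time.perf_counter`, sketched below on a small synthetic dataset (an assumption), so the numbers it prints will not match the table.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic dataset (assumption) so the example finishes quickly.
X, y = make_classification(n_samples=2000, n_features=30, random_state=42)
model = RandomForestClassifier(n_estimators=50, random_state=42, n_jobs=-1)

# Time the training phase.
start = time.perf_counter()
model.fit(X, y)
training_seconds = time.perf_counter() - start

# Time the prediction phase.
start = time.perf_counter()
model.predict(X)
prediction_seconds = time.perf_counter() - start

print(f"Training:   {training_seconds:.2f} s")
print(f"Prediction: {prediction_seconds:.2f} s")
```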

	---

	## 📦 Exported Artifacts

	- `random_forest_model_fraud_classification.pkl`: Trained Random Forest model
	- `features.json`: Feature list used during training
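
For reference, artifacts like these could be produced with `joblib.dump` and `json.dump`. The sketch below is an assumption about the export step, not the project's actual script: it uses a small stand-in model, assumes `features.json` is a plain list of column names, and writes to a temporary directory.

```python
import json
import tempfile
from pathlib import Path

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Stand-in feature names and model (assumptions for illustration).
feature_names = ["Time"] + [f"V{i}" for i in range(1, 29)] + ["Amount"]
X, y = make_classification(
    n_samples=500, n_features=len(feature_names), random_state=42
)
model = RandomForestClassifier(n_estimators=10, random_state=42).fit(X, y)

# Write to a temporary directory so the demo cannot overwrite real artifacts.
out_dir = Path(tempfile.mkdtemp())
model_path = out_dir / "random_forest_model_fraud_classification.pkl"
features_path = out_dir / "features.json"

joblib.dump(model, model_path)
with open(features_path, "w") as f:
    json.dump(feature_names, f)

print("Saved:", model_path.name, "and", features_path.name)
```

Storing the feature list next to the model lets the loading code select the same columns, in the same order, that were used for training.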

	---

	## 🚀 Usage Guide

	### 1️⃣ Install Dependencies

	```bash
	pip install pandas scikit-learn joblib
	```

	---

	### 2️⃣ Load Model and Features

	```python
	import joblib
	import json
	import pandas as pd

	# Load the trained model
	model = joblib.load("random_forest_model_fraud_classification.pkl")

	# Load the feature list
	with open("features.json", "r") as f:
	    features = json.load(f)
	```

	---

	### 3️⃣ Prepare Input Data

	```python
	# Load your new transaction data
	df = pd.read_csv("your_new_transactions.csv")

	# Filter to keep only relevant features
	df = df[features]
	```
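
If a required column is missing, `df[features]` raises a `KeyError`. A small helper can report the missing columns more clearly. `select_features` below is a hypothetical helper written for this example, and the demo frame is made up.

```python
import pandas as pd


def select_features(df: pd.DataFrame, features: list[str]) -> pd.DataFrame:
    """Return df restricted to `features`, in that order.

    Raises ValueError listing any required columns that are missing.
    """
    missing = [name for name in features if name not in df.columns]
    if missing:
        raise ValueError(f"Input data is missing required columns: {missing}")
    return df[features]


# Tiny made-up frame with an extra column and a different column order.
demo = pd.DataFrame(
    {"Amount": [12.5], "V1": [-1.3], "Time": [0.0], "note": ["x"]}
)
print(select_features(demo, ["Time", "V1", "Amount"]).columns.tolist())
```

Selecting by the stored feature list also reorders the columns to match training, which matters because the model sees features by position.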

	---

	### 4️⃣ Make Predictions

	```python
	# Predict classes
	predictions = model.predict(df)

	# Predict fraud probability
	probabilities = model.predict_proba(df)[:, 1]

	print(predictions)
	print(probabilities)
	```

	---

	## 📌 Notes

	- Because of the high class imbalance, monitor precision and recall rather than accuracy alone.
	- Adjust the decision threshold applied to the predicted fraud probability to favor recall or precision, depending on your business needs.
	- The model performs well on the evaluation data, but fraud patterns change over time, so retrain it periodically with new data.
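
Threshold adjustment can be sketched as follows. The probabilities are made up and stand in for `model.predict_proba(df)[:, 1]`, and the 0.3 threshold is a hypothetical value, not a recommendation.

```python
import numpy as np

# Made-up fraud probabilities, standing in for model.predict_proba(df)[:, 1].
probabilities = np.array([0.02, 0.35, 0.48, 0.51, 0.93])

# For a binary model, predict() roughly corresponds to a 0.5 cutoff.
default_flags = (probabilities >= 0.5).astype(int)

# Hypothetical lower threshold: flags more transactions, favoring recall.
threshold = 0.3
custom_flags = (probabilities >= threshold).astype(int)

print("threshold 0.5:", default_flags)
print(f"threshold {threshold}:", custom_flags)
```

Lowering the threshold catches more fraud at the cost of more false alarms; raising it does the opposite. Choose the value on validation data, not on the data used for the final evaluation.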

	---

	## 🙏 Acknowledgements

	- Dataset provided by the Machine Learning Group of ULB (Université Libre de Bruxelles) and Worldline
	- Original research: Dal Pozzolo et al.
	- [Credit Card Fraud Detection - Kaggle](https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud)

	---

	## 📃 License

	This project is released under the Apache License 2.0. You are free to use, modify, and distribute it under the terms of that license.