kusssssssss committed on
Commit
a4c2c66
·
verified ·
1 Parent(s): 1b924d0

Update README.md

  1. README.md +163 -0
README.md CHANGED
@@ -1,3 +1,166 @@
1
  ---
2
  license: apache-2.0
3
  ---
1
  ---
2
  license: apache-2.0
3
+ language:
4
+ - en
5
+ - fr
6
+ metrics:
7
+ - accuracy
8
+ - f1
9
+ - recall
10
+ - precision
11
+ - matthews_correlation
12
+ pipeline_tag: tabular-classification
13
+ tags:
14
+ - finance
15
  ---
16
+ # 💳 Credit Card Fraud Detection with Random Forest
17
+
18
+ ## 📚 Project Description
19
+
20
+ This project detects fraudulent credit card transactions with a supervised machine learning approach. The dataset is highly imbalanced (fraud makes up about 0.17% of transactions), which makes the task close to a real-world anomaly detection problem. We trained a **Random Forest Classifier** and evaluated it with imbalance-aware metrics (precision, recall, F1, and MCC).
21
+
22
+ ---
23
+
24
+ ## 📁 Dataset Overview
25
+
26
+ - **Source**: [Kaggle - Credit Card Fraud Detection](https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud)
27
+ - **Description**: Transactions made by European cardholders in September 2013.
28
+ - **Total Samples**: 284,807 transactions
29
+ - **Fraudulent Cases**: 492 (~0.172%)
30
+ - **Features**:
31
+   - `Time`: Seconds elapsed since the first transaction in the dataset
32
+   - `Amount`: Transaction amount
33
+   - `V1` to `V28`: Principal components (PCA-transformed)
34
+   - `Class`: Target (0 = Legitimate, 1 = Fraudulent)
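
The imbalance figures above imply a baseline worth keeping in mind when reading any accuracy number. The sketch below uses only the two counts quoted in this section; the function names are illustrative and not part of the project.

```python
# Counts quoted in the Dataset Overview above.
TOTAL_TRANSACTIONS = 284_807
FRAUD_CASES = 492


def fraud_rate(fraud: int, total: int) -> float:
    """Fraction of transactions labelled as fraud."""
    return fraud / total


def majority_baseline_accuracy(fraud: int, total: int) -> float:
    """Accuracy of a trivial classifier that labels every transaction as legitimate."""
    return (total - fraud) / total


rate = fraud_rate(FRAUD_CASES, TOTAL_TRANSACTIONS)
baseline = majority_baseline_accuracy(FRAUD_CASES, TOTAL_TRANSACTIONS)

print(f"Fraud rate: {rate:.4%}")
print(f"Always-legitimate baseline accuracy: {baseline:.4%}")
```

Any model trained on this data has to beat that baseline on metrics other than accuracy to be useful.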
35
+
36
+ ---
37
+
38
+ ## 🧠 Model Used
39
+
40
+ ### `RandomForestClassifier` Configuration:
41
+
42
+ ```python
43
+ from sklearn.ensemble import RandomForestClassifier
44
+
45
+ rfc = RandomForestClassifier(
46
+     n_estimators=500,
47
+     max_depth=20,
48
+     min_samples_split=2,
49
+     min_samples_leaf=1,
50
+     max_features='sqrt',
51
+     bootstrap=True,
52
+     random_state=42,
53
+     n_jobs=-1
54
+ )
55
+ ```
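
Two of these settings are easier to reason about with numbers. The sketch below assumes that all 30 input columns (`Time`, `Amount`, `V1` to `V28`) are used as features, which this README does not state explicitly; the actual list lives in `features.json`. With `max_features='sqrt'`, scikit-learn examines `int(sqrt(n_features))` candidate features at each split, and with `bootstrap=True` each tree is fit on rows drawn with replacement, so it sees only about 63% of the distinct training rows. The training-set size used here is hypothetical.

```python
import math

N_FEATURES = 30  # assumption: Time, Amount and V1..V28 are all used


def features_per_split(n_features: int) -> int:
    """Candidate features examined at each split when max_features='sqrt'."""
    return max(1, int(math.sqrt(n_features)))


def expected_unique_fraction(n_samples: int) -> float:
    """Expected share of distinct rows in a bootstrap sample of size n_samples."""
    return 1.0 - (1.0 - 1.0 / n_samples) ** n_samples


split_candidates = features_per_split(N_FEATURES)
unique_share = expected_unique_fraction(200_000)  # hypothetical training-set size

print(f"Features considered per split: {split_candidates}")
print(f"Expected distinct rows per tree: {unique_share:.1%}")
```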
56
+
57
+ ---
58
+
59
+ ## 📊 Model Evaluation Metrics
60
+
61
+ | Metric | Value |
62
+ |----------------------------------|-----------|
63
+ | **Accuracy** | 0.9996 |
64
+ | **Precision** | 0.9747 |
65
+ | **Recall (Sensitivity)** | 0.7857 |
66
+ | **F1 Score** | 0.8701 |
67
+ | **Matthews Correlation Coefficient (MCC)** | 0.8749 |
68
+
69
+ 📌 **Interpretation**:
70
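+ - **High accuracy** is expected due to class imbalance: on this dataset, a model that labels every transaction as legitimate would already score about 99.8%.
71
+ - **Precision** is high: about 97% of predicted frauds are true frauds.
72
+ - **Recall** is moderate: roughly 21% of frauds are missed.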
73
+ - **F1 score** balances precision and recall.
74
+ - **MCC** gives a reliable measure even with class imbalance.
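
All five metrics in the table derive from the four confusion-matrix counts. This README does not report a confusion matrix, so the counts below are hypothetical, chosen only to be of a similar order to the table; `binary_metrics` is an illustrative helper, not project code.

```python
import math


def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute common binary-classification metrics from confusion counts.

    Assumes every denominator is non-zero (at least one predicted and one
    actual example of each class).
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_denominator
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mcc": mcc,
    }


# Hypothetical counts, not taken from the model's real confusion matrix.
metrics = binary_metrics(tp=77, fp=2, fn=21, tn=56_862)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Because true negatives dominate the accuracy numerator, accuracy barely moves when frauds are missed, while recall, F1, and MCC react strongly.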
75
+
76
+ ---
77
+
78
+ ## ⏱️ Performance Timing
79
+
80
+ | Phase | Time (seconds) |
81
+ |--------------------|----------------|
82
+ | Training | 375.41 |
83
+ | Prediction | 0.94 |
84
+
85
+ ---
86
+
87
+ ## 📦 Exported Artifacts
88
+
89
+ - `random_forest_model_fraud_classification.pkl`: Trained Random Forest model
90
+ - `features.json`: Feature list used during training
91
+
92
+ ---
93
+
94
+ ## 🚀 Usage Guide
95
+
96
+ ### 1️⃣ Install Dependencies
97
+
98
+ ```bash
99
+ pip install pandas scikit-learn joblib
100
+ ```
101
+
102
+ ---
103
+
104
+ ### 2️⃣ Load Model and Features
105
+
106
+ ```python
107
+ import joblib
108
+ import json
109
+ import pandas as pd
110
+
111
+ # Load the trained model
112
+ model = joblib.load("random_forest_model_fraud_classification.pkl")
113
+
114
+ # Load the feature list
115
+ with open("features.json", "r") as f:
116
+     features = json.load(f)
117
+ ```
118
+
119
+ ---
120
+
121
+ ### 3️⃣ Prepare Input Data
122
+
123
+ ```python
124
+ # Load your new transaction data
125
+ df = pd.read_csv("your_new_transactions.csv")
126
+
127
+ # Filter to keep only relevant features
128
+ df = df[features]
129
+ ```
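
`df[features]` raises a `KeyError` if any expected column is absent, so it can help to check first. The sketch below works on plain lists of column names to stay dependency-free. `missing_features` is a hypothetical helper; in practice `columns` would be `df.columns` and `features` the list loaded from `features.json` (assumed here to be a JSON array of column names).

```python
def missing_features(columns, features):
    """Return the expected feature names that are absent from `columns`, in order."""
    available = set(columns)
    return [name for name in features if name not in available]


# Hypothetical example: the CSV lacks the `Amount` column.
expected = ["Time", "V1", "V2", "Amount"]
csv_columns = ["Time", "V1", "V2", "Class"]

missing = missing_features(csv_columns, expected)
if missing:
    print(f"Input is missing required columns: {missing}")
```

Extra columns such as `Class` are harmless here, since selecting by the feature list drops them.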
130
+
131
+ ---
132
+
133
+ ### 4️⃣ Make Predictions
134
+
135
+ ```python
136
+ # Predict classes
137
+ predictions = model.predict(df)
138
+
139
+ # Predict fraud probability
140
+ probabilities = model.predict_proba(df)[:, 1]
141
+
142
+ print(predictions)
143
+ print(probabilities)
144
+ ```
145
+
146
+ ---
147
+
148
+ ## 📌 Notes
149
+
150
+ - Because of the **high class imbalance**, monitor precision and recall rather than accuracy alone.
151
+ - Adjust the decision threshold to optimize for recall or precision depending on your business needs.
152
+ - The reported metrics suggest the model generalizes well, but it should be retrained periodically with new data as fraud patterns change.
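
Threshold adjustment can be sketched without any library. Here `probabilities` stands in for the output of `model.predict_proba(df)[:, 1]`, and both the scores and the labels are made up for illustration. For a binary random forest, `model.predict` corresponds to a cut-off of about 0.5; lowering the cut-off typically trades precision for recall.

```python
def apply_threshold(probabilities, threshold):
    """Label a transaction as fraud (1) when its score reaches the threshold."""
    return [1 if p >= threshold else 0 for p in probabilities]


def precision_recall(y_true, y_pred):
    """Precision and recall for the fraud class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up fraud probabilities and ground-truth labels.
probabilities = [0.05, 0.10, 0.35, 0.40, 0.65, 0.90]
labels = [0, 0, 1, 0, 1, 1]

for threshold in (0.5, 0.3):
    predicted = apply_threshold(probabilities, threshold)
    precision, recall = precision_recall(labels, predicted)
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

In this toy data the lower threshold catches the fraud scored at 0.35 but also flags one legitimate transaction, which is the trade-off to weigh against business costs.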
153
+
154
+ ---
155
+
156
+ ## 🙏 Acknowledgements
157
+
158
+ - Dataset provided by ULB & Worldline
159
+ - Original research: *Dal Pozzolo et al.*
160
+ - [Credit Card Fraud Detection - Kaggle](https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud)
161
+
162
+ ---
163
+
164
+ ## 📃 License
165
+
166
+ MIT License – free to use, modify, and distribute with attribution.