Abs6187 committed
Commit c5ec08c · verified · 1 Parent(s): b78fe73

Upload 12 files

.gitattributes CHANGED
@@ -33,3 +33,6 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+engineered_data.csv filter=lfs diff=lfs merge=lfs -text
+preprocessed_data.csv filter=lfs diff=lfs merge=lfs -text
+uploaded_data.csv filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,13 +1,13 @@
 ---
-title: Fraud Detection API Excecute4 Part2
-emoji: 😻
-colorFrom: red
+title: Financial Fraud Detection
+emoji: 👁
+colorFrom: yellow
 colorTo: yellow
 sdk: streamlit
 sdk_version: 1.43.2
 app_file: app.py
 pinned: false
-short_description: SabPaisa_financial_frauds
+short_description: Detects Financial Frauds
 ---
 
 Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
app.py ADDED
@@ -0,0 +1,1807 @@
1
+ """
2
+ Financial Fraud Detection System - TechMatrix Solvers
3
+ Team Members:
4
+ - Abhay Gupta
5
+ - Jay Kumar
6
+ - Kripanshu Gupta
7
+ - Bhumika Patel
8
+
9
+ A comprehensive fraud detection system using machine learning algorithms.
10
+ """
11
+
12
+ import streamlit as st
13
+ import pandas as pd
14
+ import numpy as np
15
+ import matplotlib.pyplot as plt
16
+ import seaborn as sns
17
+ import plotly.express as px
18
+ import plotly.graph_objects as go
19
+ import os
20
+ import pickle
21
+ import time
22
+ import warnings
23
+ from sklearn.preprocessing import StandardScaler
24
+ from sklearn.model_selection import train_test_split
25
+ from sklearn.linear_model import LogisticRegression
26
+ from sklearn.ensemble import RandomForestClassifier
27
+ from xgboost import XGBClassifier
28
+ from sklearn.metrics import (
29
+ accuracy_score, precision_score, recall_score, f1_score,
30
+ roc_auc_score, confusion_matrix, classification_report, roc_curve
31
+ )
32
+ from imblearn.over_sampling import SMOTE
33
+
34
+ # Suppress warnings
35
+ warnings.filterwarnings('ignore')
36
+
37
+ # Set page configuration
38
+ st.set_page_config(
39
+ page_title="TechMatrix Fraud Detection System",
40
+ page_icon="🔒",
41
+ layout="wide",
42
+ initial_sidebar_state="collapsed"
43
+ )
44
+
45
+ # Custom CSS for better styling
46
+ st.markdown("""
47
+ <style>
48
+ /* Main theme colors */
49
+ :root {
50
+ --primary: #2E7D32;
51
+ --primary-light: #81C784;
52
+ --primary-dark: #1B5E20;
53
+ --secondary: #1976D2;
54
+ --secondary-light: #64B5F6;
55
+ --text-on-primary: #FFFFFF;
56
+ --text-primary: #212121;
57
+ --text-secondary: #757575;
58
+ --background: #F5F5F5;
59
+ --card-bg: #FFFFFF;
60
+ --success: #43A047;
61
+ --warning: #FFA000;
62
+ --error: #D32F2F;
63
+ --info: #1976D2;
64
+ }
65
+
66
+ /* Base styles */
67
+ .main-header {
68
+ font-size: 2.8rem;
69
+ color: var(--primary);
70
+ text-align: center;
71
+ margin-bottom: 1.5rem;
72
+ font-weight: 700;
73
+ background: linear-gradient(90deg, var(--primary), var(--secondary));
74
+ -webkit-background-clip: text;
75
+ -webkit-text-fill-color: transparent;
76
+ padding: 0.5rem 0;
77
+ }
78
+
79
+ .sub-header {
80
+ font-size: 2rem;
81
+ color: var(--primary-dark);
82
+ margin-top: 2rem;
83
+ margin-bottom: 1rem;
84
+ font-weight: 600;
85
+ border-bottom: 2px solid var(--primary-light);
86
+ padding-bottom: 0.5rem;
87
+ }
88
+
89
+ .metric-card {
90
+ text-align: center;
91
+ padding: 1.2rem;
92
+ border-radius: 0.8rem;
93
+ background-color: rgba(46, 125, 50, 0.1);
94
+ transition: transform 0.3s ease;
95
+ border-left: 4px solid var(--primary);
96
+ }
97
+
98
+ .metric-card:hover {
99
+ transform: translateY(-5px);
100
+ background-color: rgba(46, 125, 50, 0.15);
101
+ }
102
+
103
+ .metric-value {
104
+ font-size: 2.5rem;
105
+ font-weight: 700;
106
+ color: var(--primary);
107
+ margin: 0.5rem 0;
108
+ }
109
+
110
+ .metric-label {
111
+ font-size: 1rem;
112
+ color: var(--text-secondary);
113
+ margin-bottom: 0.5rem;
114
+ }
115
+
116
+ div[data-testid="stMetric"] {
117
+ background-color: rgba(46, 125, 50, 0.1);
118
+ padding: 1rem;
119
+ border-radius: 0.8rem;
120
+ border-left: 4px solid var(--primary);
121
+ transition: transform 0.3s ease;
122
+ }
123
+
124
+ div[data-testid="stMetric"]:hover {
125
+ transform: translateY(-5px);
126
+ background-color: rgba(46, 125, 50, 0.15);
127
+ }
128
+
129
+ div[data-testid="stMetric"] > div {
130
+ gap: 0.2rem;
131
+ }
132
+
133
+ div[data-testid="stMetric"] label {
134
+ color: var(--text-secondary) !important;
135
+ }
136
+
137
+ div[data-testid="stMetric"] .css-1wivap2 {
138
+ color: var(--primary) !important;
139
+ }
140
+
141
+ .stButton > button {
142
+ background-color: var(--primary);
143
+ color: var(--text-on-primary);
144
+ border-radius: 0.5rem;
145
+ padding: 0.5rem 1rem;
146
+ font-weight: 600;
147
+ border: none;
148
+ transition: all 0.3s ease;
149
+ }
150
+
151
+ .stButton > button:hover {
152
+ background-color: var(--primary-dark);
153
+ box-shadow: 0 4px 8px rgba(0, 0, 0, 0.2);
154
+ transform: translateY(-2px);
155
+ }
156
+
157
+ .stProgress > div > div > div {
158
+ background-color: var(--primary);
159
+ background-image: linear-gradient(45deg,
160
+ rgba(255,255,255,.15) 25%,
161
+ transparent 25%,
162
+ transparent 50%,
163
+ rgba(255,255,255,.15) 50%,
164
+ rgba(255,255,255,.15) 75%,
165
+ transparent 75%,
166
+ transparent
167
+ );
168
+ background-size: 1rem 1rem;
169
+ animation: progress-animation 1s linear infinite;
170
+ }
171
+
172
+ @keyframes progress-animation {
173
+ 0% { background-position: 0 0; }
174
+ 100% { background-position: 1rem 0; }
175
+ }
176
+
177
+ .success-text {
178
+ color: var(--success);
179
+ font-weight: bold;
180
+ }
181
+
182
+ .warning-text {
183
+ color: var(--warning);
184
+ font-weight: bold;
185
+ }
186
+
187
+ .error-text {
188
+ color: var(--error);
189
+ font-weight: bold;
190
+ }
191
+
192
+ .info-text {
193
+ color: var(--info);
194
+ font-weight: bold;
195
+ }
196
+
197
+ @keyframes fadeIn {
198
+ from { opacity: 0; }
199
+ to { opacity: 1; }
200
+ }
201
+
202
+ .animate-fade-in {
203
+ animation: fadeIn 0.8s ease-in-out;
204
+ }
205
+
206
+ [data-testid="stSidebarNav"] ul li:nth-child(2) {
207
+ display: none;
208
+ }
209
+
210
+ .dataframe {
211
+ border-collapse: collapse;
212
+ border: none;
213
+ font-size: 0.9rem;
214
+ }
215
+
216
+ .dataframe th {
217
+ background-color: var(--primary-light);
218
+ color: var(--text-primary);
219
+ padding: 0.5rem;
220
+ text-align: left;
221
+ }
222
+
223
+ .dataframe td {
224
+ padding: 0.5rem;
225
+ border-bottom: 1px solid #eee;
226
+ }
227
+
228
+ .dataframe tr:hover {
229
+ background-color: #f5f5f5;
230
+ }
231
+
232
+ .stSlider > div > div {
233
+ background-color: var(--primary-light);
234
+ }
235
+
236
+ .stSelectbox > div > div {
237
+ background-color: var(--card-bg);
238
+ border-radius: 0.5rem;
239
+ border: 1px solid var(--primary-light);
240
+ }
241
+
242
+ @keyframes pulse {
243
+ 0% { opacity: 0.6; }
244
+ 50% { opacity: 1; }
245
+ 100% { opacity: 0.6; }
246
+ }
247
+
248
+ .loading-pulse {
249
+ animation: pulse 1.5s infinite ease-in-out;
250
+ }
251
+ </style>
252
+ """, unsafe_allow_html=True)
253
+
254
+ # Create necessary directories
255
+ os.makedirs("data", exist_ok=True)
256
+ os.makedirs("models", exist_ok=True)
257
+
258
+ # Initialize session state
259
+ if 'current_page' not in st.session_state:
260
+ st.session_state['current_page'] = 'home'
261
+
262
+ if 'data' not in st.session_state:
263
+ st.session_state['data'] = None
264
+
265
+ if 'preprocessed_data' not in st.session_state:
266
+ st.session_state['preprocessed_data'] = None
267
+
268
+ if 'engineered_data' not in st.session_state:
269
+ st.session_state['engineered_data'] = None
270
+
271
+ if 'target_col' not in st.session_state:
272
+ st.session_state['target_col'] = 'Class'
273
+
274
+ if 'trained_models' not in st.session_state:
275
+ st.session_state['trained_models'] = {}
276
+
277
+ if 'predictions' not in st.session_state:
278
+ st.session_state['predictions'] = None
279
+
280
+ if 'progress' not in st.session_state:
281
+ st.session_state['progress'] = 0
282
+
283
+ # Main title
284
+ st.markdown("<div class='animate-fade-in'><h1 class='main-header'>TechMatrix Fraud Detection System</h1></div>", unsafe_allow_html=True)
285
+
286
+ # Team information
287
+ st.markdown("""
288
+ <div style='text-align: center; margin-bottom: 2rem;'>
289
+ <h3>Team TechMatrix Solvers</h3>
290
+ <p>Abhay Gupta | Jay Kumar | Kripanshu Gupta | Bhumika Patel</p>
291
+ </div>
292
+ """, unsafe_allow_html=True)
293
+
294
+ # Home Page
295
+ if st.session_state['current_page'] == 'home':
296
+ # Introduction section
297
+ st.markdown("<div class='animate-fade-in'><h2 class='sub-header'>Welcome to TechMatrix Fraud Detection System</h2></div>", unsafe_allow_html=True)
298
+
299
+ col1, col2 = st.columns([2, 1])
300
+
301
+ with col1:
302
+ st.markdown("""
303
+ Our advanced fraud detection system leverages cutting-edge machine learning algorithms to identify and prevent fraudulent transactions in real-time.
304
+
305
+ ### Understanding Financial Fraud
306
+
307
+ Financial fraud encompasses various deceptive practices aimed at unauthorized acquisition of funds or assets.
308
+ Our system specifically addresses:
309
+ - Credit card transaction fraud
310
+ - Identity theft incidents
311
+ - Account compromise attempts
312
+ - Suspicious transaction patterns
313
+
314
+ ### Machine Learning Implementation
315
+
316
+ Our system employs sophisticated machine learning models that analyze transaction patterns and behavioral data.
317
+ The models are trained on historical fraud data and continuously updated to adapt to emerging fraud patterns.
318
+
319
+ ### System Advantages:
320
+ - **Real-time Monitoring**: Instant detection of suspicious activities
321
+ - **Scalable Processing**: Efficient handling of large transaction volumes
322
+ - **Pattern Recognition**: Advanced detection of complex fraud patterns
323
+ - **Risk Assessment**: Probability-based fraud scoring system
324
+ """)
325
+
326
+ with col2:
327
+ # Create a unique visualization of the fraud detection process
328
+ fig = go.Figure()
329
+
330
+ # Create a hexagonal flow diagram
331
+ angles = np.linspace(0, 2*np.pi, 6, endpoint=False)
332
+ x = 0.5 + 0.4 * np.cos(angles)
333
+ y = 0.5 + 0.4 * np.sin(angles)
334
+
335
+ # Add connecting lines with gradient effect
336
+ for i in range(len(angles)):
337
+ next_i = (i + 1) % len(angles)
338
+ fig.add_trace(go.Scatter(
339
+ x=[x[i], x[next_i]],
340
+ y=[y[i], y[next_i]],
341
+ mode='lines',
342
+ line=dict(
343
+ color='rgba(46, 125, 50, 0.5)',
344
+ width=2,
345
+ dash='dot'
346
+ ),
347
+ showlegend=False
348
+ ))
349
+
350
+ # Add nodes with updated colors and labels
351
+ node_labels = ['Input Data', 'Validation', 'Processing', 'Analysis', 'Detection', 'Action']
352
+ node_colors = ['#2E7D32', '#43A047', '#81C784', '#1976D2', '#64B5F6', '#D32F2F']
353
+
354
+ for i in range(len(angles)):
355
+ fig.add_trace(go.Scatter(
356
+ x=[x[i]],
357
+ y=[y[i]],
358
+ mode='markers+text',
359
+ marker=dict(
360
+ size=30,
361
+ color=node_colors[i],
362
+ symbol='hexagon'
363
+ ),
364
+ text=node_labels[i],
365
+ textposition="middle center",
366
+ textfont=dict(color='white', size=12),
367
+ showlegend=False
368
+ ))
369
+
370
+ # Add title in the center with updated styling
371
+ fig.add_trace(go.Scatter(
372
+ x=[0.5],
373
+ y=[0.5],
374
+ mode='text',
375
+ text='Fraud<br>Detection<br>Pipeline',
376
+ textposition="middle center",
377
+ textfont=dict(
378
+ color='#212121',
379
+ size=14,
380
+ family='Arial, bold'
381
+ ),
382
+ showlegend=False
383
+ ))
384
+
385
+ fig.update_layout(
386
+ height=400,
387
+ width=400,
388
+ margin=dict(l=0, r=0, t=0, b=0),
389
+ xaxis=dict(
390
+ showgrid=False,
391
+ zeroline=False,
392
+ showticklabels=False,
393
+ range=[0, 1]
394
+ ),
395
+ yaxis=dict(
396
+ showgrid=False,
397
+ zeroline=False,
398
+ showticklabels=False,
399
+ range=[0, 1]
400
+ ),
401
+ plot_bgcolor='rgba(0,0,0,0)'
402
+ )
403
+
404
+ st.plotly_chart(fig)
405
+
406
+ # Workflow section
407
+ st.markdown("<div class='animate-fade-in'><h2 class='sub-header'>System Workflow</h2></div>", unsafe_allow_html=True)
408
+
409
+ col1, col2, col3, col4 = st.columns(4)
410
+
411
+ with col1:
412
+ st.markdown("### 1. Data Ingestion")
413
+ st.markdown("Secure upload and validation of transaction data in CSV format.")
414
+ st.image("https://cdn-icons-png.flaticon.com/512/4208/4208479.png", width=100)
415
+
416
+ with col2:
417
+ st.markdown("### 2. Data Processing")
418
+ st.markdown("Advanced data cleaning and preparation for analysis.")
419
+ st.image("https://cdn-icons-png.flaticon.com/512/1875/1875627.png", width=100)
420
+
421
+ with col3:
422
+ st.markdown("### 3. Feature Extraction")
423
+ st.markdown("Intelligent feature engineering and pattern recognition.")
424
+ st.image("https://cdn-icons-png.flaticon.com/512/2103/2103633.png", width=100)
425
+
426
+ with col4:
427
+ st.markdown("### 4. Model Deployment")
428
+ st.markdown("Real-time fraud detection and risk assessment.")
429
+ st.image("https://cdn-icons-png.flaticon.com/512/2103/2103658.png", width=100)
430
+
431
+ # Sample visualizations section
432
+ st.markdown("<div class='animate-fade-in'><h2 class='sub-header'>System Analytics</h2></div>", unsafe_allow_html=True)
433
+
434
+ col1, col2 = st.columns(2)
435
+
436
+ with col1:
437
+ # Sample ROC curve with improved styling
438
+ fig = go.Figure()
439
+ fpr = [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
440
+ tpr_lr = [0, 0.4, 0.55, 0.68, 0.75, 0.8, 0.85, 0.9, 0.94, 0.98, 1.0]
441
+ tpr_rf = [0, 0.5, 0.65, 0.78, 0.85, 0.88, 0.91, 0.95, 0.97, 0.99, 1.0]
442
+ tpr_xgb = [0, 0.55, 0.7, 0.8, 0.87, 0.9, 0.93, 0.96, 0.98, 0.99, 1.0]
443
+
444
+ fig.add_trace(go.Scatter(
445
+ x=fpr,
446
+ y=tpr_lr,
447
+ mode='lines',
448
+ name='Logistic Regression (AUC = 0.85)',
449
+ line=dict(color='#2E7D32', width=3)
450
+ ))
451
+ fig.add_trace(go.Scatter(
452
+ x=fpr,
453
+ y=tpr_rf,
454
+ mode='lines',
455
+ name='Random Forest (AUC = 0.92)',
456
+ line=dict(color='#1976D2', width=3)
457
+ ))
458
+ fig.add_trace(go.Scatter(
459
+ x=fpr,
460
+ y=tpr_xgb,
461
+ mode='lines',
462
+ name='XGBoost (AUC = 0.94)',
463
+ line=dict(color='#D32F2F', width=3)
464
+ ))
465
+ fig.add_trace(go.Scatter(
466
+ x=[0, 1],
467
+ y=[0, 1],
468
+ mode='lines',
469
+ name='Random',
470
+ line=dict(dash='dash', color='#757575', width=2)
471
+ ))
472
+
473
+ fig.update_layout(
474
+ title='Model Performance Comparison',
475
+ xaxis_title='False Positive Rate',
476
+ yaxis_title='True Positive Rate',
477
+ legend=dict(x=0.01, y=0.99),
478
+ width=600,
479
+ height=400,
480
+ template='plotly_white',
481
+ margin=dict(l=40, r=40, t=40, b=40)
482
+ )
483
+
484
+ st.plotly_chart(fig)
485
+
486
+ with col2:
487
+ # Sample feature importance with improved styling
488
+ features = ['Transaction Amount', 'Time of Day', 'Merchant Category', 'Location', 'Transaction Frequency',
489
+ 'Device Used', 'IP Address', 'Account Age', 'Previous Fraud Flag', 'Transaction Type']
490
+ importance = [0.23, 0.18, 0.15, 0.12, 0.09, 0.08, 0.06, 0.04, 0.03, 0.02]
491
+
492
+ fig = px.bar(
493
+ x=importance,
494
+ y=features,
495
+ orientation='h',
496
+ title='Feature Importance Analysis',
497
+ labels={'x': 'Importance Score', 'y': 'Feature'},
498
+ color=importance,
499
+ color_continuous_scale=['#2E7D32', '#43A047', '#81C784']
500
+ )
501
+
502
+ fig.update_layout(
503
+ width=600,
504
+ height=400,
505
+ template='plotly_white',
506
+ margin=dict(l=40, r=40, t=40, b=40)
507
+ )
508
+ st.plotly_chart(fig)
509
+
510
+ # Get started button
511
+ st.markdown("<div style='text-align: center; margin-top: 2rem;'>", unsafe_allow_html=True)
512
+ if st.button("Get Started", key="get_started", help="Begin the fraud detection process"):
513
+ st.session_state['current_page'] = 'upload'
514
+ st.rerun()
515
+ st.markdown("</div>", unsafe_allow_html=True)
516
+
517
+ # Data Upload Page
518
+ elif st.session_state['current_page'] == 'upload':
519
+ st.markdown("<div class='animate-fade-in'><h2 class='sub-header'>Step 1: Data Ingestion</h2></div>", unsafe_allow_html=True)
520
+
521
+ # File uploader with size limit warning
522
+ st.markdown("""
523
+ ### Secure Data Upload
524
+
525
+ Upload your transaction data securely in CSV format. The system supports the following:
526
+
527
+ - Transaction details (amount, timestamp, location, etc.)
528
+ - Target column for fraud classification (default: 'Class' with 0 for normal, 1 for fraud)
529
+ - **Maximum file size: 200 MB**
530
+
531
+ For testing purposes, you can use the [Credit Card Fraud Detection dataset](https://www.kaggle.com/mlg-ulb/creditcardfraud) from Kaggle.
532
+
533
+ ### Data Requirements:
534
+ - CSV format with UTF-8 encoding
535
+ - No missing values in critical fields
536
+ - Proper date/time formatting
537
+ - Numeric values for transaction amounts
538
+ """)
539
+
540
+ uploaded_file = st.file_uploader(
541
+ "Upload transaction data (CSV file)",
542
+ type="csv",
543
+ help="Maximum file size: 200 MB"
544
+ )
545
+
546
+ if uploaded_file is not None:
547
+ # Check file size (200 MB limit)
548
+ file_details = {"FileName": uploaded_file.name, "FileType": uploaded_file.type}
549
+
550
+ # Read the file into a buffer to check its size
551
+ file_buffer = uploaded_file.getvalue()
552
+ file_size_mb = len(file_buffer) / (1024 * 1024)
553
+
554
+ if file_size_mb > 200:
555
+ st.error(f"File size exceeds the 200 MB limit. Your file is {file_size_mb:.2f} MB. Please upload a smaller file.")
556
+ st.stop()
557
+ else:
558
+ st.info(f"File size: {file_size_mb:.2f} MB")
559
+
560
+ # Load data with progress bar
561
+ progress_bar = st.progress(0)
562
+ status_text = st.empty()
563
+
564
+ status_text.text("Initializing data ingestion...")
565
+ progress_bar.progress(25)
566
+ time.sleep(0.3)
567
+
568
+ try:
569
+ # Use BytesIO to avoid loading the file twice
570
+ from io import BytesIO
571
+ df = pd.read_csv(BytesIO(file_buffer))
572
+ st.session_state['data'] = df
573
+
574
+ progress_bar.progress(50)
575
+ status_text.text("Validating data structure...")
576
+ time.sleep(0.3)
577
+
578
+ progress_bar.progress(75)
579
+ status_text.text("Preparing data preview...")
580
+ time.sleep(0.3)
581
+
582
+ progress_bar.progress(100)
583
+ status_text.text("Data ingestion completed!")
584
+ time.sleep(0.3)
585
+
586
+ status_text.empty()
587
+ progress_bar.empty()
588
+
589
+ # Show basic data info
590
+ st.success(f"Data ingested successfully! Shape: {df.shape[0]} rows and {df.shape[1]} columns")
591
+
592
+ col1, col2 = st.columns(2)
593
+
594
+ with col1:
595
+ st.subheader("Data Preview")
596
+ st.dataframe(df.head())
597
+
598
+ with col2:
599
+ st.subheader("Data Structure")
600
+
601
+ # Display data types and missing values
602
+ data_info = pd.DataFrame({
603
+ 'Data Type': df.dtypes,
604
+ 'Non-Null Count': df.count(),
605
+ 'Missing Values': df.isnull().sum(),
606
+ 'Unique Values': [df[col].nunique() for col in df.columns]
607
+ })
608
+
609
+ st.dataframe(data_info)
610
+
611
+ # Check for target column
612
+ if 'Class' in df.columns:
613
+ fraud_count = df['Class'].sum()
614
+ total_count = len(df)
615
+ fraud_percentage = (fraud_count / total_count) * 100
616
+
617
+ st.info(f"Target column 'Class' detected with {fraud_count} fraud cases ({fraud_percentage:.2f}% of data)")
618
+ else:
619
+ st.warning("No 'Class' column detected. You'll need to specify the target column in the next step.")
620
+ except Exception as e:
621
+ st.error(f"Error during data ingestion: {str(e)}")
622
+ st.info("Please ensure the file is a valid CSV with proper formatting.")
623
+
624
+ # Navigation buttons
625
+ col1, col2 = st.columns([1, 5])
626
+
627
+ with col1:
628
+ if st.button("← Back to Home", key="back_to_home"):
629
+ st.session_state['current_page'] = 'home'
630
+ st.rerun()
631
+
632
+ with col2:
633
+ if st.session_state['data'] is not None:
634
+ if st.button("Continue to Data Processing →", key="to_preprocess"):
635
+ st.session_state['current_page'] = 'preprocess'
636
+ st.rerun()
637
+
638
+ # Data Preprocessing Page
639
+ elif st.session_state['current_page'] == 'preprocess':
640
+ st.markdown("<div class='animate-fade-in'><h2 class='sub-header'>Step 2: Data Processing</h2></div>", unsafe_allow_html=True)
641
+
642
+ if st.session_state['data'] is None:
643
+ st.error("No data found. Please upload data first.")
644
+ if st.button("Go back to Data Ingestion"):
645
+ st.session_state['current_page'] = 'upload'
646
+ st.rerun()
647
+ else:
648
+ df = st.session_state['data']
649
+
650
+ st.markdown("""
651
+ ### Advanced Data Processing
652
+
653
+ Enhance your data quality through our comprehensive processing pipeline. The system will:
654
+ - Handle missing values intelligently
655
+ - Remove statistical outliers
656
+ - Normalize numerical features
657
+ - Balance class distribution
658
+
659
+ Select the processing options below to customize the pipeline.
660
+ """)
661
+
662
+ # Target column selection
663
+ if 'Class' in df.columns:
664
+ target_col = 'Class'
665
+ st.info(f"Target column 'Class' detected with values: {df[target_col].unique()}")
666
+ else:
667
+ target_col = st.selectbox("Select the target column (fraud indicator)", df.columns)
668
+
669
+ st.session_state['target_col'] = target_col
670
+
671
+ # Preprocessing options
672
+ st.subheader("Processing Options")
673
+
674
+ col1, col2 = st.columns(2)
675
+
676
+ with col1:
677
+ handle_missing = st.checkbox("Handle Missing Values", value=True,
678
+ help="Fill missing numerical values with mean and categorical values with mode")
679
+ remove_outliers = st.checkbox("Remove Outliers", value=False,
680
+ help="Remove extreme values that might affect model performance")
681
+
682
+ with col2:
683
+ normalize_data = st.checkbox("Normalize Data", value=True,
684
+ help="Scale numerical features to have zero mean and unit variance")
685
+ balance_classes = st.checkbox("Balance Classes", value=True,
686
+ help="Handle class imbalance using SMOTE in the training phase")
687
+
688
+ # Handle missing values
689
+ if st.button("Process Data"):
690
+ with st.spinner("Processing data..."):
691
+ # Create a copy of the dataframe
692
+ df_processed = df.copy()
693
+
694
+ # Progress bar
695
+ progress_bar = st.progress(0)
696
+ status_text = st.empty()
697
+
698
+ # Handle missing values
699
+ if handle_missing:
700
+ status_text.text("Processing missing values...")
701
+ progress_bar.progress(25)
702
+ time.sleep(0.3)
703
+
704
+ for col in df_processed.columns:
705
+ if df_processed[col].dtype in ['int64', 'float64']:
706
+ df_processed[col] = df_processed[col].fillna(df_processed[col].mean())
707
+ else:
708
+ df_processed[col] = df_processed[col].fillna(df_processed[col].mode()[0])
709
+
710
+ # Remove outliers if selected
711
+ if remove_outliers:
712
+ status_text.text("Processing outliers...")
713
+ progress_bar.progress(50)
714
+ time.sleep(0.3)
715
+
716
+ # Only apply to numerical columns
717
+ num_cols = df_processed.select_dtypes(include=['int64', 'float64']).columns
718
+ for col in num_cols:
719
+ if col != target_col: # Don't remove outliers from target column
720
+ Q1 = df_processed[col].quantile(0.25)
721
+ Q3 = df_processed[col].quantile(0.75)
722
+ IQR = Q3 - Q1
723
+ lower_bound = Q1 - 3 * IQR
724
+ upper_bound = Q3 + 3 * IQR
725
+ df_processed = df_processed[(df_processed[col] >= lower_bound) &
726
+ (df_processed[col] <= upper_bound)]
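# Worked example of the 3*IQR fence above (illustrative sketch, not part of
# the committed file): for s = [1, 12, 15, 20, 28, 500], Q1 = 12.75 and
# Q3 = 26.0, so IQR = 13.25 and the keep-range is [-27.0, 65.75] -- the 500
# is dropped while the low value 1 is kept, since 3*IQR is a deliberately
# loose fence that only removes extreme points.
import pandas as pd
s = pd.Series([1, 12, 15, 20, 28, 500])
q1, q3 = s.quantile(0.25), s.quantile(0.75)
kept = s[(s >= q1 - 3 * (q3 - q1)) & (s <= q3 + 3 * (q3 - q1))]
print(kept.tolist())  # [1, 12, 15, 20, 28]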
727
+
728
+ # Store the processed data
729
+ status_text.text("Finalizing data processing...")
730
+ progress_bar.progress(100)
731
+ time.sleep(0.3)
732
+
733
+ st.session_state['preprocessed_data'] = df_processed
734
+
735
+ status_text.empty()
736
+ progress_bar.empty()
737
+
738
+ st.success("Data processing completed!")
739
+
740
+ # Show class distribution
741
+ if target_col in df_processed.columns:
742
+ st.subheader("Class Distribution After Processing")
743
+
744
+ col1, col2 = st.columns(2)
745
+
746
+ with col1:
747
+ # Create pie chart with improved styling
748
+ labels = ['Normal', 'Fraud']
749
+ values = [len(df_processed[df_processed[target_col] == 0]),
750
+ len(df_processed[df_processed[target_col] == 1])]
751
+
752
+ fig = px.pie(
753
+ values=values,
754
+ names=labels,
755
+ title='Transaction Distribution',
756
+ color_discrete_sequence=['#2E7D32', '#D32F2F'],
757
+ hole=0.4
758
+ )
759
+
760
+ fig.update_traces(textposition='inside', textinfo='percent+label')
761
+ fig.update_layout(
762
+ template='plotly_white',
763
+ margin=dict(l=20, r=20, t=30, b=20)
764
+ )
765
+ st.plotly_chart(fig)
766
+
767
+ with col2:
768
+ # Calculate statistics
769
+ fraud_count = df_processed[target_col].sum()
770
+ total_count = len(df_processed)
771
+ fraud_percentage = (fraud_count / total_count) * 100
772
+
773
+ st.metric("Total Transactions", f"{total_count:,}")
774
+ st.metric("Fraud Transactions", f"{fraud_count:,}")
775
+ st.metric("Fraud Percentage", f"{fraud_percentage:.2f}%")
776
+
777
+ if fraud_percentage < 1:
778
+ st.warning("Your dataset is highly imbalanced. Class balancing will be applied during model training.")
779
+
780
+ # Navigation buttons
781
+ col1, col2 = st.columns([1, 5])
782
+
783
+ with col1:
784
+ if st.button("← Back to Upload", key="back_to_upload"):
785
+ st.session_state['current_page'] = 'upload'
786
+ st.rerun()
787
+
788
+ with col2:
789
+ if st.session_state['preprocessed_data'] is not None:
790
+ if st.button("Continue to Feature Extraction →", key="to_feature_eng"):
791
+ st.session_state['current_page'] = 'feature_engineering'
792
+ st.rerun()
793
+
794
+ # Feature Engineering Page
795
+ elif st.session_state['current_page'] == 'feature_engineering':
796
+ st.markdown("<div class='animate-fade-in'><h2 class='sub-header'>Step 3: Feature Extraction</h2></div>", unsafe_allow_html=True)
797
+
798
+ if st.session_state['preprocessed_data'] is None:
799
+ st.error("No processed data found. Please complete data processing first.")
800
+ if st.button("Go back to Data Processing"):
801
+ st.session_state['current_page'] = 'preprocess'
802
+ st.rerun()
803
+ else:
804
+ df_processed = st.session_state['preprocessed_data']
805
+ target_col = st.session_state['target_col']
806
+
807
+ st.markdown("""
808
+ ### Intelligent Feature Extraction
809
+
810
+ Enhance your fraud detection capabilities through advanced feature engineering. Our system provides:
811
+ - Time-based pattern analysis
812
+ - Transaction amount profiling
813
+ - Behavioral feature extraction
814
+ - Cross-feature interaction analysis
815
+
816
+ Select the features to extract below to optimize your model's performance.
817
+ """)
818
+
819
+ # Feature engineering options
820
+ st.subheader("Feature Extraction Options")
821
+
822
+ col1, col2 = st.columns(2)
823
+
824
+ with col1:
825
+ create_time_features = st.checkbox("Time-based Features", value=True,
826
+ help="Extract temporal patterns and behavioral indicators")
827
+ create_amount_features = st.checkbox("Amount-based Features", value=True,
828
+ help="Generate transaction amount profiles and risk indicators")
829
+
830
+ with col2:
831
+ create_aggregations = st.checkbox("Aggregation Features", value=False,
832
+ help="Create aggregated metrics for transaction patterns")
833
+ create_interactions = st.checkbox("Interaction Features", value=False,
834
+ help="Generate cross-feature interactions for complex pattern detection")
835
+
836
+ # Apply feature engineering
837
+ if st.button("Extract Features"):
838
+ with st.spinner("Extracting features..."):
839
+ # Create a copy of the dataframe
840
+ df_engineered = df_processed.copy()
841
+
842
+ # Progress bar
843
+ progress_bar = st.progress(0)
844
+ status_text = st.empty()
845
+
846
+ # Time-based features
847
+ if create_time_features and 'Time' in df_engineered.columns:
848
+ status_text.text("Extracting temporal features...")
849
+ progress_bar.progress(25)
850
+ time.sleep(0.3)
851
+
852
+ # Hour of day
853
+ df_engineered['Hour'] = (df_engineered['Time'] / 3600) % 24
854
+
855
+ # Flag for transactions during odd hours (midnight to 5 AM)
856
+ df_engineered['Odd_Hour'] = ((df_engineered['Hour'] >= 0) & (df_engineered['Hour'] < 5)).astype(int)
857
+
858
+ # Part of day
859
+ df_engineered['Part_of_Day'] = pd.cut(
+ df_engineered['Hour'],
+ bins=[0, 6, 12, 18, 24],
+ labels=['Night', 'Morning', 'Afternoon', 'Evening'],
+ include_lowest=True  # without this, Hour == 0 falls outside the first bin and becomes NaN
+ )
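# Worked example for the Time-derived features above (illustrative; assumes
# 'Time' is seconds elapsed since the first transaction, as in the Kaggle
# creditcard dataset): Time = 90000 s -> 90000 / 3600 = 25 h -> 25 % 24 = 1,
# so Hour = 1, Odd_Hour = 1 (between midnight and 5 AM), Part_of_Day = 'Night'.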
864
+
865
+ # Amount-based features
866
+ if create_amount_features and 'Amount' in df_engineered.columns:
867
+ status_text.text("Extracting amount-based features...")
868
+ progress_bar.progress(50)
869
+ time.sleep(0.3)
870
+
871
+ # Log transform for amount (to handle skewed distribution)
872
+ df_engineered['Log_Amount'] = np.log1p(df_engineered['Amount'])
873
+
874
+ # Flag for high-value transactions (top 5%)
875
+ threshold = df_engineered['Amount'].quantile(0.95)
876
+ df_engineered['High_Value'] = (df_engineered['Amount'] > threshold).astype(int)
877
+
878
+ # Amount bins
879
+ df_engineered['Amount_Bin'] = pd.qcut(
880
+ df_engineered['Amount'],
881
+ q=5,
882
+ labels=['Very Low', 'Low', 'Medium', 'High', 'Very High']
883
+ )
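# Why log1p rather than log for the skewed Amount column (illustrative sketch,
# not part of the committed file): log1p(x) = log(1 + x), so zero-value
# transactions map to 0 instead of the -inf that np.log(0) would produce,
# while large amounts are still strongly compressed.
import numpy as np
print(np.log1p([0.0, 99.0, 9999.0]))  # [0.0, ~4.605, ~9.210]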
884
+
885
+ # Aggregation features
886
+ if create_aggregations:
887
+ status_text.text("Generating aggregation features...")
888
+ progress_bar.progress(75)
889
+ time.sleep(0.3)
890
+
891
+ # Check if there's a card ID or similar column
892
+ potential_id_cols = [col for col in df_engineered.columns if 'id' in col.lower() or 'card' in col.lower()]
893
+
894
+ if potential_id_cols:
895
+ id_col = potential_id_cols[0]
896
+
897
+ # Number of transactions per card
898
+ tx_count = df_engineered.groupby(id_col).size().reset_index(name='Tx_Count')
899
+ df_engineered = df_engineered.merge(tx_count, on=id_col, how='left')
900
+
901
+ # Average transaction amount per card
902
+ if 'Amount' in df_engineered.columns:
903
+ avg_amount = df_engineered.groupby(id_col)['Amount'].mean().reset_index(name='Avg_Amount')
904
+ df_engineered = df_engineered.merge(avg_amount, on=id_col, how='left')
905
+
906
+ # Transaction amount deviation from average
907
+ df_engineered['Amount_Deviation'] = df_engineered['Amount'] - df_engineered['Avg_Amount']
908
+
909
+ # Interaction features
910
+ if create_interactions:
911
+ status_text.text("Generating interaction features...")
912
+ progress_bar.progress(90)
913
+ time.sleep(0.3)
914
+
915
+ # Only create interactions between numerical features
916
+ num_cols = df_engineered.select_dtypes(include=['int64', 'float64']).columns
917
+ num_cols = [col for col in num_cols if col != target_col and 'id' not in col.lower()]
918
+
919
+ # Limit to a few important features to avoid explosion of features
920
+ if len(num_cols) > 3:
921
+ num_cols = num_cols[:3]
922
+
923
+ # Create interactions
924
+ for i in range(len(num_cols)):
925
+ for j in range(i+1, len(num_cols)):
926
+ col1_name = num_cols[i]
927
+ col2_name = num_cols[j]
928
+ df_engineered[f'{col1_name}_x_{col2_name}'] = df_engineered[col1_name] * df_engineered[col2_name]
929
+
930
+ # Convert categorical columns to one-hot encoding
931
+ cat_cols = df_engineered.select_dtypes(include=['object', 'category']).columns
932
+ for col in cat_cols:
933
+ dummies = pd.get_dummies(df_engineered[col], prefix=col, drop_first=True)
934
+ df_engineered = pd.concat([df_engineered, dummies], axis=1)
935
+ df_engineered.drop(columns=[col], inplace=True)
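# Sketch of what the one-hot loop above produces (illustrative only): with
# drop_first=True, a k-level category becomes k-1 indicator columns, which
# avoids perfectly collinear dummies for the linear model.
import pandas as pd
demo = pd.Series(['Night', 'Morning', 'Night'], name='Part_of_Day')
print(pd.get_dummies(demo, prefix='Part_of_Day', drop_first=True))
# Only Part_of_Day_Night remains; 'Morning' (first alphabetically) is the baseline.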
936
+
937
+ # Store the engineered data
938
+ status_text.text("Finalizing feature extraction...")
939
+ progress_bar.progress(100)
940
+ time.sleep(0.3)
941
+
942
+ st.session_state['engineered_data'] = df_engineered
943
+
944
+ status_text.empty()
945
+ progress_bar.empty()
946
+
947
+ st.success("Feature extraction completed!")
948
+
949
+ # Show correlation with target
950
+ if target_col in df_engineered.columns:
951
+ st.subheader("Feature Correlation Analysis")
952
+
953
+ # Get correlation with target
954
+ corr_with_target = df_engineered.corr()[target_col].sort_values(ascending=False)
955
+
956
+ # Remove target's correlation with itself
957
+ corr_with_target = corr_with_target.drop(target_col)
958
+
959
+ # Get top 10 positive and negative correlations
960
+ top_pos = corr_with_target.head(10)
961
+ top_neg = corr_with_target.tail(10).iloc[::-1] # Reverse to show strongest negative first
962
+
963
+ col1, col2 = st.columns(2)
964
+
965
+ with col1:
966
+ # Plot top positive correlations with improved styling
967
+ fig = px.bar(
968
+ x=top_pos.values,
969
+ y=top_pos.index,
970
+ orientation='h',
971
+ title='Top Positive Correlations with Fraud',
972
+ labels={'x': 'Correlation', 'y': 'Feature'},
973
+ color=top_pos.values,
974
+ color_continuous_scale=['#2E7D32', '#43A047', '#81C784']
975
+ )
976
+
977
+ fig.update_layout(
978
+ height=400,
979
+ template='plotly_white',
980
+ margin=dict(l=20, r=20, t=40, b=20)
981
+ )
982
+ st.plotly_chart(fig)
983
+
984
+ with col2:
985
+ # Plot top negative correlations with improved styling
986
+ fig = px.bar(
987
+ x=top_neg.values,
988
+ y=top_neg.index,
989
+ orientation='h',
990
+ title='Top Negative Correlations with Fraud',
991
+ labels={'x': 'Correlation', 'y': 'Feature'},
992
+ color=top_neg.values,
993
+ color_continuous_scale=['#81C784', '#43A047', '#2E7D32']
994
+ )
995
+
996
+ fig.update_layout(
997
+ height=400,
998
+ template='plotly_white',
999
+ margin=dict(l=20, r=20, t=40, b=20)
1000
+ )
1001
+ st.plotly_chart(fig)
1002
+
1003
+ # Correlation heatmap
1004
+ st.subheader("Feature Correlation Matrix")
1005
+
1006
+ # Get top correlated features
1007
+ corr_matrix = df_engineered.corr()
1008
+ top_corr_features = corr_with_target.abs().sort_values(ascending=False).head(15).index
1009
+
1010
+ # Create heatmap with selected features
1011
+ top_corr_matrix = corr_matrix.loc[top_corr_features, top_corr_features]
1012
+
1013
+ fig = px.imshow(
1014
+ top_corr_matrix,
1015
+ text_auto='.2f',
1016
+ color_continuous_scale=['#2E7D32', 'white', '#1976D2'],
1017
+ title='Feature Correlation Matrix'
1018
+ )
1019
+
1020
+ fig.update_layout(
1021
+ height=600,
1022
+ width=800,
1023
+ template='plotly_white',
1024
+ margin=dict(l=20, r=20, t=40, b=20)
1025
+ )
1026
+ st.plotly_chart(fig)
1027
+
1028
+ # Feature distributions
1029
+ st.subheader("Feature Distribution Analysis")
1030
+
1031
+ # Select a feature to visualize
1032
+ numeric_cols = df_engineered.select_dtypes(include=['int64', 'float64']).columns
1033
+ numeric_cols = [col for col in numeric_cols if col != target_col]
1034
+
1035
+ selected_feature = st.selectbox("Select feature to analyze", numeric_cols)
1036
+
1037
+ # Create distribution plot with improved styling
1038
+ fig = px.histogram(
1039
+ df_engineered,
1040
+ x=selected_feature,
1041
+ color=target_col,
1042
+ marginal="box",
1043
+ opacity=0.7,
1044
+ barmode="overlay",
1045
+ color_discrete_map={0: "#2E7D32", 1: "#D32F2F"},
1046
+ labels={target_col: "Class", "0": "Normal", "1": "Fraud"}
1047
+ )
1048
+
1049
+ fig.update_layout(
1050
+ title=f"Distribution Analysis of {selected_feature}",
1051
+ template='plotly_white',
1052
+ margin=dict(l=20, r=20, t=40, b=20)
1053
+ )
1054
+ st.plotly_chart(fig)
1055
+
1056
+ # Navigation buttons
1057
+ col1, col2 = st.columns([1, 5])
1058
+
1059
+ with col1:
1060
+ if st.button("← Back to Processing", key="back_to_preprocess"):
1061
+ st.session_state['current_page'] = 'preprocess'
1062
+ st.rerun()
1063
+
1064
+ with col2:
1065
+ if st.session_state['engineered_data'] is not None:
1066
+ if st.button("Continue to Model Training →", key="to_model_training"):
1067
+ st.session_state['current_page'] = 'model_training'
1068
+ st.rerun()
1069
+
1070
+ # Model Training Page
1071
+ elif st.session_state['current_page'] == 'model_training':
1072
+ st.markdown("<div class='animate-fade-in'><h2 class='sub-header'>Step 4: Model Training</h2></div>", unsafe_allow_html=True)
1073
+
1074
+ if st.session_state['engineered_data'] is None:
1075
+ st.error("No engineered data found. Please complete feature extraction first.")
1076
+ if st.button("Go back to Feature Extraction"):
1077
+ st.session_state['current_page'] = 'feature_engineering'
1078
+ st.rerun()
1079
+ else:
1080
+ df_engineered = st.session_state['engineered_data']
1081
+ target_col = st.session_state['target_col']
1082
+
1083
+ st.markdown("""
1084
+ ### Advanced Model Training
1085
+
1086
+ Train sophisticated machine learning models for fraud detection. Our system provides:
1087
+ - Multiple model architectures
1088
+ - Automated hyperparameter optimization
1089
+ - Cross-validation for robust evaluation
1090
+ - Performance metrics visualization
1091
+
1092
+ Select your preferred models and training parameters below.
1093
+ """)
1094
+
1095
+ # Training options
1096
+ st.subheader("Training Configuration")
1097
+
1098
+ col1, col2 = st.columns(2)
1099
+
1100
+ with col1:
1101
+ # Data sampling for faster training - default to a smaller sample for speed
1102
+ use_sample = st.checkbox("Use Data Sample for Faster Training", value=True,
1103
+ help="Use a sample of the data to speed up training (recommended for large datasets)")
1104
+
1105
+ if use_sample:
1106
+ sample_size = st.slider("Sample Size (%)", min_value=10, max_value=100, value=20,
1107
+ help="Percentage of data to use for training")
1108
+
1109
+ # Test size
1110
+ test_size = st.slider("Test Set Size (%)", min_value=10, max_value=50, value=20,
1111
+ help="Percentage of data to use for testing")
1112
+
1113
+ # Class balancing
1114
+ use_smote = st.checkbox("Apply SMOTE for Class Balancing", value=True,
1115
+ help="Use SMOTE to handle class imbalance")
1116
+
1117
+ with col2:
1118
+ # Model selection
1119
+ st.write("Select Models to Train:")
1120
+ train_lr = st.checkbox("Logistic Regression", value=True)
1121
+ train_rf = st.checkbox("Random Forest", value=True)
1122
+ train_xgb = st.checkbox("XGBoost", value=True)
1123
+
1124
+ # Advanced options - reduced default values for faster training
1125
+ show_advanced = st.checkbox("Show Advanced Options", value=False)
1126
+
1127
+ if show_advanced:
1128
+ # Number of estimators for tree models - reduced for speed
1129
+ n_estimators = st.slider("Number of Estimators", min_value=10, max_value=200, value=50,
1130
+ help="Number of trees for Random Forest and XGBoost (higher = more accurate but slower)")
1131
+
1132
+ # Max depth for tree models
1133
+ max_depth = st.slider("Max Tree Depth", min_value=3, max_value=15, value=6,
1134
+ help="Maximum depth of trees (higher = more complex model)")
1135
+
1136
+ # Start training
1137
+ if st.button("Train Models"):
1138
+ with st.spinner("Training models..."):
1139
+ status_container = st.empty()
1140
+ status_container.markdown(
1141
+ '<div class="loading-pulse">Training in progress... This may take a few minutes.</div>',
1142
+ unsafe_allow_html=True
1143
+ )
1144
+ # Prepare data for training
1145
+ X = df_engineered.drop(columns=[target_col])
1146
+ y = df_engineered[target_col]
1147
+
1148
+ # Use sample if selected
1149
+ if use_sample and sample_size < 100:
1150
+ sample_frac = sample_size / 100
1151
+ # Stratified sampling to maintain class distribution
1152
+ X_sample = pd.DataFrame()
1153
+ y_sample = pd.Series(dtype=y.dtype)  # give the empty Series a dtype to avoid pandas' FutureWarning
1154
+
1155
+ for class_value in y.unique():
1156
+ X_class = X[y == class_value]
1157
+ y_class = y[y == class_value]
1158
+
1159
+ n_samples = int(len(X_class) * sample_frac)
1160
+ indices = np.random.choice(X_class.index, size=n_samples, replace=False)
1161
+
1162
+ X_sample = pd.concat([X_sample, X_class.loc[indices]])
1163
+ y_sample = pd.concat([y_sample, y_class.loc[indices]])
1164
+
1165
+ X = X_sample
1166
+ y = y_sample
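# Design note (illustrative alternative, not part of the committed file): the
# per-class loop above is equivalent to pandas' grouped sampling, which also
# preserves the class ratio:
#   idx = y.groupby(y).sample(frac=sample_frac, random_state=42).index
#   X, y = X.loc[idx], y.loc[idx]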
1167
+
1168
+ # Progress bar
1169
+ progress_bar = st.progress(0)
1170
+ status_text = st.empty()
1171
+
1172
+ status_text.text("Preparing training data...")
1173
+ progress_bar.progress(10)
1174
+
1175
+ # Split data
1176
+ X_train, X_test, y_train, y_test = train_test_split(
1177
+ X, y, test_size=test_size/100, random_state=42, stratify=y
1178
+ )
1179
+
1180
+ status_text.text("Scaling features...")
1181
+ progress_bar.progress(20)
1182
+
1183
+ # Scale features
1184
+ scaler = StandardScaler()
1185
+ X_train_scaled = scaler.fit_transform(X_train)
1186
+ X_test_scaled = scaler.transform(X_test)
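# Note on the order of operations above: the scaler is fit on the training
# split only and merely applied to the test split, so test-set statistics
# cannot leak into training; fitting on the full data before splitting would
# bias the reported metrics upward.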
1187
+
1188
+ # Handle class imbalance with SMOTE if selected
1189
+ if use_smote:
1190
+ status_text.text("Applying SMOTE for class balancing...")
1191
+ progress_bar.progress(30)
1192
+
1193
+ smote = SMOTE(random_state=42)
1194
+ X_train_resampled, y_train_resampled = smote.fit_resample(X_train_scaled, y_train)
1195
+ else:
1196
+ X_train_resampled, y_train_resampled = X_train_scaled, y_train
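# What SMOTE does here, roughly: it synthesizes new minority-class rows by
# interpolating between a fraud sample and its nearest fraud-class neighbors
# until the classes are balanced. A quick way to see the effect (illustrative):
from collections import Counter
print("before:", Counter(y_train))             # heavily skewed toward class 0
print("after: ", Counter(y_train_resampled))   # minority upsampled to match class 0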
1197
+
1198
+ # Save preprocessor
1199
+ with open("models/scaler.pkl", "wb") as f:
1200
+ pickle.dump(scaler, f)
1201
+
1202
+ # Save feature columns
1203
+ with open("models/feature_columns.pkl", "wb") as f:
1204
+ pickle.dump(X.columns.tolist(), f)
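# Inference-side counterpart of the pickles saved above -- a minimal sketch of
# how a separate process could reuse them (illustrative; 'new_df' is a
# hypothetical DataFrame of incoming transactions):
#   with open("models/scaler.pkl", "rb") as f:
#       scaler = pickle.load(f)
#   with open("models/feature_columns.pkl", "rb") as f:
#       cols = pickle.load(f)
#   X_new = scaler.transform(new_df[cols])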
1205
+
1206
+ # Initialize results list
1207
+ results = []
1208
+ trained_models = {}
1209
+
1210
+ # Train selected models
1211
+ if train_lr:
1212
+ status_text.text("Training Logistic Regression...")
1213
+ progress_bar.progress(40)
1214
+
1215
+ # Train Logistic Regression
1216
+ lr_model = LogisticRegression(max_iter=1000, class_weight='balanced')
1217
+ lr_model.fit(X_train_resampled, y_train_resampled)
1218
+
1219
+ # Make predictions
1220
+ y_pred = lr_model.predict(X_test_scaled)
1221
+ y_pred_proba = lr_model.predict_proba(X_test_scaled)[:, 1]
1222
+
1223
+ # Calculate metrics
1224
+ accuracy = accuracy_score(y_test, y_pred)
1225
+ precision = precision_score(y_test, y_pred)
1226
+ recall = recall_score(y_test, y_pred)
1227
+ f1 = f1_score(y_test, y_pred)
1228
+ auc = roc_auc_score(y_test, y_pred_proba)
1229
+ cm = confusion_matrix(y_test, y_pred)
1230
+
1231
+ # Store results
1232
+ lr_results = {
1233
+ 'model_name': 'Logistic Regression',
1234
+ 'model': lr_model,
1235
+ 'accuracy': accuracy,
1236
+ 'precision': precision,
1237
+ 'recall': recall,
1238
+ 'f1_score': f1,
1239
+ 'auc': auc,
1240
+ 'confusion_matrix': cm,
1241
+ 'y_test': y_test,
1242
+ 'y_pred_proba': y_pred_proba
1243
+ }
1244
+
1245
+ results.append(lr_results)
1246
+ trained_models['lr'] = lr_model
1247
+
1248
+ # Save model
1249
+ with open("models/logistic_regression.pkl", "wb") as f:
1250
+ pickle.dump(lr_model, f)
1251
+
1252
+ if train_rf:
1253
+ status_text.text("Training Random Forest...")
1254
+ progress_bar.progress(60)
1255
+
1256
+ # Get parameters - use smaller values for speed
1257
+ n_est = n_estimators if show_advanced else 50
1258
+ m_depth = max_depth if show_advanced else 6
1259
+
1260
+ # Train Random Forest
1261
+ rf_model = RandomForestClassifier(
1262
+ n_estimators=n_est,
1263
+ max_depth=m_depth,
1264
+ class_weight='balanced',
1265
+ random_state=42
1266
+ )
1267
+ rf_model.fit(X_train_resampled, y_train_resampled)
1268
+
1269
+ # Make predictions
1270
+ y_pred = rf_model.predict(X_test_scaled)
1271
+ y_pred_proba = rf_model.predict_proba(X_test_scaled)[:, 1]
1272
+
1273
+ # Calculate metrics
1274
+ accuracy = accuracy_score(y_test, y_pred)
1275
+ precision = precision_score(y_test, y_pred)
1276
+ recall = recall_score(y_test, y_pred)
1277
+ f1 = f1_score(y_test, y_pred)
1278
+ auc = roc_auc_score(y_test, y_pred_proba)
1279
+ cm = confusion_matrix(y_test, y_pred)
1280
+
1281
+ # Store results
1282
+ rf_results = {
1283
+ 'model_name': 'Random Forest',
1284
+ 'model': rf_model,
1285
+ 'accuracy': accuracy,
1286
+ 'precision': precision,
1287
+ 'recall': recall,
1288
+ 'f1_score': f1,
1289
+ 'auc': auc,
1290
+ 'confusion_matrix': cm,
1291
+ 'y_test': y_test,
1292
+ 'y_pred_proba': y_pred_proba
1293
+ }
1294
+
1295
+ results.append(rf_results)
1296
+ trained_models['rf'] = rf_model
1297
+
1298
+ # Save model
1299
+ with open("models/random_forest.pkl", "wb") as f:
1300
+ pickle.dump(rf_model, f)
1301
+
1302
+ if train_xgb:
1303
+ status_text.text("Training XGBoost...")
1304
+ progress_bar.progress(80)
1305
+
1306
+ # Get parameters - use smaller values for speed
1307
+ n_est = n_estimators if show_advanced else 50
1308
+ m_depth = max_depth if show_advanced else 6
1309
+
1310
+ # Train XGBoost
1311
+ xgb_model = XGBClassifier(
1312
+ n_estimators=n_est,
1313
+ max_depth=m_depth,
1314
+ scale_pos_weight=10,  # fixed guess; the negative/positive count ratio is the usual data-driven choice
+ random_state=42,
+ use_label_encoder=False,  # deprecated in recent XGBoost releases and ignored there; harmless for older versions
1317
+ eval_metric='logloss'
1318
+ )
1319
+ xgb_model.fit(X_train_resampled, y_train_resampled)
1320
+
1321
+ # Make predictions
1322
+ y_pred = xgb_model.predict(X_test_scaled)
1323
+ y_pred_proba = xgb_model.predict_proba(X_test_scaled)[:, 1]
1324
+
1325
+ # Calculate metrics
1326
+ accuracy = accuracy_score(y_test, y_pred)
1327
+ precision = precision_score(y_test, y_pred)
1328
+ recall = recall_score(y_test, y_pred)
1329
+ f1 = f1_score(y_test, y_pred)
1330
+ auc = roc_auc_score(y_test, y_pred_proba)
1331
+ cm = confusion_matrix(y_test, y_pred)
1332
+
1333
+ # Store results
1334
+ xgb_results = {
1335
+ 'model_name': 'XGBoost',
1336
+ 'model': xgb_model,
1337
+ 'accuracy': accuracy,
1338
+ 'precision': precision,
1339
+ 'recall': recall,
1340
+ 'f1_score': f1,
1341
+ 'auc': auc,
1342
+ 'confusion_matrix': cm,
1343
+ 'y_test': y_test,
1344
+ 'y_pred_proba': y_pred_proba
1345
+ }
1346
+
1347
+ results.append(xgb_results)
1348
+ trained_models['xgb'] = xgb_model
1349
+
1350
+ # Save model
1351
+ with open("models/xgboost.pkl", "wb") as f:
1352
+ pickle.dump(xgb_model, f)
1353
+
1354
+ # Save test data
1355
+ with open("models/test_data.pkl", "wb") as f:
1356
+ pickle.dump({"X_test": X_test_scaled, "y_test": y_test}, f)
1357
+
1358
+ st.session_state['trained_models'] = trained_models
1359
+
1360
+ # Automatically make predictions on the original dataset
1361
+ status_text.text("Generating predictions...")
1362
+ progress_bar.progress(90)
1363
+
1364
+ # Find the best model based on F1 score (good for imbalanced data)
1365
+ best_model = None
1366
+ best_f1 = -1
1367
+ best_model_name = ""
1368
+
1369
+ for result in results:
1370
+ if result['f1_score'] > best_f1:
1371
+ best_f1 = result['f1_score']
1372
+ best_model = result['model']
1373
+ best_model_name = result['model_name']
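# Why F1 drives the selection above: F1 = 2*P*R / (P + R), the harmonic mean
# of precision and recall, so neither can be traded away cheaply. Worked
# example: P = 0.90, R = 0.60 gives F1 = 1.08 / 1.50 = 0.72, whereas plain
# accuracy can approach 99%+ on data this imbalanced even for a model that
# predicts "normal" for every transaction.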
1374
+
1375
+ if best_model is not None:
1376
+ # Prepare full dataset for prediction
1377
+ X_full = df_engineered.drop(columns=[target_col])
1378
+
1379
+ # Scale the data
1380
+ X_full_scaled = scaler.transform(X_full)
1381
+
1382
+ # Make predictions
1383
+ y_pred = best_model.predict(X_full_scaled)
1384
+ y_pred_proba = best_model.predict_proba(X_full_scaled)[:, 1]
1385
+
1386
+ # Add predictions to the dataframe
1387
+ df_with_predictions = df_engineered.copy()
1388
+ df_with_predictions['Fraud_Probability'] = y_pred_proba
1389
+ df_with_predictions['Predicted_Fraud'] = y_pred
1390
+
1391
+ # Store predictions
1392
+ st.session_state['predictions'] = {
1393
+ 'df': df_with_predictions,
1394
+ 'model_name': best_model_name,
1395
+ 'results': results
1396
+ }
1397
+
1398
+ status_text.text("Training completed!")
1399
+ progress_bar.progress(100)
1400
+ time.sleep(0.3)
1401
+
1402
+ status_text.empty()
1403
+ progress_bar.empty()
1404
+
1405
+ st.success("Models trained successfully!")
1406
+
1407
+ # Display comparison of results
1408
+ if results:
1409
+ st.subheader("Model Performance Analysis")
1410
+
1411
+ # Create comparison table
1412
+ comparison_df = pd.DataFrame([
1413
+ {
1414
+ 'Model': r['model_name'],
1415
+ 'Accuracy': r['accuracy'],
1416
+ 'Precision': r['precision'],
1417
+ 'Recall': r['recall'],
1418
+ 'F1 Score': r['f1_score'],
1419
+ 'AUC': r['auc']
1420
+ } for r in results
1421
+ ])
1422
+
1423
+ st.dataframe(comparison_df.style.highlight_max(axis=0, color='#81C784'))
1424
+
1425
+ # Plot metrics comparison with improved styling
1426
+ fig = px.bar(
1427
+ comparison_df.melt(id_vars=['Model'], var_name='Metric', value_name='Value'),
1428
+ x='Model',
1429
+ y='Value',
1430
+ color='Metric',
1431
+ barmode='group',
1432
+ title='Model Performance Comparison',
1433
+ labels={'Value': 'Score', 'Model': 'Model'},
1434
+ color_discrete_sequence=['#2E7D32', '#43A047', '#81C784', '#1976D2', '#D32F2F']
1435
+ )
1436
+
1437
+ fig.update_layout(
1438
+ height=500,
1439
+ template='plotly_white',
1440
+ margin=dict(l=20, r=20, t=40, b=20)
1441
+ )
1442
+ st.plotly_chart(fig)
1443
+
1444
+ # Plot ROC curves with improved styling
1445
+ st.subheader("ROC Curve Analysis")
1446
+
1447
+ fig = go.Figure()
1448
+
1449
+ colors = ['#2E7D32', '#1976D2', '#D32F2F']
1450
+
1451
+ for i, result in enumerate(results):
1452
+ model_name = result['model_name']
1453
+ y_test = result['y_test']
1454
+ y_pred_proba = result['y_pred_proba']
1455
+
1456
+ fpr, tpr, _ = roc_curve(y_test, y_pred_proba)
1457
+ auc = result['auc']
1458
+
1459
+ fig.add_trace(go.Scatter(
1460
+ x=fpr,
1461
+ y=tpr,
1462
+ mode='lines',
1463
+ name=f'{model_name} (AUC = {auc:.3f})',
1464
+ line=dict(color=colors[i % len(colors)], width=3)
1465
+ ))
1466
+
1467
+ fig.add_trace(go.Scatter(
1468
+ x=[0, 1],
1469
+ y=[0, 1],
1470
+ mode='lines',
1471
+ name='Random',
1472
+ line=dict(dash='dash', color='#757575', width=2)
1473
+ ))
1474
+
1475
+ fig.update_layout(
1476
+ title='ROC Curve Analysis',
1477
+ xaxis_title='False Positive Rate',
1478
+ yaxis_title='True Positive Rate',
1479
+ legend=dict(x=0.01, y=0.99),
1480
+ height=500,
1481
+ template='plotly_white',
1482
+ margin=dict(l=20, r=20, t=40, b=20)
1483
+ )
1484
+
1485
+ st.plotly_chart(fig)
1486
+
1487
+ # Show confusion matrices with improved styling
1488
+ st.subheader("Confusion Matrix Analysis")
1489
+
1490
+ cols = st.columns(len(results))
1491
+
1492
+ for i, result in enumerate(results):
1493
+ with cols[i]:
1494
+ model_name = result['model_name']
1495
+ cm = result['confusion_matrix']
1496
+
1497
+ # Calculate percentages
1498
+ cm_percent = cm / cm.sum()
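# Note: cm / cm.sum() normalizes by the grand total, so each cell shows its
# share of all test transactions; row-wise normalization,
# cm / cm.sum(axis=1, keepdims=True), would instead show per-class recall.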
1499
+
1500
+ # Create annotation text
1501
+ annotations = []
1502
+ for i in range(cm.shape[0]):
1503
+ for j in range(cm.shape[1]):
1504
+ annotations.append({
1505
+ 'x': j,
1506
+ 'y': i,
1507
+ 'text': f"{cm[i, j]}<br>({cm_percent[i, j]:.1%})",
1508
+ 'showarrow': False,
1509
+ 'font': {'color': 'white' if cm_percent[i, j] > 0.5 else 'black'}
1510
+ })
1511
+
1512
+ # Create heatmap
1513
+ fig = go.Figure(data=go.Heatmap(
1514
+ z=cm,
1515
+ x=['Predicted Normal', 'Predicted Fraud'],
1516
+ y=['Actual Normal', 'Actual Fraud'],
1517
+ colorscale=[[0, '#81C784'], [1, '#2E7D32']],
1518
+ showscale=False
1519
+ ))
1520
+
1521
+ fig.update_layout(
1522
+ title=f"{model_name}",
1523
+ annotations=annotations,
1524
+ height=300,
1525
+ template='plotly_white',
1526
+ margin=dict(l=20, r=20, t=40, b=20)
1527
+ )
1528
+
1529
+ st.plotly_chart(fig)
1530
+
1531
+ # Feature importance for tree-based models with improved styling
1532
+ st.subheader("Feature Importance Analysis")
1533
+
1534
+ for result in results:
1535
+ model_name = result['model_name']
1536
+ model = result['model']
1537
+
1538
+ if model_name in ['Random Forest', 'XGBoost']:
1539
+ # Get feature importance
1540
+ if hasattr(model, 'feature_importances_'):
1541
+ importances = model.feature_importances_
1542
+ feature_names = X.columns
1543
+
1544
+ # Sort by importance
1545
+ indices = np.argsort(importances)[::-1]
1546
+ top_indices = indices[:10] # Show top 10 features for speed
1547
+
1548
+ # Create bar chart
1549
+ fig = px.bar(
1550
+ x=importances[top_indices],
1551
+ y=[feature_names[i] for i in top_indices],
1552
+ orientation='h',
1553
+ title=f'Top Features - {model_name}',
1554
+ labels={'x': 'Importance', 'y': 'Feature'},
1555
+ color=importances[top_indices],
1556
+ color_continuous_scale=['#81C784', '#43A047', '#2E7D32']
1557
+ )
1558
+
1559
+ fig.update_layout(
1560
+ height=400,
1561
+ template='plotly_white',
1562
+ margin=dict(l=20, r=20, t=40, b=20)
1563
+ )
1564
+ st.plotly_chart(fig)
1565
+
1566
+ # Navigation buttons
1567
+ col1, col2 = st.columns([1, 5])
1568
+
1569
+ with col1:
1570
+ if st.button("← Back to Feature Extraction", key="back_to_feature_eng"):
1571
+ st.session_state['current_page'] = 'feature_engineering'
1572
+ st.rerun()
1573
+
1574
+ with col2:
1575
+ if st.session_state['predictions'] is not None:
1576
+ if st.button("Continue to Results →", key="to_results"):
1577
+ st.session_state['current_page'] = 'results'
1578
+ st.rerun()
1579
+
1580
+ # Fraud Detection Results Page
1581
+ elif st.session_state['current_page'] == 'results':
1582
+ st.markdown("<div class='animate-fade-in'><h2 class='sub-header'>Step 5: Fraud Detection Results</h2></div>", unsafe_allow_html=True)
1583
+
1584
+ if st.session_state['predictions'] is None:
1585
+ st.error("No predictions found. Please complete model training first.")
1586
+ if st.button("Go back to Model Training"):
1587
+ st.session_state['current_page'] = 'model_training'
1588
+ st.rerun()
1589
+ else:
1590
+ predictions = st.session_state['predictions']
1591
+ df_with_predictions = predictions['df']
1592
+ model_name = predictions['model_name']
1593
+
1594
+ st.markdown(f"<h3 class='sub-header'>Fraud Detection Results using {model_name}</h3>", unsafe_allow_html=True)
1595
+
1596
+ # Summary of predictions
1597
+ fraud_count = df_with_predictions['Predicted_Fraud'].sum()
1598
+ total_count = len(df_with_predictions)
1599
+ fraud_percentage = (fraud_count / total_count) * 100
1600
+
1601
+ # Create metrics display with improved styling
1602
+ col1, col2, col3 = st.columns(3)
1603
+
1604
+ with col1:
1605
+ st.metric(
1606
+ label="Total Transactions",
1607
+ value=f"{total_count:,}",
1608
+ delta=None
1609
+ )
1610
+
1611
+ with col2:
1612
+ st.metric(
1613
+ label="Detected Frauds",
1614
+ value=f"{fraud_count:,}",
1615
+ delta=None
1616
+ )
1617
+
1618
+ with col3:
1619
+ st.metric(
1620
+ label="Fraud Percentage",
1621
+ value=f"{fraud_percentage:.2f}%",
1622
+ delta=None
1623
+ )
1624
+
1625
+ # Visualization of fraud distribution with improved styling
1626
+ st.subheader("Fraud Probability Distribution")
1627
+
1628
+ fig = px.histogram(
1629
+ df_with_predictions,
1630
+ x='Fraud_Probability',
1631
+ nbins=50,
1632
+ color='Predicted_Fraud',
1633
+ color_discrete_map={0: "#6200EA", 1: "#D50000"},
1634
+ labels={'Predicted_Fraud': 'Prediction', '0': 'Normal', '1': 'Fraud'},
1635
+ title='Distribution of Fraud Probabilities',
1636
+ marginal='box'
1637
+ )
1638
+
1639
+ fig.update_layout(
1640
+ height=500,
1641
+ template='plotly_white',
1642
+ margin=dict(l=20, r=20, t=40, b=20)
1643
+ )
1644
+ st.plotly_chart(fig)
1645
+
1646
+ # Show high probability transactions
1647
+ st.subheader("High Fraud Probability Transactions")
1648
+
1649
+ # Slider for probability threshold
1650
+ threshold = st.slider(
1651
+ "Fraud Probability Threshold",
1652
+ min_value=0.5,
1653
+ max_value=0.95,
1654
+ value=0.7,
1655
+ step=0.05,
1656
+ help="Transactions with fraud probability above this threshold will be shown"
1657
+ )
1658
+
1659
+ high_prob_df = df_with_predictions[df_with_predictions['Fraud_Probability'] > threshold]
1660
+
1661
+ if len(high_prob_df) > 0:
1662
+ st.write(f"Found {len(high_prob_df)} transactions with fraud probability > {threshold}")
1663
+
1664
+ # Sort by probability
1665
+ high_prob_df = high_prob_df.sort_values('Fraud_Probability', ascending=False)
1666
+
1667
+ # Select columns to display
1668
+ display_cols = ['Fraud_Probability', 'Predicted_Fraud']
1669
+
1670
+ # Add original features
1671
+ if 'Amount' in high_prob_df.columns:
1672
+ display_cols.insert(0, 'Amount')
1673
+
1674
+ if 'Time' in high_prob_df.columns:
1675
+ display_cols.insert(0, 'Time')
1676
+
1677
+ # Add target column if it exists
1678
+ if st.session_state['target_col'] in high_prob_df.columns:
1679
+ display_cols.append(st.session_state['target_col'])
1680
+
1681
+ # Display dataframe
1682
+ st.dataframe(high_prob_df[display_cols])
1683
+
1684
+ # Download button
1685
+ csv = high_prob_df.to_csv(index=False)
1686
+ st.download_button(
1687
+ label="Download High Risk Transactions",
1688
+ data=csv,
1689
+ file_name="high_risk_transactions.csv",
1690
+ mime="text/csv"
1691
+ )
1692
+ else:
1693
+ st.info(f"No transactions found with fraud probability > {threshold}")
1694
+ # Show top 10 highest probability transactions instead
1695
+ st.write("Top 10 highest fraud probability transactions:")
1696
+ st.dataframe(df_with_predictions.sort_values('Fraud_Probability', ascending=False).head(10))
1697
+
1698
+ # Compare actual vs predicted (if actual labels exist)
1699
+ target_col = st.session_state['target_col']
1700
+ if target_col in df_with_predictions.columns:
1701
+ st.subheader("Actual vs Predicted Fraud")
1702
+
1703
+ # Confusion matrix with improved styling
1704
+ cm = confusion_matrix(df_with_predictions[target_col], df_with_predictions['Predicted_Fraud'])
1705
+
1706
+ # Calculate percentages
1707
+ cm_percent = cm / cm.sum()
1708
+
1709
+ # Create annotation text
1710
+ annotations = []
1711
+ for i in range(cm.shape[0]):
1712
+ for j in range(cm.shape[1]):
1713
+ annotations.append({
1714
+ 'x': j,
1715
+ 'y': i,
1716
+ 'text': f"{cm[i, j]}<br>({cm_percent[i, j]:.1%})",
1717
+ 'showarrow': False,
1718
+ 'font': {'color': 'white' if cm_percent[i, j] > 0.5 else 'black'}
1719
+ })
1720
+
1721
+ # Create heatmap
1722
+ fig = go.Figure(data=go.Heatmap(
1723
+ z=cm,
1724
+ x=['Predicted Normal', 'Predicted Fraud'],
1725
+ y=['Actual Normal', 'Actual Fraud'],
1726
+ colorscale=[[0, '#81C784'], [1, '#2E7D32']],
1727
+ showscale=False
1728
+ ))
1729
+
1730
+ fig.update_layout(
1731
+ title=f"Confusion Matrix - {model_name}",
1732
+ annotations=annotations,
1733
+ height=400,
1734
+ template='plotly_white',
1735
+ margin=dict(l=20, r=20, t=40, b=20)
1736
+ )
1737
+
1738
+ st.plotly_chart(fig)
1739
+
1740
+ # Calculate metrics
1741
+ accuracy = accuracy_score(df_with_predictions[target_col], df_with_predictions['Predicted_Fraud'])
1742
+
1743
+ # Calculate metrics
1744
+ precision = precision_score(df_with_predictions[target_col], df_with_predictions['Predicted_Fraud'])
1745
+ recall = recall_score(df_with_predictions[target_col], df_with_predictions['Predicted_Fraud'])
1746
+ f1 = f1_score(df_with_predictions[target_col], df_with_predictions['Predicted_Fraud'])
1747
+
1748
+ # Display metrics with improved styling
1749
+ st.subheader("Performance Metrics on Full Dataset")
1750
+
1751
+ col1, col2, col3, col4 = st.columns(4)
1752
+
1753
+ with col1:
1754
+ st.metric(
1755
+ label="Accuracy",
1756
+ value=f"{accuracy:.4f}",
1757
+ delta=None
1758
+ )
1759
+
1760
+ with col2:
1761
+ st.metric(
1762
+ label="Precision",
1763
+ value=f"{precision:.4f}",
1764
+ delta=None
1765
+ )
1766
+
1767
+ with col3:
1768
+ st.metric(
1769
+ label="Recall",
1770
+ value=f"{recall:.4f}",
1771
+ delta=None
1772
+ )
1773
+
1774
+ with col4:
1775
+ st.metric(
1776
+ label="F1 Score",
1777
+ value=f"{f1:.4f}",
1778
+ delta=None
1779
+ )
1780
+
1781
+ # Download all predictions
1782
+ st.subheader("Download Results")
1783
+
1784
+ csv = df_with_predictions.to_csv(index=False)
1785
+ st.download_button(
1786
+ label="Download All Predictions as CSV",
1787
+ data=csv,
1788
+ file_name="fraud_predictions.csv",
1789
+ mime="text/csv"
1790
+ )
1791
+
1792
+ # Navigation buttons
1793
+ col1, col2 = st.columns([1, 5])
1794
+
1795
+ with col1:
1796
+ if st.button("← Back to Model Training", key="back_to_model_training"):
1797
+ st.session_state['current_page'] = 'model_training'
1798
+ st.rerun()
1799
+
1800
+ with col2:
1801
+ if st.button("Start Over", key="start_over"):
1802
+ # Reset session state
1803
+ for key in list(st.session_state.keys()):
1804
+ del st.session_state[key]
1805
+ st.session_state['current_page'] = 'home'
1806
+ st.rerun()
1807
+
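The per-cell count/percentage annotation logic above is written out twice in app.py (once per heatmap). As a sketch only, it could be factored into a small helper; the name cm_annotations is hypothetical and not part of this upload:

    def cm_annotations(cm):
        # Hypothetical helper, not part of this upload. Builds the Plotly
        # annotation dicts used by the confusion-matrix heatmaps above;
        # cm is assumed to be a 2D numpy array from sklearn's confusion_matrix.
        cm_percent = cm / cm.sum()
        annotations = []
        for r in range(cm.shape[0]):
            for c in range(cm.shape[1]):
                annotations.append({
                    'x': c,
                    'y': r,
                    'text': f"{cm[r, c]}<br>({cm_percent[r, c]:.1%})",
                    'showarrow': False,
                    'font': {'color': 'white' if cm_percent[r, c] > 0.5 else 'black'},
                })
        return annotations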
data_exploration.py ADDED
@@ -0,0 +1,139 @@
+ # pages/data_exploration.py
+ import streamlit as st
+ import pandas as pd
+ import numpy as np
+ import matplotlib.pyplot as plt
+ import seaborn as sns
+ import os
+ from utils.data_processor import DataProcessor
+ from utils.visualizer import Visualizer
+
+ def app():
+     st.title("Data Exploration")
+
+     # Initialize classes
+     data_processor = DataProcessor()
+     visualizer = Visualizer()
+
+     # Load data function
+     @st.cache_data
+     def load_data():
+         # Check if the default dataset exists in the data directory
+         data_path = "data/creditcard.csv"
+         if os.path.exists(data_path):
+             return pd.read_csv(data_path)
+         else:
+             st.warning("Default dataset not found. Please upload a dataset.")
+             return None
+
+     # Load data
+     df = load_data()
+     if df is None:
+         uploaded_file = st.file_uploader("Upload a CSV file", type="csv")
+         if uploaded_file is not None:
+             df = pd.read_csv(uploaded_file)
+             os.makedirs("data", exist_ok=True)  # ensure the data directory exists
+             df.to_csv("data/uploaded_data.csv", index=False)
+
+     if df is not None:
+         st.write(f"Dataset shape: {df.shape[0]} rows and {df.shape[1]} columns")
+
+         # Data overview
+         st.header("Data Overview")
+         st.write(df.head())
+
+         # Data information
+         st.header("Data Information")
+         buffer = pd.DataFrame({
+             'Column': df.columns,
+             'Type': df.dtypes,
+             'Non-Null Count': df.count(),
+             'Null Count': df.isnull().sum(),
+             'Unique Values': [df[col].nunique() for col in df.columns]
+         })
+         st.write(buffer)
+
+         # Statistical summary
+         st.header("Statistical Summary")
+         st.write(df.describe())
+
+         # Class distribution
+         st.header("Class Distribution")
+         if 'Class' in df.columns:
+             fig = visualizer.plot_class_distribution(df)
+             st.pyplot(fig)
+
+             # Calculate fraud percentage
+             fraud_percentage = df['Class'].mean() * 100
+             st.write(f"Fraud transactions: {fraud_percentage:.2f}% of the dataset")
+         else:
+             st.warning("No 'Class' column found in the dataset. Please ensure your target variable is named 'Class'.")
+
+         # Feature distributions
+         st.header("Feature Distributions")
+         num_features = st.slider("Number of features to display", 1, min(10, len(df.columns) - 1), 5)
+         fig = visualizer.plot_feature_distributions(df, n_features=num_features)
+         st.pyplot(fig)
+
+         # Correlation matrix
+         st.header("Correlation Matrix")
+         fig = visualizer.plot_correlation_matrix(df)
+         st.pyplot(fig)
+
+         # Transaction amount analysis
+         if 'Amount' in df.columns:
+             st.header("Transaction Amount Analysis")
+
+             col1, col2 = st.columns(2)
+
+             with col1:
+                 st.subheader("Amount Distribution")
+                 fig, ax = plt.subplots(figsize=(10, 6))
+                 sns.histplot(data=df, x='Amount', bins=50, kde=True, ax=ax)
+                 st.pyplot(fig)
+
+             with col2:
+                 if 'Class' in df.columns:
+                     st.subheader("Amount by Class")
+                     fig, ax = plt.subplots(figsize=(10, 6))
+                     sns.boxplot(x='Class', y='Amount', data=df, ax=ax)
+                     st.pyplot(fig)
+
+         # Time analysis
+         if 'Time' in df.columns:
+             st.header("Transaction Time Analysis")
+
+             # Convert seconds to hour of day
+             df_time = df.copy()
+             df_time['Hour'] = (df_time['Time'] / 3600) % 24
+
+             fig, ax = plt.subplots(figsize=(12, 6))
+             if 'Class' in df.columns:
+                 sns.histplot(data=df_time, x='Hour', hue='Class', bins=24, kde=True, ax=ax)
+             else:
+                 sns.histplot(data=df_time, x='Hour', bins=24, kde=True, ax=ax)
+             plt.title('Transaction Distribution by Hour of Day')
+             plt.xlabel('Hour of Day')
+             plt.ylabel('Number of Transactions')
+             st.pyplot(fig)
+
+         # Feature analysis for fraud detection
+         if 'Class' in df.columns:
+             st.header("Feature Analysis for Fraud Detection")
+
+             # Select top features correlated with fraud (numeric columns only,
+             # so uploads containing non-numeric columns don't break corr())
+             corr_with_fraud = df.select_dtypes(include=np.number).corr()['Class'].sort_values(ascending=False)
+             top_features = corr_with_fraud[1:6].index.tolist()  # skip 'Class' itself
+
+             st.subheader("Top Features Correlated with Fraud")
+             st.write(corr_with_fraud[1:11])  # show top 10 correlations
+
+             # Plot distributions of top features by fraud/non-fraud
+             st.subheader("Distributions of Top Features by Class")
+             for feature in top_features:
+                 fig, ax = plt.subplots(figsize=(10, 6))
+                 sns.histplot(data=df, x=feature, hue='Class', bins=50, kde=True, ax=ax)
+                 plt.title(f'Distribution of {feature} by Class')
+                 st.pyplot(fig)
+
+ if __name__ == "__main__":
+     app()
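Since each page module exposes an app() entry point, a top-level router can dispatch to it. A minimal sketch, assuming pages/ is importable as a package (the router itself is not part of this upload):

    import streamlit as st
    from pages import data_exploration  # assumed package layout with __init__.py

    PAGES = {"Data Exploration": data_exploration.app}

    choice = st.sidebar.selectbox("Page", list(PAGES))
    PAGES[choice]()  # run the selected page's app() entry point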
data_processor.py ADDED
@@ -0,0 +1,115 @@
+ import pandas as pd
+ import numpy as np
+ from sklearn.preprocessing import StandardScaler, OneHotEncoder
+ from sklearn.compose import ColumnTransformer
+ from sklearn.pipeline import Pipeline
+ from imblearn.over_sampling import SMOTE
+ from sklearn.model_selection import train_test_split
+ from sklearn import __version__ as sklearn_version
+ from packaging import version
+
+ class DataProcessor:
+     def __init__(self):
+         self.scaler = StandardScaler()
+
+         # Handle the sparse -> sparse_output rename across scikit-learn versions
+         if version.parse(sklearn_version) >= version.parse('1.2.0'):
+             self.encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
+         else:
+             self.encoder = OneHotEncoder(sparse=False, handle_unknown='ignore')
+
+     def load_data(self, file_path):
+         """Load the dataset from a CSV file"""
+         try:
+             df = pd.read_csv(file_path)
+             return df
+         except Exception as e:
+             print(f"Error loading data: {e}")
+             return None
+
+     def preprocess_data(self, df, target_col='Class'):
+         """Preprocess the data for model training"""
+         # Impute missing values with column means (numeric columns only)
+         df = df.fillna(df.mean(numeric_only=True))
+
+         # Separate features and target
+         X = df.drop(columns=[target_col])
+         y = df[target_col]
+
+         # Split data into train and test sets
+         X_train, X_test, y_train, y_test = train_test_split(
+             X, y, test_size=0.2, random_state=42, stratify=y
+         )
+
+         # Identify numerical and categorical features
+         num_features = X.select_dtypes(include=['int64', 'float64']).columns
+         cat_features = X.select_dtypes(include=['object', 'category']).columns
+
+         # Build the preprocessing pipeline, reusing the version-appropriate
+         # encoder constructed in __init__
+         transformers = [('num', StandardScaler(), num_features)]
+         if len(cat_features) > 0:
+             transformers.append(('cat', self.encoder, cat_features))
+         preprocessor = ColumnTransformer(transformers=transformers)
+
+         # Fit and transform the training data
+         X_train_processed = preprocessor.fit_transform(X_train)
+         X_test_processed = preprocessor.transform(X_test)
+
+         # Handle class imbalance using SMOTE (training split only)
+         smote = SMOTE(random_state=42)
+         X_train_resampled, y_train_resampled = smote.fit_resample(X_train_processed, y_train)
+
+         return X_train_resampled, X_test_processed, y_train_resampled, y_test, preprocessor
+
+     def engineer_features(self, df):
+         """Create new features for fraud detection"""
+         # Copy the dataframe to avoid modifying the original
+         df_new = df.copy()
+
+         # If a Time column exists, create time-based features
+         if 'Time' in df_new.columns:
+             # Convert seconds to hour of day (assuming Time is seconds from a reference point)
+             df_new['Hour'] = (df_new['Time'] / 3600) % 24
+
+             # Flag transactions during odd hours (midnight to 5 AM)
+             df_new['Odd_Hour'] = ((df_new['Hour'] >= 0) & (df_new['Hour'] < 5)).astype(int)
+
+         # If an Amount column exists, create amount-based features
+         if 'Amount' in df_new.columns:
+             # Log transform the amount to handle its skewed distribution
+             df_new['Log_Amount'] = np.log1p(df_new['Amount'])
+
+             # Flag high-value transactions (top 5%)
+             threshold = df_new['Amount'].quantile(0.95)
+             df_new['High_Value'] = (df_new['Amount'] > threshold).astype(int)
+
+         # Transaction frequency features (assuming a card or account ID column)
+         if 'card_id' in df_new.columns:
+             # Number of transactions per card
+             tx_count = df_new.groupby('card_id').size().reset_index(name='Tx_Count')
+             df_new = df_new.merge(tx_count, on='card_id', how='left')
+
+             # Average transaction amount per card
+             avg_amount = df_new.groupby('card_id')['Amount'].mean().reset_index(name='Avg_Amount')
+             df_new = df_new.merge(avg_amount, on='card_id', how='left')
+
+             # Transaction amount deviation from the card's average
+             df_new['Amount_Deviation'] = df_new['Amount'] - df_new['Avg_Amount']
+
+         return df_new
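A minimal end-to-end sketch of how this class is meant to be used, assuming a CSV with a 'Class' target column (the file path is illustrative):

    from utils.data_processor import DataProcessor

    processor = DataProcessor()
    df = processor.load_data("data/creditcard.csv")  # illustrative path
    df = processor.engineer_features(df)             # adds Hour, Log_Amount, ...
    X_train, X_test, y_train, y_test, preprocessor = processor.preprocess_data(df, target_col='Class')
    print(X_train.shape, X_test.shape)               # SMOTE balances only the training split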
engineered_data.csv ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:3bb2af06deaefb7427a0878982917cbf2ee8270aa79339730a01b1e1972a3c00
+ size 162508357
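Note: this and the other two .csv entries in the commit are Git LFS pointer stubs, not the data itself; the ~150 MB files live in LFS storage, per the filter=lfs rules in the gitattributes file that follows.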
gitattributes ADDED
@@ -0,0 +1,38 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+ *.npz filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.pkl filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tar filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zst filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
+ engineered_data.csv filter=lfs diff=lfs merge=lfs -text
+ preprocessed_data.csv filter=lfs diff=lfs merge=lfs -text
+ uploaded_data.csv filter=lfs diff=lfs merge=lfs -text
gitkeep ADDED
File without changes
model_trainer.py ADDED
@@ -0,0 +1,121 @@
+ # utils/model_trainer.py (updated)
+ import pandas as pd
+ import numpy as np
+ import pickle
+ from sklearn.linear_model import LogisticRegression
+ from sklearn.ensemble import RandomForestClassifier
+ from xgboost import XGBClassifier
+ from tensorflow.keras.models import Sequential
+ from tensorflow.keras.layers import Dense, Dropout
+ from sklearn.metrics import (
+     accuracy_score, precision_score, recall_score, f1_score,
+     roc_auc_score, confusion_matrix, classification_report
+ )
+ import matplotlib.pyplot as plt
+ import seaborn as sns
+ import warnings
+
+ # Suppress warnings
+ warnings.filterwarnings('ignore')
+
+ class ModelTrainer:
+     def __init__(self):
+         self.models = {
+             'Logistic Regression': LogisticRegression(max_iter=1000, class_weight='balanced'),
+             'Random Forest': RandomForestClassifier(n_estimators=100, class_weight='balanced', random_state=42),
+             'XGBoost': XGBClassifier(scale_pos_weight=10, n_estimators=100, random_state=42, use_label_encoder=False, eval_metric='logloss')
+         }
+         self.neural_net = None
+
+     def train_models(self, X_train, y_train):
+         """Train multiple machine learning models"""
+         trained_models = {}
+
+         for name, model in self.models.items():
+             print(f"Training {name}...")
+             model.fit(X_train, y_train)
+             trained_models[name] = model
+
+         return trained_models
+
+     def train_neural_network(self, X_train, y_train, input_dim):
+         """Train a neural network model"""
+         model = Sequential([
+             Dense(64, activation='relu', input_dim=input_dim),
+             Dropout(0.3),
+             Dense(32, activation='relu'),
+             Dropout(0.3),
+             Dense(16, activation='relu'),
+             Dense(1, activation='sigmoid')
+         ])
+
+         model.compile(
+             optimizer='adam',
+             loss='binary_crossentropy',
+             metrics=['accuracy']
+         )
+
+         history = model.fit(
+             X_train, y_train,
+             epochs=20,
+             batch_size=64,
+             validation_split=0.2,
+             verbose=1
+         )
+
+         self.neural_net = model
+         return model, history
+
+     def evaluate_model(self, model, X_test, y_test, model_name="Model"):
+         """Evaluate model performance with various metrics"""
+         if model_name == "Neural Network":
+             # Keras returns an (n, 1) array; flatten it for the metric functions
+             y_pred_proba = model.predict(X_test).ravel()
+             y_pred = (y_pred_proba > 0.5).astype(int)
+         else:
+             y_pred = model.predict(X_test)
+             y_pred_proba = model.predict_proba(X_test)[:, 1]
+
+         # Calculate metrics
+         accuracy = accuracy_score(y_test, y_pred)
+         precision = precision_score(y_test, y_pred)
+         recall = recall_score(y_test, y_pred)
+         f1 = f1_score(y_test, y_pred)
+         auc = roc_auc_score(y_test, y_pred_proba)
+
+         # Create confusion matrix
+         cm = confusion_matrix(y_test, y_pred)
+
+         # Detailed classification report
+         report = classification_report(y_test, y_pred)
+
+         results = {
+             'model_name': model_name,
+             'accuracy': accuracy,
+             'precision': precision,
+             'recall': recall,
+             'f1_score': f1,
+             'auc': auc,
+             'confusion_matrix': cm,
+             'classification_report': report,
+             'y_test': y_test,
+             'y_pred_proba': y_pred_proba
+         }
+
+         return results
+
+     def save_model(self, model, file_path):
+         """Save the trained model to a file"""
+         if isinstance(model, Sequential):
+             model.save(file_path)
+         else:
+             with open(file_path, 'wb') as f:
+                 pickle.dump(model, f)
+
+     def load_model(self, file_path, model_type='sklearn'):
+         """Load a trained model from a file"""
+         if model_type == 'keras':
+             from tensorflow.keras.models import load_model
+             return load_model(file_path)
+         else:
+             with open(file_path, 'rb') as f:
+                 return pickle.load(f)
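A minimal sketch of the intended training/evaluation loop, using the split produced by DataProcessor.preprocess_data (variable names assumed from that module):

    from utils.model_trainer import ModelTrainer

    trainer = ModelTrainer()
    trained = trainer.train_models(X_train, y_train)
    for name, model in trained.items():
        results = trainer.evaluate_model(model, X_test, y_test, model_name=name)
        print(name, f"AUC={results['auc']:.3f}", f"F1={results['f1_score']:.3f}")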
preprocessed_data.csv ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:895dc3dad2840ac9e05c12d6442bd739d879c2d405e9b065efc0f1973be46a84
+ size 151102405
requirements.txt ADDED
@@ -0,0 +1,11 @@
+ streamlit
+ pandas
+ numpy
+ scikit-learn
+ matplotlib
+ seaborn
+ plotly
+ imbalanced-learn
+ xgboost
+ tensorflow
+ shap
uploaded_data.csv ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:895dc3dad2840ac9e05c12d6442bd739d879c2d405e9b065efc0f1973be46a84
+ size 151102405
visualizer.py ADDED
@@ -0,0 +1,162 @@
+ import pandas as pd
+ import numpy as np
+ import matplotlib.pyplot as plt
+ import seaborn as sns
+ import plotly.express as px
+ import plotly.graph_objects as go
+ from sklearn.metrics import roc_curve, precision_recall_curve
+ import shap
+
+ class Visualizer:
+     def __init__(self):
+         pass
+
+     def plot_class_distribution(self, df, target_col='Class'):
+         """Plot the distribution of fraud vs non-fraud transactions"""
+         plt.figure(figsize=(10, 6))
+         sns.countplot(x=target_col, data=df)
+         plt.title('Class Distribution (Fraud vs Non-Fraud)')
+         plt.xlabel('Class (0: Normal, 1: Fraud)')
+         plt.ylabel('Count')
+
+         # Add percentage labels
+         total = len(df)
+         for p in plt.gca().patches:
+             height = p.get_height()
+             plt.text(p.get_x() + p.get_width() / 2.,
+                      height + 3,
+                      '{:.2f}%'.format(100 * height / total),
+                      ha="center")
+
+         # Return the Figure object so callers can pass it to st.pyplot()
+         return plt.gcf()
+
+     def plot_feature_distributions(self, df, target_col='Class', n_features=5):
+         """Plot distributions of the first n numerical features by class"""
+         # Select numerical columns only
+         num_cols = df.select_dtypes(include=['int64', 'float64']).columns
+         num_cols = [col for col in num_cols if col != target_col]
+
+         # If there are too many features, select a subset
+         if len(num_cols) > n_features:
+             num_cols = num_cols[:n_features]
+
+         # Create subplots
+         fig, axes = plt.subplots(len(num_cols), 1, figsize=(12, 4 * len(num_cols)))
+
+         # If there's only one feature, axes won't be an array
+         if len(num_cols) == 1:
+             axes = [axes]
+
+         for i, col in enumerate(num_cols):
+             sns.histplot(data=df, x=col, hue=target_col, bins=50, ax=axes[i], kde=True)
+             axes[i].set_title(f'Distribution of {col} by Class')
+
+         plt.tight_layout()
+         return fig
+
+     def plot_correlation_matrix(self, df, target_col='Class'):
+         """Plot correlation matrix of features"""
+         # Calculate correlation matrix
+         corr_matrix = df.corr()
+
+         # Create heatmap (upper triangle masked to avoid duplication)
+         plt.figure(figsize=(12, 10))
+         mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
+         sns.heatmap(corr_matrix, mask=mask, annot=False, cmap='coolwarm',
+                     linewidths=0.5, vmin=-1, vmax=1)
+         plt.title('Feature Correlation Matrix')
+
+         return plt.gcf()
+
+     def plot_feature_importance(self, model, feature_names, model_name="Model"):
+         """Plot feature importance for tree-based models"""
+         if hasattr(model, 'feature_importances_'):
+             # Get feature importances
+             importances = model.feature_importances_
+
+             # Sort feature importances in descending order
+             indices = np.argsort(importances)[::-1]
+
+             # Rearrange feature names so they match the sorted importances
+             names = [feature_names[i] for i in indices]
+
+             # Create plot
+             plt.figure(figsize=(12, 8))
+             plt.title(f"Feature Importance ({model_name})")
+             plt.bar(range(len(importances)), importances[indices])
+             plt.xticks(range(len(importances)), names, rotation=90)
+             plt.tight_layout()
+
+             return plt.gcf()
+         else:
+             print(f"Model {model_name} doesn't have a feature_importances_ attribute")
+             return None
+
+     def plot_roc_curve(self, models_results):
+         """Plot ROC curves for multiple models"""
+         plt.figure(figsize=(10, 8))
+
+         for result in models_results:
+             model_name = result['model_name']
+             y_test = result['y_test']
+             y_pred_proba = result['y_pred_proba']
+
+             fpr, tpr, _ = roc_curve(y_test, y_pred_proba)
+             auc = result['auc']
+
+             plt.plot(fpr, tpr, label=f'{model_name} (AUC = {auc:.3f})')
+
+         plt.plot([0, 1], [0, 1], 'k--')
+         plt.xlabel('False Positive Rate')
+         plt.ylabel('True Positive Rate')
+         plt.title('ROC Curve')
+         plt.legend(loc='best')
+
+         return plt.gcf()
+
+     def plot_precision_recall_curve(self, models_results):
+         """Plot Precision-Recall curves for multiple models"""
+         plt.figure(figsize=(10, 8))
+
+         for result in models_results:
+             model_name = result['model_name']
+             y_test = result['y_test']
+             y_pred_proba = result['y_pred_proba']
+
+             precision, recall, _ = precision_recall_curve(y_test, y_pred_proba)
+
+             plt.plot(recall, precision, label=f'{model_name}')
+
+         plt.xlabel('Recall')
+         plt.ylabel('Precision')
+         plt.title('Precision-Recall Curve')
+         plt.legend(loc='best')
+
+         return plt.gcf()
+
+     def plot_confusion_matrix(self, cm, model_name="Model"):
+         """Plot confusion matrix"""
+         plt.figure(figsize=(8, 6))
+         sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
+         plt.title(f'Confusion Matrix - {model_name}')
+         plt.ylabel('Actual')
+         plt.xlabel('Predicted')
+
+         return plt.gcf()
+
+     def plot_shap_values(self, model, X_test, feature_names, model_name="Model"):
+         """Plot SHAP values to explain model predictions"""
+         # Create explainer
+         if model_name == "XGBoost":
+             explainer = shap.TreeExplainer(model)
+         else:
+             explainer = shap.Explainer(model)
+
+         # Calculate SHAP values
+         shap_values = explainer.shap_values(X_test)
+
+         # Summary plot
+         plt.figure(figsize=(12, 8))
+         shap.summary_plot(shap_values, X_test, feature_names=feature_names)
+
+         return plt.gcf()
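A minimal usage sketch: the curve methods return a matplotlib Figure built from the result dicts produced by ModelTrainer.evaluate_model, so they drop straight into Streamlit:

    import streamlit as st
    from utils.visualizer import Visualizer

    viz = Visualizer()
    fig = viz.plot_roc_curve([results])  # `results` from ModelTrainer.evaluate_model
    st.pyplot(fig)                       # or fig.savefig("roc_curve.png") outside Streamlit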