RichardVR committed on
Commit 3d3965f · verified · 1 Parent(s): 38f20d7

Upload Direction Classification.ipynb

Copper Google Trend Analysis/Direction Classification.ipynb ADDED
@@ -0,0 +1,2206 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "code",
5
+ "execution_count": 2,
6
+ "id": "16e2f19c",
7
+ "metadata": {},
8
+ "outputs": [
9
+ {
10
+ "name": "stdout",
11
+ "output_type": "stream",
12
+ "text": [
13
+ "\n",
14
+ "── Label distribution across five splits ──\n"
15
+ ]
16
+ },
17
+ {
18
+ "data": {
19
+ "text/html": [
20
+ "<div>\n",
21
+ "<style scoped>\n",
22
+ " .dataframe tbody tr th:only-of-type {\n",
23
+ " vertical-align: middle;\n",
24
+ " }\n",
25
+ "\n",
26
+ " .dataframe tbody tr th {\n",
27
+ " vertical-align: top;\n",
28
+ " }\n",
29
+ "\n",
30
+ " .dataframe thead th {\n",
31
+ " text-align: right;\n",
32
+ " }\n",
33
+ "</style>\n",
34
+ "<table border=\"1\" class=\"dataframe\">\n",
35
+ " <thead>\n",
36
+ " <tr style=\"text-align: right;\">\n",
37
+ " <th></th>\n",
38
+ " <th>Train 0</th>\n",
39
+ " <th>Train 1</th>\n",
40
+ " <th>Train 0 %</th>\n",
41
+ " <th>Train 1 %</th>\n",
42
+ " <th>Test 0</th>\n",
43
+ " <th>Test 1</th>\n",
44
+ " <th>Test 0 %</th>\n",
45
+ " <th>Test 1 %</th>\n",
46
+ " </tr>\n",
47
+ " <tr>\n",
48
+ " <th>Split</th>\n",
49
+ " <th></th>\n",
50
+ " <th></th>\n",
51
+ " <th></th>\n",
52
+ " <th></th>\n",
53
+ " <th></th>\n",
54
+ " <th></th>\n",
55
+ " <th></th>\n",
56
+ " <th></th>\n",
57
+ " </tr>\n",
58
+ " </thead>\n",
59
+ " <tbody>\n",
60
+ " <tr>\n",
61
+ " <th>1</th>\n",
62
+ " <td>69</td>\n",
63
+ " <td>85</td>\n",
64
+ " <td>44.8%</td>\n",
65
+ " <td>55.2%</td>\n",
66
+ " <td>24</td>\n",
67
+ " <td>27</td>\n",
68
+ " <td>47.1%</td>\n",
69
+ " <td>52.9%</td>\n",
70
+ " </tr>\n",
71
+ " <tr>\n",
72
+ " <th>2</th>\n",
73
+ " <td>77</td>\n",
74
+ " <td>90</td>\n",
75
+ " <td>46.1%</td>\n",
76
+ " <td>53.9%</td>\n",
77
+ " <td>23</td>\n",
78
+ " <td>28</td>\n",
79
+ " <td>45.1%</td>\n",
80
+ " <td>54.9%</td>\n",
81
+ " </tr>\n",
82
+ " <tr>\n",
83
+ " <th>3</th>\n",
84
+ " <td>85</td>\n",
85
+ " <td>95</td>\n",
86
+ " <td>47.2%</td>\n",
87
+ " <td>52.8%</td>\n",
88
+ " <td>23</td>\n",
89
+ " <td>28</td>\n",
90
+ " <td>45.1%</td>\n",
91
+ " <td>54.9%</td>\n",
92
+ " </tr>\n",
93
+ " <tr>\n",
94
+ " <th>4</th>\n",
95
+ " <td>91</td>\n",
96
+ " <td>102</td>\n",
97
+ " <td>47.2%</td>\n",
98
+ " <td>52.8%</td>\n",
99
+ " <td>23</td>\n",
100
+ " <td>28</td>\n",
101
+ " <td>45.1%</td>\n",
102
+ " <td>54.9%</td>\n",
103
+ " </tr>\n",
104
+ " <tr>\n",
105
+ " <th>5</th>\n",
106
+ " <td>93</td>\n",
107
+ " <td>113</td>\n",
108
+ " <td>45.1%</td>\n",
109
+ " <td>54.9%</td>\n",
110
+ " <td>27</td>\n",
111
+ " <td>24</td>\n",
112
+ " <td>52.9%</td>\n",
113
+ " <td>47.1%</td>\n",
114
+ " </tr>\n",
115
+ " </tbody>\n",
116
+ "</table>\n",
117
+ "</div>"
118
+ ],
119
+ "text/plain": [
120
+ " Train 0 Train 1 Train 0 % Train 1 % Test 0 Test 1 Test 0 % Test 1 %\n",
121
+ "Split \n",
122
+ "1 69 85 44.8% 55.2% 24 27 47.1% 52.9%\n",
123
+ "2 77 90 46.1% 53.9% 23 28 45.1% 54.9%\n",
124
+ "3 85 95 47.2% 52.8% 23 28 45.1% 54.9%\n",
125
+ "4 91 102 47.2% 52.8% 23 28 45.1% 54.9%\n",
126
+ "5 93 113 45.1% 54.9% 27 24 52.9% 47.1%"
127
+ ]
128
+ },
129
+ "metadata": {},
130
+ "output_type": "display_data"
131
+ },
132
+ {
133
+ "name": "stdout",
134
+ "output_type": "stream",
135
+ "text": [
136
+ "\n",
137
+ "── Accuracy per split (plus Avg & Max) ──\n"
138
+ ]
139
+ },
140
+ {
141
+ "data": {
142
+ "text/html": [
143
+ "<div>\n",
144
+ "<style scoped>\n",
145
+ " .dataframe tbody tr th:only-of-type {\n",
146
+ " vertical-align: middle;\n",
147
+ " }\n",
148
+ "\n",
149
+ " .dataframe tbody tr th {\n",
150
+ " vertical-align: top;\n",
151
+ " }\n",
152
+ "\n",
153
+ " .dataframe thead th {\n",
154
+ " text-align: right;\n",
155
+ " }\n",
156
+ "</style>\n",
157
+ "<table border=\"1\" class=\"dataframe\">\n",
158
+ " <thead>\n",
159
+ " <tr style=\"text-align: right;\">\n",
160
+ " <th></th>\n",
161
+ " <th>Split</th>\n",
162
+ " <th>1</th>\n",
163
+ " <th>2</th>\n",
164
+ " <th>3</th>\n",
165
+ " <th>4</th>\n",
166
+ " <th>5</th>\n",
167
+ " <th>Avg</th>\n",
168
+ " <th>Max</th>\n",
169
+ " </tr>\n",
170
+ " <tr>\n",
171
+ " <th>Model</th>\n",
172
+ " <th>Scenario</th>\n",
173
+ " <th></th>\n",
174
+ " <th></th>\n",
175
+ " <th></th>\n",
176
+ " <th></th>\n",
177
+ " <th></th>\n",
178
+ " <th></th>\n",
179
+ " <th></th>\n",
180
+ " </tr>\n",
181
+ " </thead>\n",
182
+ " <tbody>\n",
183
+ " <tr>\n",
184
+ " <th rowspan=\"3\" valign=\"top\">Decision Tree</th>\n",
185
+ " <th>0.05</th>\n",
186
+ " <td>56.86%</td>\n",
187
+ " <td>60.78%</td>\n",
188
+ " <td>45.10%</td>\n",
189
+ " <td>49.02%</td>\n",
190
+ " <td>49.02%</td>\n",
191
+ " <td>52.16%</td>\n",
192
+ " <td>60.78%</td>\n",
193
+ " </tr>\n",
194
+ " <tr>\n",
195
+ " <th>0.10</th>\n",
196
+ " <td>47.06%</td>\n",
197
+ " <td>56.86%</td>\n",
198
+ " <td>60.78%</td>\n",
199
+ " <td>49.02%</td>\n",
200
+ " <td>41.18%</td>\n",
201
+ " <td>50.98%</td>\n",
202
+ " <td>60.78%</td>\n",
203
+ " </tr>\n",
204
+ " <tr>\n",
205
+ " <th>without</th>\n",
206
+ " <td>62.75%</td>\n",
207
+ " <td>62.75%</td>\n",
208
+ " <td>56.86%</td>\n",
209
+ " <td>49.02%</td>\n",
210
+ " <td>58.82%</td>\n",
211
+ " <td>58.04%</td>\n",
212
+ " <td>62.75%</td>\n",
213
+ " </tr>\n",
214
+ " <tr>\n",
215
+ " <th rowspan=\"3\" valign=\"top\">Logistic Regression</th>\n",
216
+ " <th>0.05</th>\n",
217
+ " <td>56.86%</td>\n",
218
+ " <td>49.02%</td>\n",
219
+ " <td>49.02%</td>\n",
220
+ " <td>49.02%</td>\n",
221
+ " <td>56.86%</td>\n",
222
+ " <td>52.16%</td>\n",
223
+ " <td>56.86%</td>\n",
224
+ " </tr>\n",
225
+ " <tr>\n",
226
+ " <th>0.10</th>\n",
227
+ " <td>58.82%</td>\n",
228
+ " <td>39.22%</td>\n",
229
+ " <td>45.10%</td>\n",
230
+ " <td>47.06%</td>\n",
231
+ " <td>56.86%</td>\n",
232
+ " <td>49.41%</td>\n",
233
+ " <td>58.82%</td>\n",
234
+ " </tr>\n",
235
+ " <tr>\n",
236
+ " <th>without</th>\n",
237
+ " <td>56.86%</td>\n",
238
+ " <td>56.86%</td>\n",
239
+ " <td>54.90%</td>\n",
240
+ " <td>52.94%</td>\n",
241
+ " <td>52.94%</td>\n",
242
+ " <td>54.90%</td>\n",
243
+ " <td>56.86%</td>\n",
244
+ " </tr>\n",
245
+ " <tr>\n",
246
+ " <th rowspan=\"3\" valign=\"top\">Random Forest</th>\n",
247
+ " <th>0.05</th>\n",
248
+ " <td>41.18%</td>\n",
249
+ " <td>47.06%</td>\n",
250
+ " <td>49.02%</td>\n",
251
+ " <td>47.06%</td>\n",
252
+ " <td>47.06%</td>\n",
253
+ " <td>46.27%</td>\n",
254
+ " <td>49.02%</td>\n",
255
+ " </tr>\n",
256
+ " <tr>\n",
257
+ " <th>0.10</th>\n",
258
+ " <td>37.25%</td>\n",
259
+ " <td>45.10%</td>\n",
260
+ " <td>49.02%</td>\n",
261
+ " <td>47.06%</td>\n",
262
+ " <td>47.06%</td>\n",
263
+ " <td>45.10%</td>\n",
264
+ " <td>49.02%</td>\n",
265
+ " </tr>\n",
266
+ " <tr>\n",
267
+ " <th>without</th>\n",
268
+ " <td>52.94%</td>\n",
269
+ " <td>60.78%</td>\n",
270
+ " <td>58.82%</td>\n",
271
+ " <td>60.78%</td>\n",
272
+ " <td>58.82%</td>\n",
273
+ " <td>58.43%</td>\n",
274
+ " <td>60.78%</td>\n",
275
+ " </tr>\n",
276
+ " <tr>\n",
277
+ " <th rowspan=\"3\" valign=\"top\">SVM</th>\n",
278
+ " <th>0.05</th>\n",
279
+ " <td>47.06%</td>\n",
280
+ " <td>58.82%</td>\n",
281
+ " <td>45.10%</td>\n",
282
+ " <td>47.06%</td>\n",
283
+ " <td>47.06%</td>\n",
284
+ " <td>49.02%</td>\n",
285
+ " <td>58.82%</td>\n",
286
+ " </tr>\n",
287
+ " <tr>\n",
288
+ " <th>0.10</th>\n",
289
+ " <td>54.90%</td>\n",
290
+ " <td>54.90%</td>\n",
291
+ " <td>45.10%</td>\n",
292
+ " <td>45.10%</td>\n",
293
+ " <td>45.10%</td>\n",
294
+ " <td>49.02%</td>\n",
295
+ " <td>54.90%</td>\n",
296
+ " </tr>\n",
297
+ " <tr>\n",
298
+ " <th>without</th>\n",
299
+ " <td>60.78%</td>\n",
300
+ " <td>52.94%</td>\n",
301
+ " <td>45.10%</td>\n",
302
+ " <td>50.98%</td>\n",
303
+ " <td>52.94%</td>\n",
304
+ " <td>52.55%</td>\n",
305
+ " <td>60.78%</td>\n",
306
+ " </tr>\n",
307
+ " <tr>\n",
308
+ " <th rowspan=\"3\" valign=\"top\">XGBoost</th>\n",
309
+ " <th>0.05</th>\n",
310
+ " <td>52.94%</td>\n",
311
+ " <td>50.98%</td>\n",
312
+ " <td>49.02%</td>\n",
313
+ " <td>50.98%</td>\n",
314
+ " <td>56.86%</td>\n",
315
+ " <td>52.16%</td>\n",
316
+ " <td>56.86%</td>\n",
317
+ " </tr>\n",
318
+ " <tr>\n",
319
+ " <th>0.10</th>\n",
320
+ " <td>49.02%</td>\n",
321
+ " <td>52.94%</td>\n",
322
+ " <td>43.14%</td>\n",
323
+ " <td>52.94%</td>\n",
324
+ " <td>50.98%</td>\n",
325
+ " <td>49.80%</td>\n",
326
+ " <td>52.94%</td>\n",
327
+ " </tr>\n",
328
+ " <tr>\n",
329
+ " <th>without</th>\n",
330
+ " <td>58.82%</td>\n",
331
+ " <td>60.78%</td>\n",
332
+ " <td>56.86%</td>\n",
333
+ " <td>64.71%</td>\n",
334
+ " <td>58.82%</td>\n",
335
+ " <td>60.00%</td>\n",
336
+ " <td>64.71%</td>\n",
337
+ " </tr>\n",
338
+ " </tbody>\n",
339
+ "</table>\n",
340
+ "</div>"
341
+ ],
342
+ "text/plain": [
343
+ "Split 1 2 3 4 5 Avg \\\n",
344
+ "Model Scenario \n",
345
+ "Decision Tree 0.05 56.86% 60.78% 45.10% 49.02% 49.02% 52.16% \n",
346
+ " 0.10 47.06% 56.86% 60.78% 49.02% 41.18% 50.98% \n",
347
+ " without 62.75% 62.75% 56.86% 49.02% 58.82% 58.04% \n",
348
+ "Logistic Regression 0.05 56.86% 49.02% 49.02% 49.02% 56.86% 52.16% \n",
349
+ " 0.10 58.82% 39.22% 45.10% 47.06% 56.86% 49.41% \n",
350
+ " without 56.86% 56.86% 54.90% 52.94% 52.94% 54.90% \n",
351
+ "Random Forest 0.05 41.18% 47.06% 49.02% 47.06% 47.06% 46.27% \n",
352
+ " 0.10 37.25% 45.10% 49.02% 47.06% 47.06% 45.10% \n",
353
+ " without 52.94% 60.78% 58.82% 60.78% 58.82% 58.43% \n",
354
+ "SVM 0.05 47.06% 58.82% 45.10% 47.06% 47.06% 49.02% \n",
355
+ " 0.10 54.90% 54.90% 45.10% 45.10% 45.10% 49.02% \n",
356
+ " without 60.78% 52.94% 45.10% 50.98% 52.94% 52.55% \n",
357
+ "XGBoost 0.05 52.94% 50.98% 49.02% 50.98% 56.86% 52.16% \n",
358
+ " 0.10 49.02% 52.94% 43.14% 52.94% 50.98% 49.80% \n",
359
+ " without 58.82% 60.78% 56.86% 64.71% 58.82% 60.00% \n",
360
+ "\n",
361
+ "Split Max \n",
362
+ "Model Scenario \n",
363
+ "Decision Tree 0.05 60.78% \n",
364
+ " 0.10 60.78% \n",
365
+ " without 62.75% \n",
366
+ "Logistic Regression 0.05 56.86% \n",
367
+ " 0.10 58.82% \n",
368
+ " without 56.86% \n",
369
+ "Random Forest 0.05 49.02% \n",
370
+ " 0.10 49.02% \n",
371
+ " without 60.78% \n",
372
+ "SVM 0.05 58.82% \n",
373
+ " 0.10 54.90% \n",
374
+ " without 60.78% \n",
375
+ "XGBoost 0.05 56.86% \n",
376
+ " 0.10 52.94% \n",
377
+ " without 64.71% "
378
+ ]
379
+ },
380
+ "metadata": {},
381
+ "output_type": "display_data"
382
+ },
383
+ {
384
+ "name": "stdout",
385
+ "output_type": "stream",
386
+ "text": [
387
+ "\n",
388
+ "── F1-score per split (plus Avg & Max) ──\n"
389
+ ]
390
+ },
391
+ {
392
+ "data": {
393
+ "text/html": [
394
+ "<div>\n",
395
+ "<style scoped>\n",
396
+ " .dataframe tbody tr th:only-of-type {\n",
397
+ " vertical-align: middle;\n",
398
+ " }\n",
399
+ "\n",
400
+ " .dataframe tbody tr th {\n",
401
+ " vertical-align: top;\n",
402
+ " }\n",
403
+ "\n",
404
+ " .dataframe thead th {\n",
405
+ " text-align: right;\n",
406
+ " }\n",
407
+ "</style>\n",
408
+ "<table border=\"1\" class=\"dataframe\">\n",
409
+ " <thead>\n",
410
+ " <tr style=\"text-align: right;\">\n",
411
+ " <th></th>\n",
412
+ " <th>Split</th>\n",
413
+ " <th>1</th>\n",
414
+ " <th>2</th>\n",
415
+ " <th>3</th>\n",
416
+ " <th>4</th>\n",
417
+ " <th>5</th>\n",
418
+ " <th>Avg</th>\n",
419
+ " <th>Max</th>\n",
420
+ " </tr>\n",
421
+ " <tr>\n",
422
+ " <th>Model</th>\n",
423
+ " <th>Scenario</th>\n",
424
+ " <th></th>\n",
425
+ " <th></th>\n",
426
+ " <th></th>\n",
427
+ " <th></th>\n",
428
+ " <th></th>\n",
429
+ " <th></th>\n",
430
+ " <th></th>\n",
431
+ " </tr>\n",
432
+ " </thead>\n",
433
+ " <tbody>\n",
434
+ " <tr>\n",
435
+ " <th rowspan=\"3\" valign=\"top\">Decision Tree</th>\n",
436
+ " <th>0.05</th>\n",
437
+ " <td>60.71%</td>\n",
438
+ " <td>61.54%</td>\n",
439
+ " <td>48.15%</td>\n",
440
+ " <td>45.83%</td>\n",
441
+ " <td>61.76%</td>\n",
442
+ " <td>55.60%</td>\n",
443
+ " <td>61.76%</td>\n",
444
+ " </tr>\n",
445
+ " <tr>\n",
446
+ " <th>0.10</th>\n",
447
+ " <td>57.14%</td>\n",
448
+ " <td>50.00%</td>\n",
449
+ " <td>64.29%</td>\n",
450
+ " <td>45.83%</td>\n",
451
+ " <td>37.50%</td>\n",
452
+ " <td>50.95%</td>\n",
453
+ " <td>64.29%</td>\n",
454
+ " </tr>\n",
455
+ " <tr>\n",
456
+ " <th>without</th>\n",
457
+ " <td>72.46%</td>\n",
458
+ " <td>64.15%</td>\n",
459
+ " <td>47.62%</td>\n",
460
+ " <td>51.85%</td>\n",
461
+ " <td>46.15%</td>\n",
462
+ " <td>56.45%</td>\n",
463
+ " <td>72.46%</td>\n",
464
+ " </tr>\n",
465
+ " <tr>\n",
466
+ " <th rowspan=\"3\" valign=\"top\">Logistic Regression</th>\n",
467
+ " <th>0.05</th>\n",
468
+ " <td>45.00%</td>\n",
469
+ " <td>27.78%</td>\n",
470
+ " <td>23.53%</td>\n",
471
+ " <td>13.33%</td>\n",
472
+ " <td>63.33%</td>\n",
473
+ " <td>34.59%</td>\n",
474
+ " <td>63.33%</td>\n",
475
+ " </tr>\n",
476
+ " <tr>\n",
477
+ " <th>0.10</th>\n",
478
+ " <td>66.67%</td>\n",
479
+ " <td>45.61%</td>\n",
480
+ " <td>0.00%</td>\n",
481
+ " <td>6.90%</td>\n",
482
+ " <td>63.33%</td>\n",
483
+ " <td>36.50%</td>\n",
484
+ " <td>66.67%</td>\n",
485
+ " </tr>\n",
486
+ " <tr>\n",
487
+ " <th>without</th>\n",
488
+ " <td>60.71%</td>\n",
489
+ " <td>56.00%</td>\n",
490
+ " <td>46.51%</td>\n",
491
+ " <td>25.00%</td>\n",
492
+ " <td>0.00%</td>\n",
493
+ " <td>37.65%</td>\n",
494
+ " <td>60.71%</td>\n",
495
+ " </tr>\n",
496
+ " <tr>\n",
497
+ " <th rowspan=\"3\" valign=\"top\">Random Forest</th>\n",
498
+ " <th>0.05</th>\n",
499
+ " <td>44.44%</td>\n",
500
+ " <td>27.03%</td>\n",
501
+ " <td>23.53%</td>\n",
502
+ " <td>18.18%</td>\n",
503
+ " <td>64.00%</td>\n",
504
+ " <td>35.44%</td>\n",
505
+ " <td>64.00%</td>\n",
506
+ " </tr>\n",
507
+ " <tr>\n",
508
+ " <th>0.10</th>\n",
509
+ " <td>42.86%</td>\n",
510
+ " <td>6.67%</td>\n",
511
+ " <td>13.33%</td>\n",
512
+ " <td>6.90%</td>\n",
513
+ " <td>64.00%</td>\n",
514
+ " <td>26.75%</td>\n",
515
+ " <td>64.00%</td>\n",
516
+ " </tr>\n",
517
+ " <tr>\n",
518
+ " <th>without</th>\n",
519
+ " <td>67.57%</td>\n",
520
+ " <td>61.54%</td>\n",
521
+ " <td>57.14%</td>\n",
522
+ " <td>60.00%</td>\n",
523
+ " <td>46.15%</td>\n",
524
+ " <td>58.48%</td>\n",
525
+ " <td>67.57%</td>\n",
526
+ " </tr>\n",
527
+ " <tr>\n",
528
+ " <th rowspan=\"3\" valign=\"top\">SVM</th>\n",
529
+ " <th>0.05</th>\n",
530
+ " <td>0.00%</td>\n",
531
+ " <td>61.82%</td>\n",
532
+ " <td>0.00%</td>\n",
533
+ " <td>6.90%</td>\n",
534
+ " <td>64.00%</td>\n",
535
+ " <td>26.54%</td>\n",
536
+ " <td>64.00%</td>\n",
537
+ " </tr>\n",
538
+ " <tr>\n",
539
+ " <th>0.10</th>\n",
540
+ " <td>64.62%</td>\n",
541
+ " <td>70.89%</td>\n",
542
+ " <td>0.00%</td>\n",
543
+ " <td>0.00%</td>\n",
544
+ " <td>61.11%</td>\n",
545
+ " <td>39.32%</td>\n",
546
+ " <td>70.89%</td>\n",
547
+ " </tr>\n",
548
+ " <tr>\n",
549
+ " <th>without</th>\n",
550
+ " <td>50.00%</td>\n",
551
+ " <td>45.45%</td>\n",
552
+ " <td>0.00%</td>\n",
553
+ " <td>19.35%</td>\n",
554
+ " <td>0.00%</td>\n",
555
+ " <td>22.96%</td>\n",
556
+ " <td>50.00%</td>\n",
557
+ " </tr>\n",
558
+ " <tr>\n",
559
+ " <th rowspan=\"3\" valign=\"top\">XGBoost</th>\n",
560
+ " <th>0.05</th>\n",
561
+ " <td>58.62%</td>\n",
562
+ " <td>46.81%</td>\n",
563
+ " <td>31.58%</td>\n",
564
+ " <td>28.57%</td>\n",
565
+ " <td>67.65%</td>\n",
566
+ " <td>46.65%</td>\n",
567
+ " <td>67.65%</td>\n",
568
+ " </tr>\n",
569
+ " <tr>\n",
570
+ " <th>0.10</th>\n",
571
+ " <td>53.57%</td>\n",
572
+ " <td>42.86%</td>\n",
573
+ " <td>21.62%</td>\n",
574
+ " <td>33.33%</td>\n",
575
+ " <td>65.75%</td>\n",
576
+ " <td>43.43%</td>\n",
577
+ " <td>65.75%</td>\n",
578
+ " </tr>\n",
579
+ " <tr>\n",
580
+ " <th>without</th>\n",
581
+ " <td>68.66%</td>\n",
582
+ " <td>62.96%</td>\n",
583
+ " <td>52.17%</td>\n",
584
+ " <td>67.86%</td>\n",
585
+ " <td>53.33%</td>\n",
586
+ " <td>61.00%</td>\n",
587
+ " <td>68.66%</td>\n",
588
+ " </tr>\n",
589
+ " </tbody>\n",
590
+ "</table>\n",
591
+ "</div>"
592
+ ],
593
+ "text/plain": [
594
+ "Split 1 2 3 4 5 Avg \\\n",
595
+ "Model Scenario \n",
596
+ "Decision Tree 0.05 60.71% 61.54% 48.15% 45.83% 61.76% 55.60% \n",
597
+ " 0.10 57.14% 50.00% 64.29% 45.83% 37.50% 50.95% \n",
598
+ " without 72.46% 64.15% 47.62% 51.85% 46.15% 56.45% \n",
599
+ "Logistic Regression 0.05 45.00% 27.78% 23.53% 13.33% 63.33% 34.59% \n",
600
+ " 0.10 66.67% 45.61% 0.00% 6.90% 63.33% 36.50% \n",
601
+ " without 60.71% 56.00% 46.51% 25.00% 0.00% 37.65% \n",
602
+ "Random Forest 0.05 44.44% 27.03% 23.53% 18.18% 64.00% 35.44% \n",
603
+ " 0.10 42.86% 6.67% 13.33% 6.90% 64.00% 26.75% \n",
604
+ " without 67.57% 61.54% 57.14% 60.00% 46.15% 58.48% \n",
605
+ "SVM 0.05 0.00% 61.82% 0.00% 6.90% 64.00% 26.54% \n",
606
+ " 0.10 64.62% 70.89% 0.00% 0.00% 61.11% 39.32% \n",
607
+ " without 50.00% 45.45% 0.00% 19.35% 0.00% 22.96% \n",
608
+ "XGBoost 0.05 58.62% 46.81% 31.58% 28.57% 67.65% 46.65% \n",
609
+ " 0.10 53.57% 42.86% 21.62% 33.33% 65.75% 43.43% \n",
610
+ " without 68.66% 62.96% 52.17% 67.86% 53.33% 61.00% \n",
611
+ "\n",
612
+ "Split Max \n",
613
+ "Model Scenario \n",
614
+ "Decision Tree 0.05 61.76% \n",
615
+ " 0.10 64.29% \n",
616
+ " without 72.46% \n",
617
+ "Logistic Regression 0.05 63.33% \n",
618
+ " 0.10 66.67% \n",
619
+ " without 60.71% \n",
620
+ "Random Forest 0.05 64.00% \n",
621
+ " 0.10 64.00% \n",
622
+ " without 67.57% \n",
623
+ "SVM 0.05 64.00% \n",
624
+ " 0.10 70.89% \n",
625
+ " without 50.00% \n",
626
+ "XGBoost 0.05 67.65% \n",
627
+ " 0.10 65.75% \n",
628
+ " without 68.66% "
629
+ ]
630
+ },
631
+ "metadata": {},
632
+ "output_type": "display_data"
633
+ },
634
+ {
635
+ "name": "stdout",
636
+ "output_type": "stream",
637
+ "text": [
638
+ "\n",
639
+ "── AUC per split (plus Avg & Max) ──\n"
640
+ ]
641
+ },
642
+ {
643
+ "data": {
644
+ "text/html": [
645
+ "<div>\n",
646
+ "<style scoped>\n",
647
+ " .dataframe tbody tr th:only-of-type {\n",
648
+ " vertical-align: middle;\n",
649
+ " }\n",
650
+ "\n",
651
+ " .dataframe tbody tr th {\n",
652
+ " vertical-align: top;\n",
653
+ " }\n",
654
+ "\n",
655
+ " .dataframe thead th {\n",
656
+ " text-align: right;\n",
657
+ " }\n",
658
+ "</style>\n",
659
+ "<table border=\"1\" class=\"dataframe\">\n",
660
+ " <thead>\n",
661
+ " <tr style=\"text-align: right;\">\n",
662
+ " <th></th>\n",
663
+ " <th>Split</th>\n",
664
+ " <th>1</th>\n",
665
+ " <th>2</th>\n",
666
+ " <th>3</th>\n",
667
+ " <th>4</th>\n",
668
+ " <th>5</th>\n",
669
+ " <th>Avg</th>\n",
670
+ " <th>Max</th>\n",
671
+ " </tr>\n",
672
+ " <tr>\n",
673
+ " <th>Model</th>\n",
674
+ " <th>Scenario</th>\n",
675
+ " <th></th>\n",
676
+ " <th></th>\n",
677
+ " <th></th>\n",
678
+ " <th></th>\n",
679
+ " <th></th>\n",
680
+ " <th></th>\n",
681
+ " <th></th>\n",
682
+ " </tr>\n",
683
+ " </thead>\n",
684
+ " <tbody>\n",
685
+ " <tr>\n",
686
+ " <th rowspan=\"3\" valign=\"top\">Decision Tree</th>\n",
687
+ " <th>0.05</th>\n",
688
+ " <td>0.5818</td>\n",
689
+ " <td>0.5893</td>\n",
690
+ " <td>0.3789</td>\n",
691
+ " <td>0.5194</td>\n",
692
+ " <td>0.5116</td>\n",
693
+ " <td>0.5162</td>\n",
694
+ " <td>0.5893</td>\n",
695
+ " </tr>\n",
696
+ " <tr>\n",
697
+ " <th>0.10</th>\n",
698
+ " <td>0.4522</td>\n",
699
+ " <td>0.5831</td>\n",
700
+ " <td>0.6172</td>\n",
701
+ " <td>0.5217</td>\n",
702
+ " <td>0.3843</td>\n",
703
+ " <td>0.5117</td>\n",
704
+ " <td>0.6172</td>\n",
705
+ " </tr>\n",
706
+ " <tr>\n",
707
+ " <th>without</th>\n",
708
+ " <td>0.6358</td>\n",
709
+ " <td>0.6219</td>\n",
710
+ " <td>0.6320</td>\n",
711
+ " <td>0.5520</td>\n",
712
+ " <td>0.6273</td>\n",
713
+ " <td>0.6138</td>\n",
714
+ " <td>0.6358</td>\n",
715
+ " </tr>\n",
716
+ " <tr>\n",
717
+ " <th rowspan=\"3\" valign=\"top\">Logistic Regression</th>\n",
718
+ " <th>0.05</th>\n",
719
+ " <td>0.5910</td>\n",
720
+ " <td>0.5839</td>\n",
721
+ " <td>0.6398</td>\n",
722
+ " <td>0.7174</td>\n",
723
+ " <td>0.5895</td>\n",
724
+ " <td>0.6243</td>\n",
725
+ " <td>0.7174</td>\n",
726
+ " </tr>\n",
727
+ " <tr>\n",
728
+ " <th>0.10</th>\n",
729
+ " <td>0.6080</td>\n",
730
+ " <td>0.4022</td>\n",
731
+ " <td>0.4379</td>\n",
732
+ " <td>0.4534</td>\n",
733
+ " <td>0.6188</td>\n",
734
+ " <td>0.5041</td>\n",
735
+ " <td>0.6188</td>\n",
736
+ " </tr>\n",
737
+ " <tr>\n",
738
+ " <th>without</th>\n",
739
+ " <td>0.5664</td>\n",
740
+ " <td>0.5916</td>\n",
741
+ " <td>0.6429</td>\n",
742
+ " <td>0.7042</td>\n",
743
+ " <td>0.6111</td>\n",
744
+ " <td>0.6232</td>\n",
745
+ " <td>0.7042</td>\n",
746
+ " </tr>\n",
747
+ " <tr>\n",
748
+ " <th rowspan=\"3\" valign=\"top\">Random Forest</th>\n",
749
+ " <th>0.05</th>\n",
750
+ " <td>0.4290</td>\n",
751
+ " <td>0.5901</td>\n",
752
+ " <td>0.6444</td>\n",
753
+ " <td>0.6211</td>\n",
754
+ " <td>0.5895</td>\n",
755
+ " <td>0.5748</td>\n",
756
+ " <td>0.6444</td>\n",
757
+ " </tr>\n",
758
+ " <tr>\n",
759
+ " <th>0.10</th>\n",
760
+ " <td>0.3318</td>\n",
761
+ " <td>0.4720</td>\n",
762
+ " <td>0.5124</td>\n",
763
+ " <td>0.5978</td>\n",
764
+ " <td>0.5216</td>\n",
765
+ " <td>0.4871</td>\n",
766
+ " <td>0.5978</td>\n",
767
+ " </tr>\n",
768
+ " <tr>\n",
769
+ " <th>without</th>\n",
770
+ " <td>0.5409</td>\n",
771
+ " <td>0.6002</td>\n",
772
+ " <td>0.6188</td>\n",
773
+ " <td>0.6064</td>\n",
774
+ " <td>0.5895</td>\n",
775
+ " <td>0.5911</td>\n",
776
+ " <td>0.6188</td>\n",
777
+ " </tr>\n",
778
+ " <tr>\n",
779
+ " <th rowspan=\"3\" valign=\"top\">SVM</th>\n",
780
+ " <th>0.05</th>\n",
781
+ " <td>0.3565</td>\n",
782
+ " <td>0.4340</td>\n",
783
+ " <td>0.4332</td>\n",
784
+ " <td>0.3750</td>\n",
785
+ " <td>0.5185</td>\n",
786
+ " <td>0.4234</td>\n",
787
+ " <td>0.5185</td>\n",
788
+ " </tr>\n",
789
+ " <tr>\n",
790
+ " <th>0.10</th>\n",
791
+ " <td>0.4329</td>\n",
792
+ " <td>0.4775</td>\n",
793
+ " <td>0.4526</td>\n",
794
+ " <td>0.4689</td>\n",
795
+ " <td>0.5725</td>\n",
796
+ " <td>0.4809</td>\n",
797
+ " <td>0.5725</td>\n",
798
+ " </tr>\n",
799
+ " <tr>\n",
800
+ " <th>without</th>\n",
801
+ " <td>0.5664</td>\n",
802
+ " <td>0.5730</td>\n",
803
+ " <td>0.3571</td>\n",
804
+ " <td>0.7042</td>\n",
805
+ " <td>0.6111</td>\n",
806
+ " <td>0.5624</td>\n",
807
+ " <td>0.7042</td>\n",
808
+ " </tr>\n",
809
+ " <tr>\n",
810
+ " <th rowspan=\"3\" valign=\"top\">XGBoost</th>\n",
811
+ " <th>0.05</th>\n",
812
+ " <td>0.4985</td>\n",
813
+ " <td>0.5481</td>\n",
814
+ " <td>0.5963</td>\n",
815
+ " <td>0.5342</td>\n",
816
+ " <td>0.5818</td>\n",
817
+ " <td>0.5518</td>\n",
818
+ " <td>0.5963</td>\n",
819
+ " </tr>\n",
820
+ " <tr>\n",
821
+ " <th>0.10</th>\n",
822
+ " <td>0.5139</td>\n",
823
+ " <td>0.5124</td>\n",
824
+ " <td>0.5404</td>\n",
825
+ " <td>0.5870</td>\n",
826
+ " <td>0.4429</td>\n",
827
+ " <td>0.5193</td>\n",
828
+ " <td>0.5870</td>\n",
829
+ " </tr>\n",
830
+ " <tr>\n",
831
+ " <th>without</th>\n",
832
+ " <td>0.6728</td>\n",
833
+ " <td>0.6180</td>\n",
834
+ " <td>0.6071</td>\n",
835
+ " <td>0.6025</td>\n",
836
+ " <td>0.6088</td>\n",
837
+ " <td>0.6219</td>\n",
838
+ " <td>0.6728</td>\n",
839
+ " </tr>\n",
840
+ " </tbody>\n",
841
+ "</table>\n",
842
+ "</div>"
843
+ ],
844
+ "text/plain": [
845
+ "Split 1 2 3 4 5 Avg \\\n",
846
+ "Model Scenario \n",
847
+ "Decision Tree 0.05 0.5818 0.5893 0.3789 0.5194 0.5116 0.5162 \n",
848
+ " 0.10 0.4522 0.5831 0.6172 0.5217 0.3843 0.5117 \n",
849
+ " without 0.6358 0.6219 0.6320 0.5520 0.6273 0.6138 \n",
850
+ "Logistic Regression 0.05 0.5910 0.5839 0.6398 0.7174 0.5895 0.6243 \n",
851
+ " 0.10 0.6080 0.4022 0.4379 0.4534 0.6188 0.5041 \n",
852
+ " without 0.5664 0.5916 0.6429 0.7042 0.6111 0.6232 \n",
853
+ "Random Forest 0.05 0.4290 0.5901 0.6444 0.6211 0.5895 0.5748 \n",
854
+ " 0.10 0.3318 0.4720 0.5124 0.5978 0.5216 0.4871 \n",
855
+ " without 0.5409 0.6002 0.6188 0.6064 0.5895 0.5911 \n",
856
+ "SVM 0.05 0.3565 0.4340 0.4332 0.3750 0.5185 0.4234 \n",
857
+ " 0.10 0.4329 0.4775 0.4526 0.4689 0.5725 0.4809 \n",
858
+ " without 0.5664 0.5730 0.3571 0.7042 0.6111 0.5624 \n",
859
+ "XGBoost 0.05 0.4985 0.5481 0.5963 0.5342 0.5818 0.5518 \n",
860
+ " 0.10 0.5139 0.5124 0.5404 0.5870 0.4429 0.5193 \n",
861
+ " without 0.6728 0.6180 0.6071 0.6025 0.6088 0.6219 \n",
862
+ "\n",
863
+ "Split Max \n",
864
+ "Model Scenario \n",
865
+ "Decision Tree 0.05 0.5893 \n",
866
+ " 0.10 0.6172 \n",
867
+ " without 0.6358 \n",
868
+ "Logistic Regression 0.05 0.7174 \n",
869
+ " 0.10 0.6188 \n",
870
+ " without 0.7042 \n",
871
+ "Random Forest 0.05 0.6444 \n",
872
+ " 0.10 0.5978 \n",
873
+ " without 0.6188 \n",
874
+ "SVM 0.05 0.5185 \n",
875
+ " 0.10 0.5725 \n",
876
+ " without 0.7042 \n",
877
+ "XGBoost 0.05 0.5963 \n",
878
+ " 0.10 0.5870 \n",
879
+ " without 0.6728 "
880
+ ]
881
+ },
882
+ "metadata": {},
883
+ "output_type": "display_data"
884
+ }
885
+ ],
886
+ "source": [
887
+ "# ================================================================\n",
888
+ "# Direction-of-Move Classification – full pipeline (nested CV)\n",
889
+ "# (MONTHLY version: all CSVs use a “Month” column in YYYY-MM format)\n",
890
+ "# • HMM & LSTM removed, XGBoost retained\n",
891
+ "# • Feature standardisation before model training\n",
892
+ "# • Nested TimeSeriesSplit for hyper-parameter tuning\n",
893
+ "# • Accuracy, AUC, F1 tables\n",
894
+ "# ================================================================\n",
895
+ "import pathlib, warnings, numpy as np, pandas as pd\n",
896
+ "from statsmodels.tsa.stattools import adfuller, coint, grangercausalitytests\n",
897
+ "from sklearn.model_selection import GridSearchCV, TimeSeriesSplit\n",
898
+ "from sklearn.preprocessing import StandardScaler\n",
899
+ "from sklearn.linear_model import LogisticRegression\n",
900
+ "from sklearn.tree import DecisionTreeClassifier\n",
901
+ "from sklearn.ensemble import RandomForestClassifier\n",
902
+ "from sklearn.svm import SVC\n",
903
+ "import xgboost as xgb\n",
904
+ "from sklearn.metrics import accuracy_score, f1_score, roc_auc_score\n",
905
+ "\n",
906
+ "warnings.filterwarnings(\"ignore\")\n",
907
+ "pd.set_option(\"display.float_format\", \"{:,.4f}\".format)\n",
908
+ "np.random.seed(42)\n",
909
+ "\n",
910
+ "# ─────────────── 1│ data (monthly) ──────────────────────────────\n",
911
+ "ROOT = pathlib.Path(\".\")\n",
912
+ "\n",
913
+ "def load_copper():\n",
914
+ "    return (pd.read_csv(ROOT / \"Copper Prices.csv\")\n",
915
+ "            .assign(Month=lambda d: pd.to_datetime(d[\"Month\"], format=\"%Y-%m\"))\n",
916
+ "            .set_index(\"Month\")  # keep month-start stamps\n",
917
+ "            .asfreq(\"MS\")  # align to month-start\n",
918
+ "            .rename(columns={\"Price\": \"Copper_Price\"})[\"Copper_Price\"])\n",
919
+ "\n",
920
+ "def load_trends():\n",
921
+ "    def one(folder):\n",
922
+ "        frames = []\n",
923
+ "        for fp in (ROOT / folder).glob(\"*.csv\"):\n",
924
+ "            key = fp.stem.replace(\",\", \"\")\n",
925
+ "            t = pd.read_csv(fp)\n",
926
+ "            t.columns = [c.strip() for c in t.columns]  # trim spaces\n",
927
+ "            frames.append(\n",
928
+ "                t.assign(Month=lambda d: pd.to_datetime(d[t.columns[0]], format=\"%Y-%m\"))\n",
929
+ "                 .set_index(\"Month\").asfreq(\"MS\")\n",
930
+ "                 .rename(columns={t.columns[1]: key})\n",
931
+ "            )\n",
932
+ "        return pd.concat(frames, axis=1)\n",
933
+ "    cats = [\"Supply Factors\", \"Demand Factors\",\n",
934
+ "            \"Speculative Factors\", \"Sudden Factors\"]\n",
935
+ "    return pd.concat([one(c) for c in cats], axis=1).sort_index()\n",
936
+ "\n",
937
+ "copper, trends = load_copper(), load_trends()\n",
938
+ "data_raw = pd.concat([copper, trends], axis=1).dropna()\n",
939
+ "\n",
940
+ "# ─────────────── 2│ statistical filters ─────────────────────────\n",
941
+ "def adf_p(s, min_obs=12):\n",
942
+ "    x = s.dropna()\n",
943
+ "    if len(x) < min_obs or x.nunique() < 2:\n",
944
+ "        return np.nan  # flag unusable series\n",
945
+ "    return adfuller(x, autolag=\"AIC\")[1]\n",
946
+ "\n",
947
+ "ADF, COINT, MAX_LAG = 0.01, 0.5, 12\n",
948
+ "\n",
949
+ "i1 = [c for c in data_raw.columns\n",
950
+ " if not np.isnan(p0 := adf_p(data_raw[c])) and p0 > ADF\n",
950
+ " and not np.isnan(p1 := adf_p(data_raw[c].diff())) and p1 < ADF\n",
952
+ " and c != \"Copper_Price\"]\n",
953
+ "\n",
954
+ "cands = [s for s in i1\n",
955
+ " if coint(data_raw[\"Copper_Price\"], data_raw[s])[1] < COINT]\n",
956
+ "\n",
957
+ "minp = {s: min(grangercausalitytests(\n",
958
+ " data_raw[[\"Copper_Price\", s]].dropna().values,\n",
959
+ " maxlag=MAX_LAG, verbose=False)[lag][0][\"ssr_ftest\"][1]\n",
960
+ " for lag in range(1, MAX_LAG + 1))\n",
961
+ " for s in cands}\n",
962
+ "\n",
963
+ "TIERS = {0.05: [s for s,p in minp.items() if p < 0.05],\n",
964
+ " 0.10: [s for s,p in minp.items() if p < 0.10]}\n",
965
+ "\n",
966
+ "def lag_df(feats, lag=1):\n",
967
+ " out = {\"Copper_Price\": data_raw[\"Copper_Price\"],\n",
968
+ " f\"Copper_Price_lag{lag}\": data_raw[\"Copper_Price\"].shift(lag)}\n",
969
+ " out.update({f\"{f}_lag{lag}\": data_raw[f].shift(lag) for f in feats})\n",
970
+ " return pd.DataFrame(out).dropna()\n",
971
+ "\n",
972
+ "SCENS = {\"without\": lag_df([]),\n",
973
+ " \"0.05\" : lag_df(TIERS[0.05]),\n",
974
+ " \"0.10\" : lag_df(TIERS[0.10])}\n",
975
+ "for k in SCENS:\n",
976
+ " SCENS[k][\"y\"] = (SCENS[k][\"Copper_Price\"].diff().shift(-1) > 0).astype(int)\n",
977
+ " SCENS[k] = SCENS[k].iloc[:-1].copy() # drop last row: its next-month move is unknown, and NaN > 0 would mislabel it as 0\n",
978
+ "\n",
979
+ "# ─────────────── 3│ label-distribution table ────────────────────\n",
980
+ "df_ref, n = SCENS[\"without\"], len(SCENS[\"without\"])\n",
981
+ "TEST_FRAC = 0.20\n",
982
+ "test_len = int(n * TEST_FRAC)\n",
983
+ "\n",
984
+ "rows = []\n",
985
+ "for i in range(5):\n",
986
+ " train_end = int(n * (0.80 + i*0.05))\n",
987
+ " tr, te = slice(0, train_end - test_len), slice(train_end - test_len, train_end)\n",
988
+ " y_tr, y_te = df_ref[\"y\"].iloc[tr], df_ref[\"y\"].iloc[te]\n",
989
+ " c_tr = y_tr.value_counts().reindex([0,1]).fillna(0).astype(int)\n",
990
+ " c_te = y_te.value_counts().reindex([0,1]).fillna(0).astype(int)\n",
991
+ " rows.append([i+1,\n",
992
+ " c_tr[0], c_tr[1], c_tr[0]/c_tr.sum()*100, c_tr[1]/c_tr.sum()*100,\n",
993
+ " c_te[0], c_te[1], c_te[0]/c_te.sum()*100, c_te[1]/c_te.sum()*100])\n",
994
+ "\n",
995
+ "label_dist = (pd.DataFrame(rows, columns=[\"Split\",\"Train 0\",\"Train 1\",\"Train 0 %\",\"Train 1 %\",\n",
996
+ " \"Test 0\",\"Test 1\",\"Test 0 %\",\"Test 1 %\"])\n",
997
+ " .set_index(\"Split\")\n",
998
+ " .applymap(lambda x: f\"{x:.1f}%\" if isinstance(x,float) else x))\n",
999
+ "print(\"\\n── Label distribution across five splits ──\")\n",
1000
+ "display(label_dist)\n",
1001
+ "\n",
1002
+ "# ─────────────── 4│ model grids (unchanged) ─────────────────────\n",
1003
+ "GRIDS = {\n",
1004
+ " \"XGBoost\":[{\"n_estimators\":[400,600],\"max_depth\":[3,5],\n",
1005
+ " \"learning_rate\":[0.03,0.07],\"subsample\":[0.8,1.0]}],\n",
1006
+ " \"Logistic Regression\":[{\"C\":[0.1,1,10]}],\n",
1007
+ " \"Decision Tree\":[{\"max_depth\":[3,5,8],\"min_samples_leaf\":[2,4,6]}],\n",
1008
+ " \"Random Forest\":[{\"n_estimators\":[300,500],\"max_depth\":[4,6],\n",
1009
+ " \"min_samples_leaf\":[3,5]}],\n",
1010
+ " \"SVM\":[{\"C\":[0.1,1,10],\"gamma\":[0.01,0.1]}],\n",
1011
+ "}\n",
1012
+ "\n",
1013
+ "# ─────────────── 5│ expanding-window splits ─────────────────────\n",
1014
+ "def expanding_splits(n_rows, test_frac=TEST_FRAC, n_splits=5):\n",
1015
+ " t_len = int(n_rows * test_frac)\n",
1016
+ " for i in range(n_splits):\n",
1017
+ " end = int(n_rows * (0.80 + i*0.05))\n",
1018
+ " yield np.arange(end - t_len), np.arange(end - t_len, end)\n",
1019
+ "\n",
1020
+ "INNER_CV = TimeSeriesSplit(n_splits=4)\n",
1021
+ "records = []\n",
1022
+ "\n",
1023
+ "# ─────────────── 6│ nested-CV loop ──────────────────────────────\n",
1024
+ "for scen, df in SCENS.items():\n",
1025
+ " X_full, y_full = df.drop(columns=[\"Copper_Price\",\"y\"]), df[\"y\"]\n",
1026
+ " n = len(X_full)\n",
1027
+ "\n",
1028
+ " for split_idx, (tr_idx, te_idx) in enumerate(expanding_splits(n), 1):\n",
1029
+ " X_tr_raw, y_tr = X_full.iloc[tr_idx], y_full.iloc[tr_idx]\n",
1030
+ " X_te_raw, y_te = X_full.iloc[te_idx], y_full.iloc[te_idx]\n",
1031
+ "\n",
1032
+ " scaler = StandardScaler().fit(X_tr_raw)\n",
1033
+ " X_tr = pd.DataFrame(scaler.transform(X_tr_raw), columns=X_tr_raw.columns, index=X_tr_raw.index)\n",
1034
+ " X_te = pd.DataFrame(scaler.transform(X_te_raw), columns=X_te_raw.columns, index=X_te_raw.index)\n",
1035
+ "\n",
1036
+ " counts = y_tr.value_counts()\n",
1037
+ "\n",
1038
+ " for mname, grid in GRIDS.items():\n",
1039
+ " if mname == \"Logistic Regression\":\n",
1040
+ " base = LogisticRegression(max_iter=1000, class_weight='balanced')\n",
1041
+ " elif mname == \"Decision Tree\":\n",
1042
+ " base = DecisionTreeClassifier(random_state=42, class_weight='balanced')\n",
1043
+ " elif mname == \"Random Forest\":\n",
1044
+ " base = RandomForestClassifier(random_state=42, class_weight='balanced')\n",
1045
+ " elif mname == \"SVM\":\n",
1046
+ " base = SVC(kernel=\"rbf\", probability=True, class_weight='balanced', random_state=42)\n",
1047
+ " elif mname == \"XGBoost\":\n",
1048
+ " spw = counts.get(0,1) / counts.get(1,1) if len(counts)==2 else 1\n",
1049
+ " base = xgb.XGBClassifier(random_state=42,\n",
1050
+ " objective=\"binary:logistic\",\n",
1051
+ " eval_metric=\"logloss\",\n",
1052
+ " use_label_encoder=False,\n",
1053
+ " scale_pos_weight=spw)\n",
1054
+ "\n",
1055
+ " best = (GridSearchCV(base, grid, cv=INNER_CV,\n",
1056
+ " scoring=\"accuracy\", n_jobs=-1)\n",
1057
+ " .fit(X_tr, y_tr)\n",
1058
+ " .best_estimator_)\n",
1059
+ "\n",
1060
+ " y_hat = best.predict(X_te)\n",
1061
+ " proba = (best.predict_proba(X_te)[:,1]\n",
1062
+ " if hasattr(best, \"predict_proba\") else None)\n",
1063
+ "\n",
1064
+ " acc = accuracy_score(y_te, y_hat)\n",
1065
+ " f1 = f1_score(y_te, y_hat, zero_division=0)\n",
1066
+ " auc = (roc_auc_score(y_te, proba)\n",
1067
+ " if proba is not None and len(np.unique(y_te))==2 else np.nan)\n",
1068
+ "\n",
1069
+ " records.append({\"Model\":mname,\"Scenario\":scen,\"Split\":split_idx,\n",
1070
+ " \"Accuracy\":acc,\"F1\":f1,\"AUC\":auc})\n",
1071
+ "\n",
1072
+ "# ─────────────── 7│ summary tables ──────────────────────────────\n",
1073
+ "tbl = pd.DataFrame(records)\n",
1074
+ "\n",
1075
+ "def metric_tbl(metric, fmt):\n",
1076
+ " piv = tbl.pivot_table(index=[\"Model\",\"Scenario\"], columns=\"Split\", values=metric)\n",
1077
+ " piv[\"Avg\"] = piv.mean(axis=1)\n",
1078
+ " piv[\"Max\"] = piv[[1,2,3,4,5]].max(axis=1)\n",
1079
+ " return piv.applymap(fmt)\n",
1080
+ "\n",
1081
+ "pct = lambda x: f\"{x:.2%}\"\n",
1082
+ "auc_fmt = lambda x: f\"{x:.4f}\"\n",
1083
+ "\n",
1084
+ "print(\"\\n── Accuracy per split (plus Avg & Max) ──\")\n",
1085
+ "display(metric_tbl(\"Accuracy\", pct))\n",
1086
+ "print(\"\\n── F1-score per split (plus Avg & Max) ──\")\n",
1087
+ "display(metric_tbl(\"F1\", pct))\n",
1088
+ "print(\"\\n── AUC per split (plus Avg & Max) ──\")\n",
1089
+ "display(metric_tbl(\"AUC\", auc_fmt))\n"
1090
+ ]
1091
+ },
1092
+ {
1093
+ "cell_type": "code",
1094
+ "execution_count": 6,
1095
+ "id": "c36800df",
1096
+ "metadata": {},
1097
+ "outputs": [
1098
+ {
1099
+ "name": "stdout",
1100
+ "output_type": "stream",
1101
+ "text": [
1102
+ "\n",
1103
+ "── Label distribution across five splits ──\n"
1104
+ ]
1105
+ },
1106
+ {
1107
+ "data": {
1108
+ "text/html": [
1109
+ "<div>\n",
1110
+ "<style scoped>\n",
1111
+ " .dataframe tbody tr th:only-of-type {\n",
1112
+ " vertical-align: middle;\n",
1113
+ " }\n",
1114
+ "\n",
1115
+ " .dataframe tbody tr th {\n",
1116
+ " vertical-align: top;\n",
1117
+ " }\n",
1118
+ "\n",
1119
+ " .dataframe thead th {\n",
1120
+ " text-align: right;\n",
1121
+ " }\n",
1122
+ "</style>\n",
1123
+ "<table border=\"1\" class=\"dataframe\">\n",
1124
+ " <thead>\n",
1125
+ " <tr style=\"text-align: right;\">\n",
1126
+ " <th></th>\n",
1127
+ " <th>Train 0</th>\n",
1128
+ " <th>Train 1</th>\n",
1129
+ " <th>Train 0 %</th>\n",
1130
+ " <th>Train 1 %</th>\n",
1131
+ " <th>Test 0</th>\n",
1132
+ " <th>Test 1</th>\n",
1133
+ " <th>Test 0 %</th>\n",
1134
+ " <th>Test 1 %</th>\n",
1135
+ " </tr>\n",
1136
+ " <tr>\n",
1137
+ " <th>Split</th>\n",
1138
+ " <th></th>\n",
1139
+ " <th></th>\n",
1140
+ " <th></th>\n",
1141
+ " <th></th>\n",
1142
+ " <th></th>\n",
1143
+ " <th></th>\n",
1144
+ " <th></th>\n",
1145
+ " <th></th>\n",
1146
+ " </tr>\n",
1147
+ " </thead>\n",
1148
+ " <tbody>\n",
1149
+ " <tr>\n",
1150
+ " <th>1</th>\n",
1151
+ " <td>67</td>\n",
1152
+ " <td>82</td>\n",
1153
+ " <td>45.0%</td>\n",
1154
+ " <td>55.0%</td>\n",
1155
+ " <td>26</td>\n",
1156
+ " <td>30</td>\n",
1157
+ " <td>46.4%</td>\n",
1158
+ " <td>53.6%</td>\n",
1159
+ " </tr>\n",
1160
+ " <tr>\n",
1161
+ " <th>2</th>\n",
1162
+ " <td>74</td>\n",
1163
+ " <td>88</td>\n",
1164
+ " <td>45.7%</td>\n",
1165
+ " <td>54.3%</td>\n",
1166
+ " <td>26</td>\n",
1167
+ " <td>30</td>\n",
1168
+ " <td>46.4%</td>\n",
1169
+ " <td>53.6%</td>\n",
1170
+ " </tr>\n",
1171
+ " <tr>\n",
1172
+ " <th>3</th>\n",
1173
+ " <td>83</td>\n",
1174
+ " <td>92</td>\n",
1175
+ " <td>47.4%</td>\n",
1176
+ " <td>52.6%</td>\n",
1177
+ " <td>25</td>\n",
1178
+ " <td>31</td>\n",
1179
+ " <td>44.6%</td>\n",
1180
+ " <td>55.4%</td>\n",
1181
+ " </tr>\n",
1182
+ " <tr>\n",
1183
+ " <th>4</th>\n",
1184
+ " <td>89</td>\n",
1185
+ " <td>99</td>\n",
1186
+ " <td>47.3%</td>\n",
1187
+ " <td>52.7%</td>\n",
1188
+ " <td>25</td>\n",
1189
+ " <td>31</td>\n",
1190
+ " <td>44.6%</td>\n",
1191
+ " <td>55.4%</td>\n",
1192
+ " </tr>\n",
1193
+ " <tr>\n",
1194
+ " <th>5</th>\n",
1195
+ " <td>92</td>\n",
1196
+ " <td>109</td>\n",
1197
+ " <td>45.8%</td>\n",
1198
+ " <td>54.2%</td>\n",
1199
+ " <td>28</td>\n",
1200
+ " <td>28</td>\n",
1201
+ " <td>50.0%</td>\n",
1202
+ " <td>50.0%</td>\n",
1203
+ " </tr>\n",
1204
+ " </tbody>\n",
1205
+ "</table>\n",
1206
+ "</div>"
1207
+ ],
1208
+ "text/plain": [
1209
+ " Train 0 Train 1 Train 0 % Train 1 % Test 0 Test 1 Test 0 % Test 1 %\n",
1210
+ "Split \n",
1211
+ "1 67 82 45.0% 55.0% 26 30 46.4% 53.6%\n",
1212
+ "2 74 88 45.7% 54.3% 26 30 46.4% 53.6%\n",
1213
+ "3 83 92 47.4% 52.6% 25 31 44.6% 55.4%\n",
1214
+ "4 89 99 47.3% 52.7% 25 31 44.6% 55.4%\n",
1215
+ "5 92 109 45.8% 54.2% 28 28 50.0% 50.0%"
1216
+ ]
1217
+ },
1218
+ "metadata": {},
1219
+ "output_type": "display_data"
1220
+ },
1221
+ {
1222
+ "name": "stdout",
1223
+ "output_type": "stream",
1224
+ "text": [
1225
+ "\n",
1226
+ "── Accuracy per split (plus Avg & Max) ──\n"
1227
+ ]
1228
+ },
1229
+ {
1230
+ "data": {
1231
+ "text/html": [
1232
+ "<div>\n",
1233
+ "<style scoped>\n",
1234
+ " .dataframe tbody tr th:only-of-type {\n",
1235
+ " vertical-align: middle;\n",
1236
+ " }\n",
1237
+ "\n",
1238
+ " .dataframe tbody tr th {\n",
1239
+ " vertical-align: top;\n",
1240
+ " }\n",
1241
+ "\n",
1242
+ " .dataframe thead th {\n",
1243
+ " text-align: right;\n",
1244
+ " }\n",
1245
+ "</style>\n",
1246
+ "<table border=\"1\" class=\"dataframe\">\n",
1247
+ " <thead>\n",
1248
+ " <tr style=\"text-align: right;\">\n",
1249
+ " <th></th>\n",
1250
+ " <th>Split</th>\n",
1251
+ " <th>1</th>\n",
1252
+ " <th>2</th>\n",
1253
+ " <th>3</th>\n",
1254
+ " <th>4</th>\n",
1255
+ " <th>5</th>\n",
1256
+ " <th>Avg</th>\n",
1257
+ " <th>Max</th>\n",
1258
+ " </tr>\n",
1259
+ " <tr>\n",
1260
+ " <th>Model</th>\n",
1261
+ " <th>Scenario</th>\n",
1262
+ " <th></th>\n",
1263
+ " <th></th>\n",
1264
+ " <th></th>\n",
1265
+ " <th></th>\n",
1266
+ " <th></th>\n",
1267
+ " <th></th>\n",
1268
+ " <th></th>\n",
1269
+ " </tr>\n",
1270
+ " </thead>\n",
1271
+ " <tbody>\n",
1272
+ " <tr>\n",
1273
+ " <th rowspan=\"3\" valign=\"top\">Decision Tree</th>\n",
1274
+ " <th>0.05</th>\n",
1275
+ " <td>48.21%</td>\n",
1276
+ " <td>51.79%</td>\n",
1277
+ " <td>62.50%</td>\n",
1278
+ " <td>60.71%</td>\n",
1279
+ " <td>50.00%</td>\n",
1280
+ " <td>54.64%</td>\n",
1281
+ " <td>62.50%</td>\n",
1282
+ " </tr>\n",
1283
+ " <tr>\n",
1284
+ " <th>0.10</th>\n",
1285
+ " <td>39.29%</td>\n",
1286
+ " <td>50.00%</td>\n",
1287
+ " <td>50.00%</td>\n",
1288
+ " <td>51.79%</td>\n",
1289
+ " <td>48.21%</td>\n",
1290
+ " <td>47.86%</td>\n",
1291
+ " <td>51.79%</td>\n",
1292
+ " </tr>\n",
1293
+ " <tr>\n",
1294
+ " <th>without</th>\n",
1295
+ " <td>60.71%</td>\n",
1296
+ " <td>64.29%</td>\n",
1297
+ " <td>55.36%</td>\n",
1298
+ " <td>55.36%</td>\n",
1299
+ " <td>50.00%</td>\n",
1300
+ " <td>57.14%</td>\n",
1301
+ " <td>64.29%</td>\n",
1302
+ " </tr>\n",
1303
+ " <tr>\n",
1304
+ " <th rowspan=\"3\" valign=\"top\">Logistic Regression</th>\n",
1305
+ " <th>0.05</th>\n",
1306
+ " <td>50.00%</td>\n",
1307
+ " <td>46.43%</td>\n",
1308
+ " <td>46.43%</td>\n",
1309
+ " <td>50.00%</td>\n",
1310
+ " <td>51.79%</td>\n",
1311
+ " <td>48.93%</td>\n",
1312
+ " <td>51.79%</td>\n",
1313
+ " </tr>\n",
1314
+ " <tr>\n",
1315
+ " <th>0.10</th>\n",
1316
+ " <td>44.64%</td>\n",
1317
+ " <td>50.00%</td>\n",
1318
+ " <td>51.79%</td>\n",
1319
+ " <td>53.57%</td>\n",
1320
+ " <td>51.79%</td>\n",
1321
+ " <td>50.36%</td>\n",
1322
+ " <td>53.57%</td>\n",
1323
+ " </tr>\n",
1324
+ " <tr>\n",
1325
+ " <th>without</th>\n",
1326
+ " <td>55.36%</td>\n",
1327
+ " <td>58.93%</td>\n",
1328
+ " <td>55.36%</td>\n",
1329
+ " <td>53.57%</td>\n",
1330
+ " <td>50.00%</td>\n",
1331
+ " <td>54.64%</td>\n",
1332
+ " <td>58.93%</td>\n",
1333
+ " </tr>\n",
1334
+ " <tr>\n",
1335
+ " <th rowspan=\"3\" valign=\"top\">Random Forest</th>\n",
1336
+ " <th>0.05</th>\n",
1337
+ " <td>41.07%</td>\n",
1338
+ " <td>42.86%</td>\n",
1339
+ " <td>44.64%</td>\n",
1340
+ " <td>53.57%</td>\n",
1341
+ " <td>50.00%</td>\n",
1342
+ " <td>46.43%</td>\n",
1343
+ " <td>53.57%</td>\n",
1344
+ " </tr>\n",
1345
+ " <tr>\n",
1346
+ " <th>0.10</th>\n",
1347
+ " <td>44.64%</td>\n",
1348
+ " <td>51.79%</td>\n",
1349
+ " <td>50.00%</td>\n",
1350
+ " <td>55.36%</td>\n",
1351
+ " <td>50.00%</td>\n",
1352
+ " <td>50.36%</td>\n",
1353
+ " <td>55.36%</td>\n",
1354
+ " </tr>\n",
1355
+ " <tr>\n",
1356
+ " <th>without</th>\n",
1357
+ " <td>55.36%</td>\n",
1358
+ " <td>57.14%</td>\n",
1359
+ " <td>60.71%</td>\n",
1360
+ " <td>58.93%</td>\n",
1361
+ " <td>58.93%</td>\n",
1362
+ " <td>58.21%</td>\n",
1363
+ " <td>60.71%</td>\n",
1364
+ " </tr>\n",
1365
+ " <tr>\n",
1366
+ " <th rowspan=\"3\" valign=\"top\">SVM</th>\n",
1367
+ " <th>0.05</th>\n",
1368
+ " <td>53.57%</td>\n",
1369
+ " <td>48.21%</td>\n",
1370
+ " <td>46.43%</td>\n",
1371
+ " <td>48.21%</td>\n",
1372
+ " <td>53.57%</td>\n",
1373
+ " <td>50.00%</td>\n",
1374
+ " <td>53.57%</td>\n",
1375
+ " </tr>\n",
1376
+ " <tr>\n",
1377
+ " <th>0.10</th>\n",
1378
+ " <td>53.57%</td>\n",
1379
+ " <td>55.36%</td>\n",
1380
+ " <td>53.57%</td>\n",
1381
+ " <td>57.14%</td>\n",
1382
+ " <td>50.00%</td>\n",
1383
+ " <td>53.93%</td>\n",
1384
+ " <td>57.14%</td>\n",
1385
+ " </tr>\n",
1386
+ " <tr>\n",
1387
+ " <th>without</th>\n",
1388
+ " <td>53.57%</td>\n",
1389
+ " <td>55.36%</td>\n",
1390
+ " <td>53.57%</td>\n",
1391
+ " <td>50.00%</td>\n",
1392
+ " <td>50.00%</td>\n",
1393
+ " <td>52.50%</td>\n",
1394
+ " <td>55.36%</td>\n",
1395
+ " </tr>\n",
1396
+ " <tr>\n",
1397
+ " <th rowspan=\"3\" valign=\"top\">XGBoost</th>\n",
1398
+ " <th>0.05</th>\n",
1399
+ " <td>42.86%</td>\n",
1400
+ " <td>48.21%</td>\n",
1401
+ " <td>53.57%</td>\n",
1402
+ " <td>53.57%</td>\n",
1403
+ " <td>48.21%</td>\n",
1404
+ " <td>49.29%</td>\n",
1405
+ " <td>53.57%</td>\n",
1406
+ " </tr>\n",
1407
+ " <tr>\n",
1408
+ " <th>0.10</th>\n",
1409
+ " <td>50.00%</td>\n",
1410
+ " <td>51.79%</td>\n",
1411
+ " <td>51.79%</td>\n",
1412
+ " <td>53.57%</td>\n",
1413
+ " <td>53.57%</td>\n",
1414
+ " <td>52.14%</td>\n",
1415
+ " <td>53.57%</td>\n",
1416
+ " </tr>\n",
1417
+ " <tr>\n",
1418
+ " <th>without</th>\n",
1419
+ " <td>62.50%</td>\n",
1420
+ " <td>58.93%</td>\n",
1421
+ " <td>60.71%</td>\n",
1422
+ " <td>55.36%</td>\n",
1423
+ " <td>62.50%</td>\n",
1424
+ " <td>60.00%</td>\n",
1425
+ " <td>62.50%</td>\n",
1426
+ " </tr>\n",
1427
+ " </tbody>\n",
1428
+ "</table>\n",
1429
+ "</div>"
1430
+ ],
1431
+ "text/plain": [
1432
+ "Split 1 2 3 4 5 Avg \\\n",
1433
+ "Model Scenario \n",
1434
+ "Decision Tree 0.05 48.21% 51.79% 62.50% 60.71% 50.00% 54.64% \n",
1435
+ " 0.10 39.29% 50.00% 50.00% 51.79% 48.21% 47.86% \n",
1436
+ " without 60.71% 64.29% 55.36% 55.36% 50.00% 57.14% \n",
1437
+ "Logistic Regression 0.05 50.00% 46.43% 46.43% 50.00% 51.79% 48.93% \n",
1438
+ " 0.10 44.64% 50.00% 51.79% 53.57% 51.79% 50.36% \n",
1439
+ " without 55.36% 58.93% 55.36% 53.57% 50.00% 54.64% \n",
1440
+ "Random Forest 0.05 41.07% 42.86% 44.64% 53.57% 50.00% 46.43% \n",
1441
+ " 0.10 44.64% 51.79% 50.00% 55.36% 50.00% 50.36% \n",
1442
+ " without 55.36% 57.14% 60.71% 58.93% 58.93% 58.21% \n",
1443
+ "SVM 0.05 53.57% 48.21% 46.43% 48.21% 53.57% 50.00% \n",
1444
+ " 0.10 53.57% 55.36% 53.57% 57.14% 50.00% 53.93% \n",
1445
+ " without 53.57% 55.36% 53.57% 50.00% 50.00% 52.50% \n",
1446
+ "XGBoost 0.05 42.86% 48.21% 53.57% 53.57% 48.21% 49.29% \n",
1447
+ " 0.10 50.00% 51.79% 51.79% 53.57% 53.57% 52.14% \n",
1448
+ " without 62.50% 58.93% 60.71% 55.36% 62.50% 60.00% \n",
1449
+ "\n",
1450
+ "Split Max \n",
1451
+ "Model Scenario \n",
1452
+ "Decision Tree 0.05 62.50% \n",
1453
+ " 0.10 51.79% \n",
1454
+ " without 64.29% \n",
1455
+ "Logistic Regression 0.05 51.79% \n",
1456
+ " 0.10 53.57% \n",
1457
+ " without 58.93% \n",
1458
+ "Random Forest 0.05 53.57% \n",
1459
+ " 0.10 55.36% \n",
1460
+ " without 60.71% \n",
1461
+ "SVM 0.05 53.57% \n",
1462
+ " 0.10 57.14% \n",
1463
+ " without 55.36% \n",
1464
+ "XGBoost 0.05 53.57% \n",
1465
+ " 0.10 53.57% \n",
1466
+ " without 62.50% "
1467
+ ]
1468
+ },
1469
+ "metadata": {},
1470
+ "output_type": "display_data"
1471
+ },
1472
+ {
1473
+ "name": "stdout",
1474
+ "output_type": "stream",
1475
+ "text": [
1476
+ "\n",
1477
+ "── F1-score per split (plus Avg & Max) ──\n"
1478
+ ]
1479
+ },
1480
+ {
1481
+ "data": {
1482
+ "text/html": [
1483
+ "<div>\n",
1484
+ "<style scoped>\n",
1485
+ " .dataframe tbody tr th:only-of-type {\n",
1486
+ " vertical-align: middle;\n",
1487
+ " }\n",
1488
+ "\n",
1489
+ " .dataframe tbody tr th {\n",
1490
+ " vertical-align: top;\n",
1491
+ " }\n",
1492
+ "\n",
1493
+ " .dataframe thead th {\n",
1494
+ " text-align: right;\n",
1495
+ " }\n",
1496
+ "</style>\n",
1497
+ "<table border=\"1\" class=\"dataframe\">\n",
1498
+ " <thead>\n",
1499
+ " <tr style=\"text-align: right;\">\n",
1500
+ " <th></th>\n",
1501
+ " <th>Split</th>\n",
1502
+ " <th>1</th>\n",
1503
+ " <th>2</th>\n",
1504
+ " <th>3</th>\n",
1505
+ " <th>4</th>\n",
1506
+ " <th>5</th>\n",
1507
+ " <th>Avg</th>\n",
1508
+ " <th>Max</th>\n",
1509
+ " </tr>\n",
1510
+ " <tr>\n",
1511
+ " <th>Model</th>\n",
1512
+ " <th>Scenario</th>\n",
1513
+ " <th></th>\n",
1514
+ " <th></th>\n",
1515
+ " <th></th>\n",
1516
+ " <th></th>\n",
1517
+ " <th></th>\n",
1518
+ " <th></th>\n",
1519
+ " <th></th>\n",
1520
+ " </tr>\n",
1521
+ " </thead>\n",
1522
+ " <tbody>\n",
1523
+ " <tr>\n",
1524
+ " <th rowspan=\"3\" valign=\"top\">Decision Tree</th>\n",
1525
+ " <th>0.05</th>\n",
1526
+ " <td>57.97%</td>\n",
1527
+ " <td>49.06%</td>\n",
1528
+ " <td>64.41%</td>\n",
1529
+ " <td>52.17%</td>\n",
1530
+ " <td>66.67%</td>\n",
1531
+ " <td>58.05%</td>\n",
1532
+ " <td>66.67%</td>\n",
1533
+ " </tr>\n",
1534
+ " <tr>\n",
1535
+ " <th>0.10</th>\n",
1536
+ " <td>45.16%</td>\n",
1537
+ " <td>48.15%</td>\n",
1538
+ " <td>46.15%</td>\n",
1539
+ " <td>37.21%</td>\n",
1540
+ " <td>0.00%</td>\n",
1541
+ " <td>35.33%</td>\n",
1542
+ " <td>48.15%</td>\n",
1543
+ " </tr>\n",
1544
+ " <tr>\n",
1545
+ " <th>without</th>\n",
1546
+ " <td>60.71%</td>\n",
1547
+ " <td>65.52%</td>\n",
1548
+ " <td>44.44%</td>\n",
1549
+ " <td>48.98%</td>\n",
1550
+ " <td>0.00%</td>\n",
1551
+ " <td>43.93%</td>\n",
1552
+ " <td>65.52%</td>\n",
1553
+ " </tr>\n",
1554
+ " <tr>\n",
1555
+ " <th rowspan=\"3\" valign=\"top\">Logistic Regression</th>\n",
1556
+ " <th>0.05</th>\n",
1557
+ " <td>54.84%</td>\n",
1558
+ " <td>37.50%</td>\n",
1559
+ " <td>21.05%</td>\n",
1560
+ " <td>17.65%</td>\n",
1561
+ " <td>64.00%</td>\n",
1562
+ " <td>39.01%</td>\n",
1563
+ " <td>64.00%</td>\n",
1564
+ " </tr>\n",
1565
+ " <tr>\n",
1566
+ " <th>0.10</th>\n",
1567
+ " <td>57.53%</td>\n",
1568
+ " <td>54.84%</td>\n",
1569
+ " <td>42.55%</td>\n",
1570
+ " <td>35.00%</td>\n",
1571
+ " <td>61.97%</td>\n",
1572
+ " <td>50.38%</td>\n",
1573
+ " <td>61.97%</td>\n",
1574
+ " </tr>\n",
1575
+ " <tr>\n",
1576
+ " <th>without</th>\n",
1577
+ " <td>61.54%</td>\n",
1578
+ " <td>54.90%</td>\n",
1579
+ " <td>50.98%</td>\n",
1580
+ " <td>35.00%</td>\n",
1581
+ " <td>0.00%</td>\n",
1582
+ " <td>40.48%</td>\n",
1583
+ " <td>61.54%</td>\n",
1584
+ " </tr>\n",
1585
+ " <tr>\n",
1586
+ " <th rowspan=\"3\" valign=\"top\">Random Forest</th>\n",
1587
+ " <th>0.05</th>\n",
1588
+ " <td>44.07%</td>\n",
1589
+ " <td>38.46%</td>\n",
1590
+ " <td>20.51%</td>\n",
1591
+ " <td>27.78%</td>\n",
1592
+ " <td>65.85%</td>\n",
1593
+ " <td>39.33%</td>\n",
1594
+ " <td>65.85%</td>\n",
1595
+ " </tr>\n",
1596
+ " <tr>\n",
1597
+ " <th>0.10</th>\n",
1598
+ " <td>52.31%</td>\n",
1599
+ " <td>59.70%</td>\n",
1600
+ " <td>22.22%</td>\n",
1601
+ " <td>35.90%</td>\n",
1602
+ " <td>66.67%</td>\n",
1603
+ " <td>47.36%</td>\n",
1604
+ " <td>66.67%</td>\n",
1605
+ " </tr>\n",
1606
+ " <tr>\n",
1607
+ " <th>without</th>\n",
1608
+ " <td>62.69%</td>\n",
1609
+ " <td>60.00%</td>\n",
1610
+ " <td>59.26%</td>\n",
1611
+ " <td>59.65%</td>\n",
1612
+ " <td>51.06%</td>\n",
1613
+ " <td>58.53%</td>\n",
1614
+ " <td>62.69%</td>\n",
1615
+ " </tr>\n",
1616
+ " <tr>\n",
1617
+ " <th rowspan=\"3\" valign=\"top\">SVM</th>\n",
1618
+ " <th>0.05</th>\n",
1619
+ " <td>69.77%</td>\n",
1620
+ " <td>29.27%</td>\n",
1621
+ " <td>54.55%</td>\n",
1622
+ " <td>50.85%</td>\n",
1623
+ " <td>65.79%</td>\n",
1624
+ " <td>54.04%</td>\n",
1625
+ " <td>69.77%</td>\n",
1626
+ " </tr>\n",
1627
+ " <tr>\n",
1628
+ " <th>0.10</th>\n",
1629
+ " <td>69.77%</td>\n",
1630
+ " <td>63.77%</td>\n",
1631
+ " <td>64.86%</td>\n",
1632
+ " <td>71.43%</td>\n",
1633
+ " <td>66.67%</td>\n",
1634
+ " <td>67.30%</td>\n",
1635
+ " <td>71.43%</td>\n",
1636
+ " </tr>\n",
1637
+ " <tr>\n",
1638
+ " <th>without</th>\n",
1639
+ " <td>69.77%</td>\n",
1640
+ " <td>32.43%</td>\n",
1641
+ " <td>38.10%</td>\n",
1642
+ " <td>22.22%</td>\n",
1643
+ " <td>0.00%</td>\n",
1644
+ " <td>32.50%</td>\n",
1645
+ " <td>69.77%</td>\n",
1646
+ " </tr>\n",
1647
+ " <tr>\n",
1648
+ " <th rowspan=\"3\" valign=\"top\">XGBoost</th>\n",
1649
+ " <th>0.05</th>\n",
1650
+ " <td>46.67%</td>\n",
1651
+ " <td>45.28%</td>\n",
1652
+ " <td>51.85%</td>\n",
1653
+ " <td>35.00%</td>\n",
1654
+ " <td>43.14%</td>\n",
1655
+ " <td>44.39%</td>\n",
1656
+ " <td>51.85%</td>\n",
1657
+ " </tr>\n",
1658
+ " <tr>\n",
1659
+ " <th>0.10</th>\n",
1660
+ " <td>56.25%</td>\n",
1661
+ " <td>55.74%</td>\n",
1662
+ " <td>50.91%</td>\n",
1663
+ " <td>31.58%</td>\n",
1664
+ " <td>65.79%</td>\n",
1665
+ " <td>52.05%</td>\n",
1666
+ " <td>65.79%</td>\n",
1667
+ " </tr>\n",
1668
+ " <tr>\n",
1669
+ " <th>without</th>\n",
1670
+ " <td>67.69%</td>\n",
1671
+ " <td>62.30%</td>\n",
1672
+ " <td>60.71%</td>\n",
1673
+ " <td>52.83%</td>\n",
1674
+ " <td>58.82%</td>\n",
1675
+ " <td>60.47%</td>\n",
1676
+ " <td>67.69%</td>\n",
1677
+ " </tr>\n",
1678
+ " </tbody>\n",
1679
+ "</table>\n",
1680
+ "</div>"
1681
+ ],
1682
+ "text/plain": [
1683
+ "Split 1 2 3 4 5 Avg \\\n",
1684
+ "Model Scenario \n",
1685
+ "Decision Tree 0.05 57.97% 49.06% 64.41% 52.17% 66.67% 58.05% \n",
1686
+ " 0.10 45.16% 48.15% 46.15% 37.21% 0.00% 35.33% \n",
1687
+ " without 60.71% 65.52% 44.44% 48.98% 0.00% 43.93% \n",
1688
+ "Logistic Regression 0.05 54.84% 37.50% 21.05% 17.65% 64.00% 39.01% \n",
1689
+ " 0.10 57.53% 54.84% 42.55% 35.00% 61.97% 50.38% \n",
1690
+ " without 61.54% 54.90% 50.98% 35.00% 0.00% 40.48% \n",
1691
+ "Random Forest 0.05 44.07% 38.46% 20.51% 27.78% 65.85% 39.33% \n",
1692
+ " 0.10 52.31% 59.70% 22.22% 35.90% 66.67% 47.36% \n",
1693
+ " without 62.69% 60.00% 59.26% 59.65% 51.06% 58.53% \n",
1694
+ "SVM 0.05 69.77% 29.27% 54.55% 50.85% 65.79% 54.04% \n",
1695
+ " 0.10 69.77% 63.77% 64.86% 71.43% 66.67% 67.30% \n",
1696
+ " without 69.77% 32.43% 38.10% 22.22% 0.00% 32.50% \n",
1697
+ "XGBoost 0.05 46.67% 45.28% 51.85% 35.00% 43.14% 44.39% \n",
1698
+ " 0.10 56.25% 55.74% 50.91% 31.58% 65.79% 52.05% \n",
1699
+ " without 67.69% 62.30% 60.71% 52.83% 58.82% 60.47% \n",
1700
+ "\n",
1701
+ "Split Max \n",
1702
+ "Model Scenario \n",
1703
+ "Decision Tree 0.05 66.67% \n",
1704
+ " 0.10 48.15% \n",
1705
+ " without 65.52% \n",
1706
+ "Logistic Regression 0.05 64.00% \n",
1707
+ " 0.10 61.97% \n",
1708
+ " without 61.54% \n",
1709
+ "Random Forest 0.05 65.85% \n",
1710
+ " 0.10 66.67% \n",
1711
+ " without 62.69% \n",
1712
+ "SVM 0.05 69.77% \n",
1713
+ " 0.10 71.43% \n",
1714
+ " without 69.77% \n",
1715
+ "XGBoost 0.05 51.85% \n",
1716
+ " 0.10 65.79% \n",
1717
+ " without 67.69% "
1718
+ ]
1719
+ },
1720
+ "metadata": {},
1721
+ "output_type": "display_data"
1722
+ },
1723
+ {
1724
+ "name": "stdout",
1725
+ "output_type": "stream",
1726
+ "text": [
1727
+ "\n",
1728
+ "── AUC per split (plus Avg & Max) ──\n"
1729
+ ]
1730
+ },
1731
+ {
1732
+ "data": {
1733
+ "text/html": [
1734
+ "<div>\n",
1735
+ "<style scoped>\n",
1736
+ " .dataframe tbody tr th:only-of-type {\n",
1737
+ " vertical-align: middle;\n",
1738
+ " }\n",
1739
+ "\n",
1740
+ " .dataframe tbody tr th {\n",
1741
+ " vertical-align: top;\n",
1742
+ " }\n",
1743
+ "\n",
1744
+ " .dataframe thead th {\n",
1745
+ " text-align: right;\n",
1746
+ " }\n",
1747
+ "</style>\n",
1748
+ "<table border=\"1\" class=\"dataframe\">\n",
1749
+ " <thead>\n",
1750
+ " <tr style=\"text-align: right;\">\n",
1751
+ " <th></th>\n",
1752
+ " <th>Split</th>\n",
1753
+ " <th>1</th>\n",
1754
+ " <th>2</th>\n",
1755
+ " <th>3</th>\n",
1756
+ " <th>4</th>\n",
1757
+ " <th>5</th>\n",
1758
+ " <th>Avg</th>\n",
1759
+ " <th>Max</th>\n",
1760
+ " </tr>\n",
1761
+ " <tr>\n",
1762
+ " <th>Model</th>\n",
1763
+ " <th>Scenario</th>\n",
1764
+ " <th></th>\n",
1765
+ " <th></th>\n",
1766
+ " <th></th>\n",
1767
+ " <th></th>\n",
1768
+ " <th></th>\n",
1769
+ " <th></th>\n",
1770
+ " <th></th>\n",
1771
+ " </tr>\n",
1772
+ " </thead>\n",
1773
+ " <tbody>\n",
1774
+ " <tr>\n",
1775
+ " <th rowspan=\"3\" valign=\"top\">Decision Tree</th>\n",
1776
+ " <th>0.05</th>\n",
1777
+ " <td>0.4359</td>\n",
1778
+ " <td>0.5449</td>\n",
1779
+ " <td>0.6174</td>\n",
1780
+ " <td>0.6594</td>\n",
1781
+ " <td>0.5000</td>\n",
1782
+ " <td>0.5515</td>\n",
1783
+ " <td>0.6594</td>\n",
1784
+ " </tr>\n",
1785
+ " <tr>\n",
1786
+ " <th>0.10</th>\n",
1787
+ " <td>0.4231</td>\n",
1788
+ " <td>0.5372</td>\n",
1789
+ " <td>0.5465</td>\n",
1790
+ " <td>0.5774</td>\n",
1791
+ " <td>0.4764</td>\n",
1792
+ " <td>0.5121</td>\n",
1793
+ " <td>0.5774</td>\n",
1794
+ " </tr>\n",
1795
+ " <tr>\n",
1796
+ " <th>without</th>\n",
1797
+ " <td>0.6083</td>\n",
1798
+ " <td>0.6090</td>\n",
1799
+ " <td>0.6129</td>\n",
1800
+ " <td>0.6271</td>\n",
1801
+ " <td>0.5000</td>\n",
1802
+ " <td>0.5915</td>\n",
1803
+ " <td>0.6271</td>\n",
1804
+ " </tr>\n",
1805
+ " <tr>\n",
1806
+ " <th rowspan=\"3\" valign=\"top\">Logistic Regression</th>\n",
1807
+ " <th>0.05</th>\n",
1808
+ " <td>0.5077</td>\n",
1809
+ " <td>0.4679</td>\n",
1810
+ " <td>0.5768</td>\n",
1811
+ " <td>0.5523</td>\n",
1812
+ " <td>0.5867</td>\n",
1813
+ " <td>0.5383</td>\n",
1814
+ " <td>0.5867</td>\n",
1815
+ " </tr>\n",
1816
+ " <tr>\n",
1817
+ " <th>0.10</th>\n",
1818
+ " <td>0.5667</td>\n",
1819
+ " <td>0.5231</td>\n",
1820
+ " <td>0.5832</td>\n",
1821
+ " <td>0.5368</td>\n",
1822
+ " <td>0.5536</td>\n",
1823
+ " <td>0.5527</td>\n",
1824
+ " <td>0.5832</td>\n",
1825
+ " </tr>\n",
1826
+ " <tr>\n",
1827
+ " <th>without</th>\n",
1828
+ " <td>0.5603</td>\n",
1829
+ " <td>0.5808</td>\n",
1830
+ " <td>0.6348</td>\n",
1831
+ " <td>0.6845</td>\n",
1832
+ " <td>0.6250</td>\n",
1833
+ " <td>0.6171</td>\n",
1834
+ " <td>0.6845</td>\n",
1835
+ " </tr>\n",
1836
+ " <tr>\n",
1837
+ " <th rowspan=\"3\" valign=\"top\">Random Forest</th>\n",
1838
+ " <th>0.05</th>\n",
1839
+ " <td>0.3769</td>\n",
1840
+ " <td>0.4333</td>\n",
1841
+ " <td>0.6129</td>\n",
1842
+ " <td>0.6400</td>\n",
1843
+ " <td>0.6798</td>\n",
1844
+ " <td>0.5486</td>\n",
1845
+ " <td>0.6798</td>\n",
1846
+ " </tr>\n",
1847
+ " <tr>\n",
1848
+ " <th>0.10</th>\n",
1849
+ " <td>0.3949</td>\n",
1850
+ " <td>0.4205</td>\n",
1851
+ " <td>0.5316</td>\n",
1852
+ " <td>0.6258</td>\n",
1853
+ " <td>0.6033</td>\n",
1854
+ " <td>0.5152</td>\n",
1855
+ " <td>0.6258</td>\n",
1856
+ " </tr>\n",
1857
+ " <tr>\n",
1858
+ " <th>without</th>\n",
1859
+ " <td>0.5160</td>\n",
1860
+ " <td>0.6147</td>\n",
1861
+ " <td>0.6277</td>\n",
1862
+ " <td>0.6323</td>\n",
1863
+ " <td>0.6078</td>\n",
1864
+ " <td>0.5997</td>\n",
1865
+ " <td>0.6323</td>\n",
1866
+ " </tr>\n",
1867
+ " <tr>\n",
1868
+ " <th rowspan=\"3\" valign=\"top\">SVM</th>\n",
1869
+ " <th>0.05</th>\n",
1870
+ " <td>0.5333</td>\n",
1871
+ " <td>0.4385</td>\n",
1872
+ " <td>0.4974</td>\n",
1873
+ " <td>0.4916</td>\n",
1874
+ " <td>0.5587</td>\n",
1875
+ " <td>0.5039</td>\n",
1876
+ " <td>0.5587</td>\n",
1877
+ " </tr>\n",
1878
+ " <tr>\n",
1879
+ " <th>0.10</th>\n",
1880
+ " <td>0.5000</td>\n",
1881
+ " <td>0.5321</td>\n",
1882
+ " <td>0.4594</td>\n",
1883
+ " <td>0.5058</td>\n",
1884
+ " <td>0.5995</td>\n",
1885
+ " <td>0.5193</td>\n",
1886
+ " <td>0.5995</td>\n",
1887
+ " </tr>\n",
1888
+ " <tr>\n",
1889
+ " <th>without</th>\n",
1890
+ " <td>0.5192</td>\n",
1891
+ " <td>0.5718</td>\n",
1892
+ " <td>0.3742</td>\n",
1893
+ " <td>0.3155</td>\n",
1894
+ " <td>0.3750</td>\n",
1895
+ " <td>0.4311</td>\n",
1896
+ " <td>0.5718</td>\n",
1897
+ " </tr>\n",
1898
+ " <tr>\n",
1899
+ " <th rowspan=\"3\" valign=\"top\">XGBoost</th>\n",
1900
+ " <th>0.05</th>\n",
1901
+ " <td>0.5026</td>\n",
1902
+ " <td>0.5051</td>\n",
1903
+ " <td>0.5923</td>\n",
1904
+ " <td>0.6335</td>\n",
1905
+ " <td>0.5434</td>\n",
1906
+ " <td>0.5554</td>\n",
1907
+ " <td>0.6335</td>\n",
1908
+ " </tr>\n",
1909
+ " <tr>\n",
1910
+ " <th>0.10</th>\n",
1911
+ " <td>0.5000</td>\n",
1912
+ " <td>0.4744</td>\n",
1913
+ " <td>0.5458</td>\n",
1914
+ " <td>0.6529</td>\n",
1915
+ " <td>0.6059</td>\n",
1916
+ " <td>0.5558</td>\n",
1917
+ " <td>0.6529</td>\n",
1918
+ " </tr>\n",
1919
+ " <tr>\n",
1920
+ " <th>without</th>\n",
1921
+ " <td>0.6321</td>\n",
1922
+ " <td>0.5872</td>\n",
1923
+ " <td>0.6058</td>\n",
1924
+ " <td>0.6181</td>\n",
1925
+ " <td>0.6390</td>\n",
1926
+ " <td>0.6164</td>\n",
1927
+ " <td>0.6390</td>\n",
1928
+ " </tr>\n",
1929
+ " </tbody>\n",
1930
+ "</table>\n",
1931
+ "</div>"
1932
+ ],
1933
+ "text/plain": [
1934
+ "Split 1 2 3 4 5 Avg \\\n",
1935
+ "Model Scenario \n",
1936
+ "Decision Tree 0.05 0.4359 0.5449 0.6174 0.6594 0.5000 0.5515 \n",
1937
+ " 0.10 0.4231 0.5372 0.5465 0.5774 0.4764 0.5121 \n",
1938
+ " without 0.6083 0.6090 0.6129 0.6271 0.5000 0.5915 \n",
1939
+ "Logistic Regression 0.05 0.5077 0.4679 0.5768 0.5523 0.5867 0.5383 \n",
1940
+ " 0.10 0.5667 0.5231 0.5832 0.5368 0.5536 0.5527 \n",
1941
+ " without 0.5603 0.5808 0.6348 0.6845 0.6250 0.6171 \n",
1942
+ "Random Forest 0.05 0.3769 0.4333 0.6129 0.6400 0.6798 0.5486 \n",
1943
+ " 0.10 0.3949 0.4205 0.5316 0.6258 0.6033 0.5152 \n",
1944
+ " without 0.5160 0.6147 0.6277 0.6323 0.6078 0.5997 \n",
1945
+ "SVM 0.05 0.5333 0.4385 0.4974 0.4916 0.5587 0.5039 \n",
1946
+ " 0.10 0.5000 0.5321 0.4594 0.5058 0.5995 0.5193 \n",
1947
+ " without 0.5192 0.5718 0.3742 0.3155 0.3750 0.4311 \n",
1948
+ "XGBoost 0.05 0.5026 0.5051 0.5923 0.6335 0.5434 0.5554 \n",
1949
+ " 0.10 0.5000 0.4744 0.5458 0.6529 0.6059 0.5558 \n",
1950
+ " without 0.6321 0.5872 0.6058 0.6181 0.6390 0.6164 \n",
1951
+ "\n",
1952
+ "Split Max \n",
1953
+ "Model Scenario \n",
1954
+ "Decision Tree 0.05 0.6594 \n",
1955
+ " 0.10 0.5774 \n",
1956
+ " without 0.6271 \n",
1957
+ "Logistic Regression 0.05 0.5867 \n",
1958
+ " 0.10 0.5832 \n",
1959
+ " without 0.6845 \n",
1960
+ "Random Forest 0.05 0.6798 \n",
1961
+ " 0.10 0.6258 \n",
1962
+ " without 0.6323 \n",
1963
+ "SVM 0.05 0.5587 \n",
1964
+ " 0.10 0.5995 \n",
1965
+ " without 0.5718 \n",
1966
+ "XGBoost 0.05 0.6335 \n",
1967
+ " 0.10 0.6529 \n",
1968
+ " without 0.6390 "
1969
+ ]
1970
+ },
1971
+ "metadata": {},
1972
+ "output_type": "display_data"
1973
+ }
1974
+ ],
1975
+ "source": [
1976
+ "# ================================================================\n",
1977
+ "# Direction-of-Move Classification (Monthly Version, nested CV)\n",
1978
+ "# • All input CSVs have first column \"Month\" (YYYY-MM)\n",
1979
+ "# • HMM & LSTM discarded, XGBoost retained\n",
1980
+ "# • Feature standardisation before model training\n",
1981
+ "# • Accuracy, AUC, F1 tables + label-distribution table\n",
1982
+ "# ================================================================\n",
1983
+ "\n",
1984
+ "import pathlib, warnings, numpy as np, pandas as pd\n",
1985
+ "from statsmodels.tsa.stattools import adfuller, coint, grangercausalitytests\n",
1986
+ "from sklearn.model_selection import GridSearchCV, TimeSeriesSplit\n",
1987
+ "from sklearn.preprocessing import StandardScaler\n",
1988
+ "from sklearn.linear_model import LogisticRegression\n",
1989
+ "from sklearn.tree import DecisionTreeClassifier\n",
1990
+ "from sklearn.ensemble import RandomForestClassifier\n",
1991
+ "from sklearn.svm import SVC\n",
1992
+ "import xgboost as xgb\n",
1993
+ "from sklearn.metrics import accuracy_score, f1_score, roc_auc_score\n",
1994
+ "import matplotlib.pyplot as plt\n",
1995
+ "\n",
1996
+ "warnings.filterwarnings(\"ignore\")\n",
1997
+ "pd.set_option(\"display.float_format\", \"{:,.4f}\".format)\n",
1998
+ "np.random.seed(42)\n",
1999
+ "\n",
2000
+ "# ─────────────── 1│ data ────────────────────────────────────────\n",
2001
+ "ROOT = pathlib.Path(\".\")\n",
2002
+ "\n",
2003
+ "def load_copper():\n",
2004
+ "    \"\"\"Load monthly copper prices (Month,Price).\"\"\"\n",
2005
+ "    return (pd.read_csv(ROOT / \"Copper Prices.csv\")\n",
2006
+ "            .assign(Month=lambda d: pd.to_datetime(d[\"Month\"], format=\"%Y-%m\"))\n",
2007
+ "            .set_index(\"Month\").asfreq(\"MS\")  # month-start frequency\n",
2008
+ "            .rename(columns={\"Price\": \"Copper_Price\"})[\"Copper_Price\"])\n",
2009
+ "\n",
2010
+ "def load_trends():\n",
2011
+ "    \"\"\"Load all Google-Trend CSVs (Month,value) from sub-folders.\"\"\"\n",
2012
+ "    def one(folder):\n",
2013
+ "        frames = []\n",
2014
+ "        for fp in (ROOT / folder).glob(\"*.csv\"):\n",
2015
+ "            key = fp.stem.replace(\",\", \"\")\n",
2016
+ "            t = pd.read_csv(fp)\n",
2017
+ "            t.columns = [c.strip() for c in t.columns]\n",
2018
+ "            frames.append(\n",
2019
+ "                t.assign(Month=lambda d: pd.to_datetime(d[t.columns[0]], format=\"%Y-%m\"))\n",
2020
+ "                 .set_index(\"Month\").asfreq(\"MS\")\n",
2021
+ "                 .rename(columns={t.columns[1]: key})\n",
2022
+ "            )\n",
2023
+ "        return pd.concat(frames, axis=1) if frames else pd.DataFrame()\n",
2024
+ "\n",
2025
+ "    cats = [\"Supply Factors\", \"Demand Factors\", \"Speculative Factors\", \"Sudden Factors\"]\n",
2026
+ "    return pd.concat([one(c) for c in cats], axis=1).sort_index()\n",
2027
+ "\n",
2028
+ "copper, trends = load_copper(), load_trends()\n",
2029
+ "data_raw = pd.concat([copper, trends], axis=1).dropna()\n",
2030
+ "\n",
2031
+ "# ─────────────── 2│ statistical filters (same as before) ───────\n",
2032
+ "def adf_p(s): return adfuller(s.dropna(), autolag=\"AIC\")[1]\n",
2033
+ "ADF, COINT, MAX_LAG = 0.10, 0.10, 18 # identical thresholds\n",
2034
+ "\n",
2035
+ "i1 = [c for c in data_raw.columns\n",
2036
+ " if adf_p(data_raw[c]) > ADF\n",
2037
+ " and adf_p(data_raw[c].diff()) < ADF\n",
2038
+ " and c != \"Copper_Price\"]\n",
2039
+ "cands = [s for s in i1 if coint(data_raw[\"Copper_Price\"], data_raw[s])[1] < COINT]\n",
2040
+ "\n",
2041
+ "minp = {s: min(grangercausalitytests(\n",
2042
+ " data_raw[[\"Copper_Price\", s]].dropna().values,\n",
2043
+ " maxlag=MAX_LAG, verbose=False)[lag][0][\"ssr_ftest\"][1]\n",
2044
+ " for lag in range(1, MAX_LAG + 1))\n",
2045
+ " for s in cands}\n",
2046
+ "\n",
2047
+ "TIERS = {0.05: [s for s, p in minp.items() if p < 0.05],\n",
2048
+ " 0.10: [s for s, p in minp.items() if p < 0.10]}\n",
2049
+ "\n",
2050
+ "def lag_df(feats, lag=1):\n",
2051
+ "    \"\"\"Add a lagged copy (default: one month) of price + selected features; drop NA.\"\"\"\n",
2052
+ "    out = {\"Copper_Price\": data_raw[\"Copper_Price\"],\n",
2053
+ "           f\"Copper_Price_lag{lag}\": data_raw[\"Copper_Price\"].shift(lag)}\n",
2054
+ "    out.update({f\"{f}_lag{lag}\": data_raw[f].shift(lag) for f in feats})\n",
2055
+ "    return pd.DataFrame(out).dropna()\n",
2056
+ "\n",
2057
+ "SCENS = {\"without\": lag_df([]),\n",
2058
+ " \"0.05\" : lag_df(TIERS[0.05]),\n",
2059
+ " \"0.10\" : lag_df(TIERS[0.10])}\n",
2060
+ "for k in SCENS:\n",
2061
+ "    # label = direction of *next* month’s price change; the final month has no next change\n",
2062
+ "    SCENS[k][\"y\"] = (SCENS[k][\"Copper_Price\"].diff().shift(-1) > 0).astype(int)\n",
2063
+ "    SCENS[k] = SCENS[k].iloc[:-1]  # drop last row: NaN > 0 is False, so it would be mislabelled 0\n",
2064
+ "\n",
2065
+ "# ─────────────── 3│ label-distribution table (5 splits) ─────────\n",
2066
+ "df_ref, n = SCENS[\"without\"], len(SCENS[\"without\"])\n",
2067
+ "TEST_FRAC = 0.22\n",
2068
+ "test_len = int(n * TEST_FRAC)\n",
2069
+ "\n",
2070
+ "rows = []\n",
2071
+ "for i in range(5):\n",
2071
+ "    train_end = int(n * (0.80 + i * 0.05))\n",
2072
+ "    tr, te = slice(0, train_end - test_len), slice(train_end - test_len, train_end)\n",
2073
+ "    y_tr, y_te = df_ref[\"y\"].iloc[tr], df_ref[\"y\"].iloc[te]\n",
2074
+ "    c_tr = y_tr.value_counts().reindex([0,1]).fillna(0).astype(int)\n",
2075
+ "    c_te = y_te.value_counts().reindex([0,1]).fillna(0).astype(int)\n",
2076
+ "    rows.append([i+1,\n",
2077
+ "                 c_tr[0], c_tr[1], c_tr[0]/c_tr.sum()*100, c_tr[1]/c_tr.sum()*100,\n",
2078
+ "                 c_te[0], c_te[1], c_te[0]/c_te.sum()*100, c_te[1]/c_te.sum()*100])\n",
2080
+ "\n",
2081
+ "cols = [\"Split\",\"Train 0\",\"Train 1\",\"Train 0 %\",\"Train 1 %\",\n",
2082
+ " \"Test 0\",\"Test 1\",\"Test 0 %\",\"Test 1 %\"]\n",
2083
+ "label_dist = (pd.DataFrame(rows, columns=cols)\n",
2084
+ " .set_index(\"Split\")\n",
2085
+ " .applymap(lambda x: f\"{x:.1f}%\" if isinstance(x, float) else x))\n",
2086
+ "print(\"\\n── Label distribution across five splits ──\")\n",
2087
+ "display(label_dist)\n",
2088
+ "\n",
2089
+ "# ─────────────── 4│ model grids (unchanged) ─────────────────────\n",
2090
+ "GRIDS = {\n",
2091
+ " \"XGBoost\" : [{\"n_estimators\":[400,600],\n",
2092
+ " \"max_depth\":[3,5],\n",
2093
+ " \"learning_rate\":[0.03,0.07],\n",
2094
+ " \"subsample\":[0.8,1.0]}],\n",
2095
+ " \"Logistic Regression\":[{\"C\":[0.1,1,10]}],\n",
2096
+ " \"Decision Tree\":[{\"max_depth\":[3,5,8],\n",
2097
+ " \"min_samples_leaf\":[2,4,6]}],\n",
2098
+ " \"Random Forest\":[{\"n_estimators\":[300,500],\n",
2099
+ " \"max_depth\":[4,6],\n",
2100
+ " \"min_samples_leaf\":[3,5]}],\n",
2101
+ " \"SVM\":[{\"C\":[0.1,1,10],\"gamma\":[0.01,0.1]}],\n",
2102
+ "}\n",
2103
+ "\n",
2104
+ "# ─────────────── 5│ expanding-window generator ──────────────────\n",
2104
+ "def expanding_splits(n_rows, test_frac=TEST_FRAC, n_splits=5):\n",
2105
+ "    t_len = int(n_rows * test_frac)\n",
2106
+ "    for i in range(n_splits):\n",
2107
+ "        end = int(n_rows * (0.80 + i*0.05))\n",
2108
+ "        yield np.arange(end - t_len), np.arange(end - t_len, end)\n",
2110
+ "\n",
2111
+ "INNER_CV = TimeSeriesSplit(n_splits=4)\n",
2112
+ "records = []\n",
2113
+ "\n",
2114
+ "# ─────────────── 6│ nested CV loop ──────────────────────────────\n",
2114
+ "for scen, df in SCENS.items():\n",
2115
+ "    X_full, y_full = df.drop(columns=[\"Copper_Price\",\"y\"]), df[\"y\"]\n",
2116
+ "    n = len(X_full)\n",
2117
+ "\n",
2118
+ "    for split_idx, (tr_idx, te_idx) in enumerate(expanding_splits(n), 1):\n",
2119
+ "        X_tr_raw, y_tr = X_full.iloc[tr_idx], y_full.iloc[tr_idx]\n",
2120
+ "        X_te_raw, y_te = X_full.iloc[te_idx], y_full.iloc[te_idx]\n",
2121
+ "\n",
2122
+ "        scaler = StandardScaler().fit(X_tr_raw)\n",
2123
+ "        X_tr = pd.DataFrame(scaler.transform(X_tr_raw), columns=X_tr_raw.columns, index=X_tr_raw.index)\n",
2124
+ "        X_te = pd.DataFrame(scaler.transform(X_te_raw), columns=X_te_raw.columns, index=X_te_raw.index)\n",
2125
+ "\n",
2126
+ "        counts = y_tr.value_counts()\n",
2127
+ "\n",
2128
+ "        for mname, grid in GRIDS.items():\n",
2129
+ "            if mname == \"Logistic Regression\":\n",
2130
+ "                base = LogisticRegression(max_iter=1000, class_weight='balanced')\n",
2131
+ "            elif mname == \"Decision Tree\":\n",
2132
+ "                base = DecisionTreeClassifier(random_state=42, class_weight='balanced')\n",
2133
+ "            elif mname == \"Random Forest\":\n",
2134
+ "                base = RandomForestClassifier(random_state=42, class_weight='balanced')\n",
2135
+ "            elif mname == \"SVM\":\n",
2136
+ "                base = SVC(kernel=\"rbf\", probability=True, class_weight='balanced', random_state=42)\n",
2137
+ "            elif mname == \"XGBoost\":\n",
2138
+ "                spw = counts.get(0,1) / counts.get(1,1) if len(counts)==2 else 1\n",
2139
+ "                base = xgb.XGBClassifier(random_state=42,\n",
2140
+ "                                         objective='binary:logistic',\n",
2141
+ "                                         eval_metric='logloss',\n",
2142
+ "                                         use_label_encoder=False,\n",
2143
+ "                                         scale_pos_weight=spw)\n",
2144
+ "\n",
2145
+ "            best = (GridSearchCV(base, grid, cv=INNER_CV,\n",
2146
+ "                                 scoring=\"accuracy\", n_jobs=-1)\n",
2147
+ "                    .fit(X_tr, y_tr)\n",
2148
+ "                    .best_estimator_)\n",
2149
+ "\n",
2150
+ "            y_hat = best.predict(X_te)\n",
2151
+ "            proba = best.predict_proba(X_te)[:,1] if hasattr(best,\"predict_proba\") else None\n",
2152
+ "\n",
2153
+ "            acc = accuracy_score(y_te, y_hat)\n",
2154
+ "            f1 = f1_score(y_te, y_hat, zero_division=0)\n",
2155
+ "            auc = (roc_auc_score(y_te, proba)\n",
2156
+ "                   if proba is not None and len(np.unique(y_te))==2 else np.nan)\n",
2157
+ "\n",
2158
+ "            records.append({\"Model\":mname,\"Scenario\":scen,\"Split\":split_idx,\n",
2159
+ "                            \"Accuracy\":acc,\"F1\":f1,\"AUC\":auc})\n",
2161
+ "\n",
2162
+ "# ─────────────── 7│ summary tables ──────────────────────────────\n",
2162
+ "tbl = pd.DataFrame(records)\n",
2163
+ "\n",
2164
+ "def metric_tbl(metric, fmt):\n",
2165
+ "    piv = tbl.pivot_table(index=[\"Model\",\"Scenario\"], columns=\"Split\", values=metric)\n",
2166
+ "    piv[\"Avg\"] = piv.mean(axis=1)\n",
2167
+ "    piv[\"Max\"] = piv[[1,2,3,4,5]].max(axis=1)\n",
2168
+ "    return piv.applymap(fmt)\n",
2170
+ "\n",
2171
+ "pct = lambda x: f\"{x:.2%}\"\n",
2172
+ "au = lambda x: f\"{x:.4f}\"\n",
2173
+ "\n",
2174
+ "print(\"\\n── Accuracy per split (plus Avg & Max) ──\")\n",
2175
+ "display(metric_tbl(\"Accuracy\", pct))\n",
2176
+ "\n",
2177
+ "print(\"\\n── F1-score per split (plus Avg & Max) ──\")\n",
2178
+ "display(metric_tbl(\"F1\", pct))\n",
2179
+ "\n",
2180
+ "print(\"\\n── AUC per split (plus Avg & Max) ──\")\n",
2181
+ "display(metric_tbl(\"AUC\", au))\n"
2182
+ ]
2183
+ }
2184
+ ],
2185
+ "metadata": {
2186
+ "kernelspec": {
2187
+ "display_name": ".venv",
2188
+ "language": "python",
2189
+ "name": "python3"
2190
+ },
2191
+ "language_info": {
2192
+ "codemirror_mode": {
2193
+ "name": "ipython",
2194
+ "version": 3
2195
+ },
2196
+ "file_extension": ".py",
2197
+ "mimetype": "text/x-python",
2198
+ "name": "python",
2199
+ "nbconvert_exporter": "python",
2200
+ "pygments_lexer": "ipython3",
2201
+ "version": "3.12.11"
2202
+ }
2203
+ },
2204
+ "nbformat": 4,
2205
+ "nbformat_minor": 5
2206
+ }