hieptran318204 committed on
Commit
b649f54
·
verified ·
1 Parent(s): 1a46dab

Upload 4 files

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ Report[[:space:]]Project.pdf filter=lfs diff=lfs merge=lfs -text
Report Project.pdf ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:b0e25158be5abecf74fa4bedf1d8557fa486d7dadddfead04d6720cc75ca2bea
3
+ size 833995
dataAnalysis_notebook.ipynb ADDED
The diff for this file is too large to render. See raw diff
 
video_report_link ADDED
@@ -0,0 +1 @@
1
+ https://drive.google.com/file/d/19DcvTFzvZM6kIBqHA-f7PWvzziYKgjNu/view?usp=sharing
votingfinal.ipynb ADDED
@@ -0,0 +1,1910 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "id": "082fc435",
6
+ "metadata": {
7
+ "papermill": {
8
+ "duration": 0.005415,
9
+ "end_time": "2024-12-22T02:22:33.724559",
10
+ "exception": false,
11
+ "start_time": "2024-12-22T02:22:33.719144",
12
+ "status": "completed"
13
+ },
14
+ "tags": []
15
+ },
16
+ "source": [
17
+ "# Summary:\n",
18
+ "Our team uses a Voting Regressor to combine the main models: LightGBM, XGBoost, and CatBoost.\n",
19
+ "\n",
20
+ "LightGBM, XGBoost, and CatBoost are all gradient boosting models. Roughly speaking, gradient boosting trains many small models one after another, with each model improving on the weaknesses of the previous one. Finally, voting is applied across the models to arrive at the best possible final model."
21
+ ]
22
+ },
23
+ {
24
+ "cell_type": "markdown",
25
+ "id": "2616f6de",
26
+ "metadata": {
27
+ "papermill": {
28
+ "duration": 0.004335,
29
+ "end_time": "2024-12-22T02:22:33.733527",
30
+ "exception": false,
31
+ "start_time": "2024-12-22T02:22:33.729192",
32
+ "status": "completed"
33
+ },
34
+ "tags": []
35
+ },
36
+ "source": [
37
+ "# Import the required libraries"
38
+ ]
39
+ },
40
+ {
41
+ "cell_type": "code",
42
+ "execution_count": 1,
43
+ "id": "d22f42fc",
44
+ "metadata": {
45
+ "_cell_guid": "b1076dfc-b9ad-4769-8c92-a6c4dae69d19",
46
+ "_uuid": "8f2839f25d086af736a60e9eeb907d3b93b6e0e5",
47
+ "execution": {
48
+ "iopub.execute_input": "2024-12-22T02:22:33.743698Z",
49
+ "iopub.status.busy": "2024-12-22T02:22:33.743312Z",
50
+ "iopub.status.idle": "2024-12-22T02:22:49.437492Z",
51
+ "shell.execute_reply": "2024-12-22T02:22:49.436542Z"
52
+ },
53
+ "papermill": {
54
+ "duration": 15.701408,
55
+ "end_time": "2024-12-22T02:22:49.439287",
56
+ "exception": false,
57
+ "start_time": "2024-12-22T02:22:33.737879",
58
+ "status": "completed"
59
+ },
60
+ "tags": []
61
+ },
62
+ "outputs": [],
63
+ "source": [
64
+ "import numpy as np\n",
65
+ "import pandas as pd\n",
66
+ "import os\n",
67
+ "import re\n",
68
+ "from sklearn.base import clone\n",
69
+ "from sklearn.metrics import cohen_kappa_score\n",
70
+ "from sklearn.model_selection import StratifiedKFold\n",
71
+ "from scipy.optimize import minimize\n",
72
+ "from concurrent.futures import ThreadPoolExecutor\n",
73
+ "from tqdm import tqdm\n",
74
+ "import polars as pl\n",
75
+ "import polars.selectors as cs\n",
76
+ "import matplotlib.pyplot as plt\n",
77
+ "from matplotlib.ticker import MaxNLocator, FormatStrFormatter, PercentFormatter\n",
78
+ "import seaborn as sns\n",
79
+ "\n",
80
+ "from sklearn.preprocessing import StandardScaler\n",
81
+ "import matplotlib.pyplot as plt\n",
82
+ "from keras.models import Model\n",
83
+ "from keras.layers import Input, Dense\n",
84
+ "from keras.optimizers import Adam\n",
85
+ "import torch\n",
86
+ "import torch.nn as nn\n",
87
+ "import torch.optim as optim\n",
88
+ "\n",
89
+ "from colorama import Fore, Style\n",
90
+ "from IPython.display import clear_output\n",
91
+ "import warnings\n",
92
+ "from lightgbm import LGBMRegressor\n",
93
+ "from xgboost import XGBRegressor\n",
94
+ "from catboost import CatBoostRegressor\n",
95
+ "from sklearn.ensemble import VotingRegressor, RandomForestRegressor, GradientBoostingRegressor\n",
96
+ "from sklearn.impute import SimpleImputer, KNNImputer\n",
97
+ "from sklearn.pipeline import Pipeline\n",
98
+ "warnings.filterwarnings('ignore')\n",
99
+ "pd.options.display.max_columns = None"
100
+ ]
101
+ },
102
+ {
103
+ "cell_type": "markdown",
104
+ "id": "721e9104",
105
+ "metadata": {
106
+ "papermill": {
107
+ "duration": 0.004393,
108
+ "end_time": "2024-12-22T02:22:49.451298",
109
+ "exception": false,
110
+ "start_time": "2024-12-22T02:22:49.446905",
111
+ "status": "completed"
112
+ },
113
+ "tags": []
114
+ },
115
+ "source": [
116
+ "# Data processing\n",
117
+ "\n"
118
+ ]
119
+ },
120
+ {
121
+ "cell_type": "code",
122
+ "execution_count": 2,
123
+ "id": "c27d5918",
124
+ "metadata": {
125
+ "execution": {
126
+ "iopub.execute_input": "2024-12-22T02:22:49.461518Z",
127
+ "iopub.status.busy": "2024-12-22T02:22:49.460838Z",
128
+ "iopub.status.idle": "2024-12-22T02:22:49.466795Z",
129
+ "shell.execute_reply": "2024-12-22T02:22:49.466144Z"
130
+ },
131
+ "papermill": {
132
+ "duration": 0.012195,
133
+ "end_time": "2024-12-22T02:22:49.468011",
134
+ "exception": false,
135
+ "start_time": "2024-12-22T02:22:49.455816",
136
+ "status": "completed"
137
+ },
138
+ "tags": []
139
+ },
140
+ "outputs": [],
141
+ "source": [
142
+ "# Data preprocessing\n",
143
+ "def data_preprocessing(data):\n",
144
+ " \n",
145
+ " # Drop the columns whose names contain 'Season'\n",
146
+ " season_cols = [col for col in data.columns if 'Season' in col]\n",
147
+ " data = data.drop(season_cols, axis=1)\n",
148
+ " \n",
149
+ " # Create some useful new features\n",
150
+ " data['BMI_Age'] = data['Physical-BMI'] * data['Basic_Demos-Age']\n",
151
+ " data['Internet_Hours_Age'] = data['PreInt_EduHx-computerinternet_hoursday'] * data['Basic_Demos-Age']\n",
152
+ " data['BMI_Internet_Hours'] = data['Physical-BMI'] * data['PreInt_EduHx-computerinternet_hoursday']\n",
153
+ " data['BFP_BMI'] = data['BIA-BIA_Fat'] / data['BIA-BIA_BMI']\n",
154
+ " data['FFMI_BFP'] = data['BIA-BIA_FFMI'] / data['BIA-BIA_Fat']\n",
155
+ " data['FMI_BFP'] = data['BIA-BIA_FMI'] / data['BIA-BIA_Fat']\n",
156
+ " data['LST_TBW'] = data['BIA-BIA_LST'] / data['BIA-BIA_TBW']\n",
157
+ " data['BFP_BMR'] = data['BIA-BIA_Fat'] * data['BIA-BIA_BMR']\n",
158
+ " data['BFP_DEE'] = data['BIA-BIA_Fat'] * data['BIA-BIA_DEE']\n",
159
+ " data['BMR_Weight'] = data['BIA-BIA_BMR'] / data['Physical-Weight']\n",
160
+ " data['DEE_Weight'] = data['BIA-BIA_DEE'] / data['Physical-Weight']\n",
161
+ " data['SMM_Height'] = data['BIA-BIA_SMM'] / data['Physical-Height']\n",
162
+ " data['Muscle_to_Fat'] = data['BIA-BIA_SMM'] / data['BIA-BIA_FMI']\n",
163
+ " data['Hydration_Status'] = data['BIA-BIA_TBW'] / data['Physical-Weight']\n",
164
+ " data['ICW_TBW'] = data['BIA-BIA_ICW'] / data['BIA-BIA_TBW']\n",
165
+ " \n",
166
+ " return data"
167
+ ]
168
+ },
169
+ {
170
+ "cell_type": "code",
171
+ "execution_count": 3,
172
+ "id": "a3e9e099",
173
+ "metadata": {
174
+ "execution": {
175
+ "iopub.execute_input": "2024-12-22T02:22:49.477745Z",
176
+ "iopub.status.busy": "2024-12-22T02:22:49.477500Z",
177
+ "iopub.status.idle": "2024-12-22T02:22:49.482339Z",
178
+ "shell.execute_reply": "2024-12-22T02:22:49.481716Z"
179
+ },
180
+ "papermill": {
181
+ "duration": 0.011158,
182
+ "end_time": "2024-12-22T02:22:49.483650",
183
+ "exception": false,
184
+ "start_time": "2024-12-22T02:22:49.472492",
185
+ "status": "completed"
186
+ },
187
+ "tags": []
188
+ },
189
+ "outputs": [],
190
+ "source": [
191
+ "# Read and process the parquet data\n",
192
+ "def process_parquet_file(file_name, file_path):\n",
193
+ " df = pd.read_parquet(os.path.join(file_path, file_name, 'part-0.parquet'))\n",
194
+ " df.drop('step', axis=1, inplace=True)\n",
195
+ " return df.describe().values.reshape(-1), file_name.split('=')[1]\n",
196
+ "\n",
197
+ "def load_parquet_file(file_path) -> pd.DataFrame:\n",
198
+ " # List the files\n",
199
+ " file_list = os.listdir(file_path)\n",
200
+ " \n",
201
+ " # Use a thread pool to process the files concurrently\n",
202
+ " with ThreadPoolExecutor() as executor:\n",
203
+ " results = list(tqdm(executor.map(lambda fname: process_parquet_file(fname, file_path), file_list), total=len(file_list)))\n",
204
+ " \n",
205
+ " # Unpack the summary statistics and the ids\n",
206
+ " stats, indexes = zip(*results)\n",
207
+ " \n",
208
+ " df = pd.DataFrame(stats, columns=[f\"stat_{i}\" for i in range(len(stats[0]))])\n",
209
+ " df['id'] = indexes\n",
210
+ " return df"
211
+ ]
212
+ },
213
+ {
214
+ "cell_type": "code",
215
+ "execution_count": 4,
216
+ "id": "f9609c25",
217
+ "metadata": {
218
+ "execution": {
219
+ "iopub.execute_input": "2024-12-22T02:22:49.494124Z",
220
+ "iopub.status.busy": "2024-12-22T02:22:49.493853Z",
221
+ "iopub.status.idle": "2024-12-22T02:22:49.502960Z",
222
+ "shell.execute_reply": "2024-12-22T02:22:49.502215Z"
223
+ },
224
+ "papermill": {
225
+ "duration": 0.016439,
226
+ "end_time": "2024-12-22T02:22:49.504500",
227
+ "exception": false,
228
+ "start_time": "2024-12-22T02:22:49.488061",
229
+ "status": "completed"
230
+ },
231
+ "tags": []
232
+ },
233
+ "outputs": [],
234
+ "source": [
235
+ "# Encode the data with an AutoEncoder\n",
236
+ "class AutoEncoder(nn.Module):\n",
237
+ " def __init__(self, input_dimen, encode_dimen):\n",
238
+ " super(AutoEncoder, self).__init__()\n",
239
+ " self.encoder = nn.Sequential(\n",
240
+ " nn.Linear(input_dimen, encode_dimen*3),\n",
241
+ " nn.ReLU(),\n",
242
+ " nn.Linear(encode_dimen*3, encode_dimen*2),\n",
243
+ " nn.ReLU(),\n",
244
+ " nn.Linear(encode_dimen*2, encode_dimen),\n",
245
+ " nn.ReLU()\n",
246
+ " )\n",
247
+ " self.decoder = nn.Sequential(\n",
248
+ " nn.Linear(encode_dimen, input_dimen*2),\n",
249
+ " nn.ReLU(),\n",
250
+ " nn.Linear(input_dimen*2, input_dimen*3),\n",
251
+ " nn.ReLU(),\n",
252
+ " nn.Linear(input_dimen*3, input_dimen),\n",
253
+ " nn.Sigmoid()\n",
254
+ " )\n",
255
+ " \n",
256
+ " def forward(self, x):\n",
257
+ " encoded = self.encoder(x)\n",
258
+ " decoded = self.decoder(encoded)\n",
259
+ " return decoded\n",
260
+ "\n",
261
+ "# Encode the data down to 50 dimensions (the default encoding_dim)\n",
262
+ "def perform_autoencoder(df, encoding_dim=50, epochs=50, batch_size=32):\n",
263
+ " # Standardize the data to z-scores (mean = 0, variance = 1)\n",
264
+ " scaler = StandardScaler()\n",
265
+ " df_scaled = scaler.fit_transform(df)\n",
266
+ " \n",
267
+ " # Convert the standardized data to a tensor for use in the neural network\n",
268
+ " data_tensor = torch.FloatTensor(df_scaled)\n",
269
+ " \n",
270
+ " # Initialize the AutoEncoder\n",
271
+ " input_dim = data_tensor.shape[1]\n",
272
+ " autoencoder = AutoEncoder(input_dim, encoding_dim)\n",
273
+ " \n",
274
+ " # Set up the loss function and the optimizer\n",
275
+ " criterion = nn.MSELoss()\n",
276
+ " optimizer = optim.Adam(autoencoder.parameters())\n",
277
+ " \n",
278
+ " # Train the autoencoder\n",
279
+ " for epoch in range(epochs):\n",
280
+ " for i in range(0, len(data_tensor), batch_size):\n",
281
+ " batch = data_tensor[i : i + batch_size]\n",
282
+ " optimizer.zero_grad()\n",
283
+ " reconstructed = autoencoder(batch)\n",
284
+ " loss = criterion(reconstructed, batch)\n",
285
+ " loss.backward()\n",
286
+ " optimizer.step()\n",
287
+ " \n",
288
+ " # Print the loss every 10 epochs to monitor training\n",
289
+ " if (epoch + 1) % 10 == 0:\n",
290
+ " print(f'Epoch thứ [{epoch + 1}/{epochs}], Loss = {loss.item():.4f}]')\n",
291
+ " # Get the encoded data and convert it to a DataFrame\n",
292
+ " with torch.no_grad():\n",
293
+ " encoded_data = autoencoder.encoder(data_tensor).numpy()\n",
294
+ " \n",
295
+ " df_encoded = pd.DataFrame(encoded_data, columns=[f'Enc_{i + 1}' for i in range(encoded_data.shape[1])])\n",
296
+ " \n",
297
+ " return df_encoded"
298
+ ]
299
+ },
300
+ {
301
+ "cell_type": "markdown",
302
+ "id": "6f9e4c6c",
303
+ "metadata": {
304
+ "papermill": {
305
+ "duration": 0.00949,
306
+ "end_time": "2024-12-22T02:22:49.520476",
307
+ "exception": false,
308
+ "start_time": "2024-12-22T02:22:49.510986",
309
+ "status": "completed"
310
+ },
311
+ "tags": []
312
+ },
313
+ "source": [
314
+ "# MODEL TRAINING FUNCTIONS"
315
+ ]
316
+ },
317
+ {
318
+ "cell_type": "code",
319
+ "execution_count": 5,
320
+ "id": "fca86cc4",
321
+ "metadata": {
322
+ "execution": {
323
+ "iopub.execute_input": "2024-12-22T02:22:49.535637Z",
324
+ "iopub.status.busy": "2024-12-22T02:22:49.535246Z",
325
+ "iopub.status.idle": "2024-12-22T02:22:49.540231Z",
326
+ "shell.execute_reply": "2024-12-22T02:22:49.538983Z"
327
+ },
328
+ "papermill": {
329
+ "duration": 0.015813,
330
+ "end_time": "2024-12-22T02:22:49.542828",
331
+ "exception": false,
332
+ "start_time": "2024-12-22T02:22:49.527015",
333
+ "status": "completed"
334
+ },
335
+ "tags": []
336
+ },
337
+ "outputs": [],
338
+ "source": [
339
+ "SEED = 42\n",
340
+ "n_splits = 5"
341
+ ]
342
+ },
343
+ {
344
+ "cell_type": "code",
345
+ "execution_count": 6,
346
+ "id": "9c1b3948",
347
+ "metadata": {
348
+ "execution": {
349
+ "iopub.execute_input": "2024-12-22T02:22:49.559292Z",
350
+ "iopub.status.busy": "2024-12-22T02:22:49.559028Z",
351
+ "iopub.status.idle": "2024-12-22T02:22:49.564024Z",
352
+ "shell.execute_reply": "2024-12-22T02:22:49.563175Z"
353
+ },
354
+ "papermill": {
355
+ "duration": 0.013845,
356
+ "end_time": "2024-12-22T02:22:49.565786",
357
+ "exception": false,
358
+ "start_time": "2024-12-22T02:22:49.551941",
359
+ "status": "completed"
360
+ },
361
+ "tags": []
362
+ },
363
+ "outputs": [],
364
+ "source": [
365
+ "# Define the functions that compute the kappa score\n",
366
+ "def quadratic_weighted_kappa(y_true, y_pred):\n",
367
+ " return cohen_kappa_score(y_true, y_pred, weights='quadratic')\n",
368
+ "def evaluate_predictions(thresholds, y_true, oof_non_rounded):\n",
369
+ " rounded_p = threshold_Rounder(oof_non_rounded, thresholds)\n",
370
+ " return -quadratic_weighted_kappa(y_true, rounded_p)\n",
371
+ "\n",
372
+ "# Round the predicted values to classes 0-3 using the thresholds\n",
373
+ "def threshold_Rounder(oof_non_rounded, thresholds):\n",
374
+ " return np.where(oof_non_rounded < thresholds[0], 0,\n",
375
+ " np.where(oof_non_rounded < thresholds[1], 1,\n",
376
+ " np.where(oof_non_rounded < thresholds[2], 2, 3)))"
377
+ ]
378
+ },
379
+ {
380
+ "cell_type": "markdown",
381
+ "id": "1850886e",
382
+ "metadata": {
383
+ "papermill": {
384
+ "duration": 0.009357,
385
+ "end_time": "2024-12-22T02:22:49.582352",
386
+ "exception": false,
387
+ "start_time": "2024-12-22T02:22:49.572995",
388
+ "status": "completed"
389
+ },
390
+ "tags": []
391
+ },
392
+ "source": [
393
+ "Train and evaluate the model. The core of the function is computing the Quadratic Weighted Kappa (QWK) score and optimizing it with Nelder-Mead"
394
+ ]
395
+ },
396
+ {
397
+ "cell_type": "code",
398
+ "execution_count": 7,
399
+ "id": "74880719",
400
+ "metadata": {
401
+ "execution": {
402
+ "iopub.execute_input": "2024-12-22T02:22:49.595047Z",
403
+ "iopub.status.busy": "2024-12-22T02:22:49.594782Z",
404
+ "iopub.status.idle": "2024-12-22T02:22:49.602816Z",
405
+ "shell.execute_reply": "2024-12-22T02:22:49.602128Z"
406
+ },
407
+ "papermill": {
408
+ "duration": 0.014679,
409
+ "end_time": "2024-12-22T02:22:49.604069",
410
+ "exception": false,
411
+ "start_time": "2024-12-22T02:22:49.589390",
412
+ "status": "completed"
413
+ },
414
+ "tags": []
415
+ },
416
+ "outputs": [],
417
+ "source": [
418
+ "def TrainingModel(model_class, test_data):\n",
419
+ " X = train.drop(['sii'], axis=1)\n",
420
+ " y = train['sii']\n",
421
+ "\n",
422
+ " SKF = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=SEED)\n",
423
+ " \n",
424
+ " train_S = []\n",
425
+ " test_S = []\n",
426
+ " \n",
427
+ " oof_non_rounded = np.zeros(len(y), dtype=float) \n",
428
+ " oof_rounded = np.zeros(len(y), dtype=int) \n",
429
+ " test_preds = np.zeros((len(test_data), n_splits))\n",
430
+ "\n",
431
+ " for fold, (train_idx, test_idx) in enumerate(tqdm(SKF.split(X, y), desc=\"Training Folds\", total=n_splits)):\n",
432
+ " X_train, X_val = X.iloc[train_idx], X.iloc[test_idx]\n",
433
+ " y_train, y_val = y.iloc[train_idx], y.iloc[test_idx]\n",
434
+ "\n",
435
+ " model = clone(model_class)\n",
436
+ " model.fit(X_train, y_train)\n",
437
+ "\n",
438
+ " y_train_pred = model.predict(X_train)\n",
439
+ " y_val_pred = model.predict(X_val)\n",
440
+ "\n",
441
+ " oof_non_rounded[test_idx] = y_val_pred\n",
442
+ " y_val_pred_rounded = y_val_pred.round(0).astype(int)\n",
443
+ " oof_rounded[test_idx] = y_val_pred_rounded\n",
444
+ "\n",
445
+ " train_kappa = quadratic_weighted_kappa(y_train, y_train_pred.round(0).astype(int))\n",
446
+ " val_kappa = quadratic_weighted_kappa(y_val, y_val_pred_rounded)\n",
447
+ "\n",
448
+ " train_S.append(train_kappa)\n",
449
+ " test_S.append(val_kappa)\n",
450
+ " \n",
451
+ " test_preds[:, fold] = model.predict(test_data)\n",
452
+ " \n",
453
+ " print(f\"Fold {fold+1} - Train QWK: {train_kappa:.4f}, Test QWK: {val_kappa:.4f}\")\n",
454
+ " clear_output(wait=True)\n",
455
+ "\n",
456
+ " print(f\"QWK TB train --> {np.mean(train_S):.4f}\")\n",
457
+ " print(f\"QWK TB test ---> {np.mean(test_S):.4f}\")\n",
458
+ "\n",
459
+ " KappaOPtimizer = minimize(evaluate_predictions,\n",
460
+ " x0=[0.5, 1.5, 2.5], args=(y, oof_non_rounded), \n",
461
+ " method='Nelder-Mead')\n",
462
+ " assert KappaOPtimizer.success, \"Optimization did not converge.\"\n",
463
+ " \n",
464
+ " oof_tuned = threshold_Rounder(oof_non_rounded, KappaOPtimizer.x)\n",
465
+ " tKappa = quadratic_weighted_kappa(y, oof_tuned)\n",
466
+ "\n",
467
+ " print(f\"----> || Điểm QWK đã tối ưu :: {Fore.CYAN}{Style.BRIGHT} {tKappa:.3f}{Style.RESET_ALL}\")\n",
468
+ "\n",
469
+ " tpm = test_preds.mean(axis=1)\n",
470
+ " tpTuned = threshold_Rounder(tpm, KappaOPtimizer.x)\n",
471
+ " \n",
472
+ " submission = pd.DataFrame({\n",
473
+ " 'id': sample['id'],\n",
474
+ " 'sii': tpTuned\n",
475
+ " })\n",
476
+ "\n",
477
+ " return submission"
478
+ ]
479
+ },
480
+ {
481
+ "cell_type": "markdown",
482
+ "id": "72251ff2",
483
+ "metadata": {
484
+ "papermill": {
485
+ "duration": 0.004411,
486
+ "end_time": "2024-12-22T02:22:49.613290",
487
+ "exception": false,
488
+ "start_time": "2024-12-22T02:22:49.608879",
489
+ "status": "completed"
490
+ },
491
+ "tags": []
492
+ },
493
+ "source": [
494
+ "Tune the parameters of the models used"
495
+ ]
496
+ },
497
+ {
498
+ "cell_type": "code",
499
+ "execution_count": 8,
500
+ "id": "79977149",
501
+ "metadata": {
502
+ "execution": {
503
+ "iopub.execute_input": "2024-12-22T02:22:49.624145Z",
504
+ "iopub.status.busy": "2024-12-22T02:22:49.623908Z",
505
+ "iopub.status.idle": "2024-12-22T02:22:49.628867Z",
506
+ "shell.execute_reply": "2024-12-22T02:22:49.628127Z"
507
+ },
508
+ "papermill": {
509
+ "duration": 0.011969,
510
+ "end_time": "2024-12-22T02:22:49.630218",
511
+ "exception": false,
512
+ "start_time": "2024-12-22T02:22:49.618249",
513
+ "status": "completed"
514
+ },
515
+ "tags": []
516
+ },
517
+ "outputs": [],
518
+ "source": [
519
+ "# LightGBM\n",
520
+ "Params = {\n",
521
+ " 'learning_rate': 0.046,\n",
522
+ " 'max_depth': 12,\n",
523
+ " 'num_leaves': 478,\n",
524
+ " 'min_data_in_leaf': 13,\n",
525
+ " 'feature_fraction': 0.893,\n",
526
+ " 'bagging_fraction': 0.784,\n",
527
+ " 'bagging_freq': 4,\n",
528
+ " 'lambda_l1': 10, \n",
529
+ " 'lambda_l2': 0.01, \n",
530
+ " 'random_state': SEED,\n",
531
+ " 'verbose': -1,\n",
532
+ " 'n_estimators': 300,\n",
533
+ " 'device': 'gpu'\n",
534
+ "\n",
535
+ "}\n",
536
+ "\n",
537
+ "\n",
538
+ "# XGBoost \n",
539
+ "XGB_Params = {\n",
540
+ " 'learning_rate': 0.05,\n",
541
+ " 'max_depth': 6,\n",
542
+ " 'n_estimators': 200,\n",
543
+ " 'subsample': 0.8,\n",
544
+ " 'colsample_bytree': 0.8,\n",
545
+ " 'reg_alpha': 1, \n",
546
+ " 'reg_lambda': 5, \n",
547
+ " 'random_state': SEED,\n",
548
+ " 'tree_method': 'gpu_hist',\n",
549
+ "\n",
550
+ "}\n",
551
+ "\n",
552
+ "# CatBoost\n",
553
+ "CatBoost_Params = {\n",
554
+ " 'learning_rate': 0.05,\n",
555
+ " 'depth': 6,\n",
556
+ " 'iterations': 200,\n",
557
+ " 'random_seed': SEED,\n",
558
+ " 'verbose': 0,\n",
559
+ " 'l2_leaf_reg': 10, \n",
560
+ " 'task_type': 'GPU'\n",
561
+ "\n",
562
+ "}"
563
+ ]
564
+ },
565
+ {
566
+ "cell_type": "code",
567
+ "execution_count": 9,
568
+ "id": "dd66629e",
569
+ "metadata": {
570
+ "execution": {
571
+ "iopub.execute_input": "2024-12-22T02:22:49.640960Z",
572
+ "iopub.status.busy": "2024-12-22T02:22:49.640751Z",
573
+ "iopub.status.idle": "2024-12-22T02:22:49.647155Z",
574
+ "shell.execute_reply": "2024-12-22T02:22:49.646494Z"
575
+ },
576
+ "papermill": {
577
+ "duration": 0.01302,
578
+ "end_time": "2024-12-22T02:22:49.648386",
579
+ "exception": false,
580
+ "start_time": "2024-12-22T02:22:49.635366",
581
+ "status": "completed"
582
+ },
583
+ "tags": []
584
+ },
585
+ "outputs": [],
586
+ "source": [
587
+ "Light = LGBMRegressor(**Params)\n",
588
+ "XGB_Model = XGBRegressor(**XGB_Params)\n",
589
+ "CatBoost_Model = CatBoostRegressor(**CatBoost_Params)\n",
590
+ "\n",
591
+ "voting_model = VotingRegressor(estimators=[\n",
592
+ " ('lightgbm', Light),\n",
593
+ " ('xgboost', XGB_Model),\n",
594
+ " ('catboost', CatBoost_Model)\n",
595
+ "])"
596
+ ]
597
+ },
598
+ {
599
+ "cell_type": "markdown",
600
+ "id": "b7c14723",
601
+ "metadata": {
602
+ "papermill": {
603
+ "duration": 0.004132,
604
+ "end_time": "2024-12-22T02:22:49.656866",
605
+ "exception": false,
606
+ "start_time": "2024-12-22T02:22:49.652734",
607
+ "status": "completed"
608
+ },
609
+ "tags": []
610
+ },
611
+ "source": [
612
+ "# Submission 1"
613
+ ]
614
+ },
615
+ {
616
+ "cell_type": "code",
617
+ "execution_count": 10,
618
+ "id": "db5f5da4",
619
+ "metadata": {
620
+ "execution": {
621
+ "iopub.execute_input": "2024-12-22T02:22:49.666728Z",
622
+ "iopub.status.busy": "2024-12-22T02:22:49.666480Z",
623
+ "iopub.status.idle": "2024-12-22T02:24:18.483837Z",
624
+ "shell.execute_reply": "2024-12-22T02:24:18.483067Z"
625
+ },
626
+ "papermill": {
627
+ "duration": 88.824157,
628
+ "end_time": "2024-12-22T02:24:18.485372",
629
+ "exception": false,
630
+ "start_time": "2024-12-22T02:22:49.661215",
631
+ "status": "completed"
632
+ },
633
+ "tags": []
634
+ },
635
+ "outputs": [
636
+ {
637
+ "name": "stderr",
638
+ "output_type": "stream",
639
+ "text": [
640
+ "100%|██████████| 996/996 [01:09<00:00, 14.38it/s]\n",
641
+ "100%|██████████| 2/2 [00:00<00:00, 9.94it/s]\n"
642
+ ]
643
+ },
644
+ {
645
+ "name": "stdout",
646
+ "output_type": "stream",
647
+ "text": [
648
+ "Epoch thứ [10/100], Loss = 1.6273]\n",
649
+ "Epoch thứ [20/100], Loss = 1.5442]\n",
650
+ "Epoch thứ [30/100], Loss = 1.5088]\n",
651
+ "Epoch thứ [40/100], Loss = 1.5025]\n",
652
+ "Epoch thứ [50/100], Loss = 1.5003]\n",
653
+ "Epoch thứ [60/100], Loss = 1.4989]\n",
654
+ "Epoch thứ [70/100], Loss = 1.3855]\n",
655
+ "Epoch thứ [80/100], Loss = 1.3827]\n",
656
+ "Epoch thứ [90/100], Loss = 1.3842]\n",
657
+ "Epoch thứ [100/100], Loss = 1.3826]\n",
658
+ "Epoch thứ [10/100], Loss = 1.0255]\n",
659
+ "Epoch thứ [20/100], Loss = 0.6005]\n",
660
+ "Epoch thứ [30/100], Loss = 0.4271]\n",
661
+ "Epoch thứ [40/100], Loss = 0.4271]\n",
662
+ "Epoch thứ [50/100], Loss = 0.4271]\n",
663
+ "Epoch thứ [60/100], Loss = 0.4271]\n",
664
+ "Epoch thứ [70/100], Loss = 0.4271]\n",
665
+ "Epoch thứ [80/100], Loss = 0.4271]\n",
666
+ "Epoch thứ [90/100], Loss = 0.4271]\n",
667
+ "Epoch thứ [100/100], Loss = 0.4271]\n"
668
+ ]
669
+ }
670
+ ],
671
+ "source": [
672
+ "# Read the data tables\n",
673
+ "train = pd.read_csv('/kaggle/input/child-mind-institute-problematic-internet-use/train.csv')\n",
674
+ "test = pd.read_csv('/kaggle/input/child-mind-institute-problematic-internet-use/test.csv')\n",
675
+ "sample = pd.read_csv('/kaggle/input/child-mind-institute-problematic-internet-use/sample_submission.csv')\n",
676
+ "\n",
677
+ "train_pq = load_parquet_file(\"/kaggle/input/child-mind-institute-problematic-internet-use/series_train.parquet\")\n",
678
+ "test_pq = load_parquet_file(\"/kaggle/input/child-mind-institute-problematic-internet-use/series_test.parquet\")\n",
679
+ "\n",
680
+ "# Drop the id column\n",
681
+ "df_train = train_pq.drop('id', axis=1)\n",
682
+ "df_test = test_pq.drop('id', axis=1)\n",
683
+ "\n",
684
+ "# Encode the parquet data\n",
685
+ "train_pq_encoded = perform_autoencoder(df_train, encoding_dim=60, epochs=100, batch_size=32)\n",
686
+ "test_pq_encoded = perform_autoencoder(df_test, encoding_dim=60, epochs=100, batch_size=32)\n",
687
+ "\n",
688
+ "# List of the encoded parquet columns\n",
689
+ "parquet_cols = train_pq_encoded.columns.tolist()\n",
690
+ "\n",
691
+ "# Attach the ids to the encoded data\n",
692
+ "train_pq_encoded[\"id\"]=train_pq[\"id\"]\n",
693
+ "test_pq_encoded['id']=test_pq[\"id\"]\n",
694
+ "\n",
695
+ "# Merge the encoded data into the train and test sets\n",
696
+ "train = pd.merge(train, train_pq_encoded, how=\"left\", on='id')\n",
697
+ "test = pd.merge(test, test_pq_encoded, how=\"left\", on='id')\n",
698
+ "# Fill in missing values with K-Nearest Neighbors imputation\n",
699
+ "imputer = KNNImputer(n_neighbors=5)\n",
700
+ "numeric_cols = train.select_dtypes(include=['float64', 'int64']).columns\n",
701
+ "imputed_data = imputer.fit_transform(train[numeric_cols])\n",
702
+ "train_imputed = pd.DataFrame(imputed_data, columns=numeric_cols)\n",
703
+ "train_imputed['sii'] = train_imputed['sii'].round().astype(int)\n",
704
+ "for col in train.columns:\n",
705
+ " if col not in numeric_cols:\n",
706
+ " train_imputed[col] = train[col]\n",
707
+ " \n",
708
+ "train = train_imputed\n",
709
+ "\n",
710
+ "# Run data preprocessing on the train and test sets\n",
711
+ "train = data_preprocessing(train)\n",
712
+ "test = data_preprocessing(test)\n",
713
+ "\n",
714
+ "# Drop any row with fewer than 10 non-null values\n",
715
+ "train = train.dropna(thresh=10, axis=0)\n",
716
+ "\n",
717
+ "# Drop the id column\n",
718
+ "train = train.drop('id', axis=1)\n",
719
+ "test = test.drop('id', axis=1)\n",
720
+ "\n",
721
+ "# Define the feature columns for the train and test sets\n",
722
+ "trainingCols = ['Basic_Demos-Age', 'Basic_Demos-Sex',\n",
723
+ " 'CGAS-CGAS_Score', 'Physical-BMI',\n",
724
+ " 'Physical-Height', 'Physical-Weight', 'Physical-Waist_Circumference',\n",
725
+ " 'Physical-Diastolic_BP', 'Physical-HeartRate', 'Physical-Systolic_BP',\n",
726
+ " 'Fitness_Endurance-Max_Stage',\n",
727
+ " 'Fitness_Endurance-Time_Mins', 'Fitness_Endurance-Time_Sec',\n",
728
+ " 'FGC-FGC_CU', 'FGC-FGC_CU_Zone', 'FGC-FGC_GSND',\n",
729
+ " 'FGC-FGC_GSND_Zone', 'FGC-FGC_GSD', 'FGC-FGC_GSD_Zone', 'FGC-FGC_PU',\n",
730
+ " 'FGC-FGC_PU_Zone', 'FGC-FGC_SRL', 'FGC-FGC_SRL_Zone', 'FGC-FGC_SRR',\n",
731
+ " 'FGC-FGC_SRR_Zone', 'FGC-FGC_TL', 'FGC-FGC_TL_Zone',\n",
732
+ " 'BIA-BIA_Activity_Level_num', 'BIA-BIA_BMC', 'BIA-BIA_BMI',\n",
733
+ " 'BIA-BIA_BMR', 'BIA-BIA_DEE', 'BIA-BIA_ECW', 'BIA-BIA_FFM',\n",
734
+ " 'BIA-BIA_FFMI', 'BIA-BIA_FMI', 'BIA-BIA_Fat', 'BIA-BIA_Frame_num',\n",
735
+ " 'BIA-BIA_ICW', 'BIA-BIA_LDM', 'BIA-BIA_LST', 'BIA-BIA_SMM',\n",
736
+ " 'BIA-BIA_TBW', 'PAQ_A-PAQ_A_Total',\n",
737
+ " 'PAQ_C-PAQ_C_Total', 'SDS-SDS_Total_Raw',\n",
738
+ " 'SDS-SDS_Total_T',\n",
739
+ " 'PreInt_EduHx-computerinternet_hoursday', 'sii', 'BMI_Age','Internet_Hours_Age','BMI_Internet_Hours',\n",
740
+ " 'BFP_BMI', 'FFMI_BFP', 'FMI_BFP', 'LST_TBW', 'BFP_BMR', 'BFP_DEE', 'BMR_Weight', 'DEE_Weight',\n",
741
+ " 'SMM_Height', 'Muscle_to_Fat', 'Hydration_Status', 'ICW_TBW']\n",
742
+ "testingCols = ['Basic_Demos-Age', 'Basic_Demos-Sex',\n",
743
+ " 'CGAS-CGAS_Score', 'Physical-BMI',\n",
744
+ " 'Physical-Height', 'Physical-Weight', 'Physical-Waist_Circumference',\n",
745
+ " 'Physical-Diastolic_BP', 'Physical-HeartRate', 'Physical-Systolic_BP',\n",
746
+ " 'Fitness_Endurance-Max_Stage',\n",
747
+ " 'Fitness_Endurance-Time_Mins', 'Fitness_Endurance-Time_Sec',\n",
748
+ " 'FGC-FGC_CU', 'FGC-FGC_CU_Zone', 'FGC-FGC_GSND',\n",
749
+ " 'FGC-FGC_GSND_Zone', 'FGC-FGC_GSD', 'FGC-FGC_GSD_Zone', 'FGC-FGC_PU',\n",
750
+ " 'FGC-FGC_PU_Zone', 'FGC-FGC_SRL', 'FGC-FGC_SRL_Zone', 'FGC-FGC_SRR',\n",
751
+ " 'FGC-FGC_SRR_Zone', 'FGC-FGC_TL', 'FGC-FGC_TL_Zone',\n",
752
+ " 'BIA-BIA_Activity_Level_num', 'BIA-BIA_BMC', 'BIA-BIA_BMI',\n",
753
+ " 'BIA-BIA_BMR', 'BIA-BIA_DEE', 'BIA-BIA_ECW', 'BIA-BIA_FFM',\n",
754
+ " 'BIA-BIA_FFMI', 'BIA-BIA_FMI', 'BIA-BIA_Fat', 'BIA-BIA_Frame_num',\n",
755
+ " 'BIA-BIA_ICW', 'BIA-BIA_LDM', 'BIA-BIA_LST', 'BIA-BIA_SMM',\n",
756
+ " 'BIA-BIA_TBW', 'PAQ_A-PAQ_A_Total',\n",
757
+ " 'PAQ_C-PAQ_C_Total', 'SDS-SDS_Total_Raw',\n",
758
+ " 'SDS-SDS_Total_T',\n",
759
+ " 'PreInt_EduHx-computerinternet_hoursday', 'BMI_Age','Internet_Hours_Age','BMI_Internet_Hours',\n",
760
+ " 'BFP_BMI', 'FFMI_BFP', 'FMI_BFP', 'LST_TBW', 'BFP_BMR', 'BFP_DEE', 'BMR_Weight', 'DEE_Weight',\n",
761
+ " 'SMM_Height', 'Muscle_to_Fat', 'Hydration_Status', 'ICW_TBW']\n",
762
+ "# Add the features extracted from the parquet files\n",
763
+ "trainingCols += parquet_cols\n",
764
+ "testingCols += parquet_cols\n",
765
+ "\n",
766
+ "# Restrict both datasets to the selected columns\n",
767
+ "train = train[trainingCols]\n",
768
+ "test = test[testingCols]\n",
769
+ "\n",
770
+ "# Drop rows whose sii label is missing\n",
771
+ "train = train.dropna(subset='sii')\n",
772
+ "\n",
773
+ "# Replace infinite values with NaN\n",
774
+ "if np.any(np.isinf(train)):\n",
775
+ " train = train.replace([np.inf, -np.inf], np.nan)"
776
+ ]
777
+ },
778
+ {
779
+ "cell_type": "code",
780
+ "execution_count": 11,
781
+ "id": "67b75666",
782
+ "metadata": {
783
+ "execution": {
784
+ "iopub.execute_input": "2024-12-22T02:24:18.524362Z",
785
+ "iopub.status.busy": "2024-12-22T02:24:18.523758Z",
786
+ "iopub.status.idle": "2024-12-22T02:24:52.697201Z",
787
+ "shell.execute_reply": "2024-12-22T02:24:52.696347Z"
788
+ },
789
+ "papermill": {
790
+ "duration": 34.194455,
791
+ "end_time": "2024-12-22T02:24:52.698746",
792
+ "exception": false,
793
+ "start_time": "2024-12-22T02:24:18.504291",
794
+ "status": "completed"
795
+ },
796
+ "tags": []
797
+ },
798
+ "outputs": [
799
+ {
800
+ "name": "stderr",
801
+ "output_type": "stream",
802
+ "text": [
803
+ "Training Folds: 100%|██████████| 5/5 [00:34<00:00, 6.81s/it]"
804
+ ]
805
+ },
806
+ {
807
+ "name": "stdout",
808
+ "output_type": "stream",
809
+ "text": [
810
+ "Mean train QWK --> 0.7698\n",
811
+ "Mean test QWK ---> 0.4876\n",
812
+ "----> || Optimized QWK score :: \u001b[36m\u001b[1m 0.537\u001b[0m\n"
813
+ ]
814
+ },
815
+ {
816
+ "name": "stderr",
817
+ "output_type": "stream",
818
+ "text": [
819
+ "\n"
820
+ ]
821
+ },
822
+ {
823
+ "data": {
824
+ "text/html": [
825
+ "<div>\n",
826
+ "<style scoped>\n",
827
+ " .dataframe tbody tr th:only-of-type {\n",
828
+ " vertical-align: middle;\n",
829
+ " }\n",
830
+ "\n",
831
+ " .dataframe tbody tr th {\n",
832
+ " vertical-align: top;\n",
833
+ " }\n",
834
+ "\n",
835
+ " .dataframe thead th {\n",
836
+ " text-align: right;\n",
837
+ " }\n",
838
+ "</style>\n",
839
+ "<table border=\"1\" class=\"dataframe\">\n",
840
+ " <thead>\n",
841
+ " <tr style=\"text-align: right;\">\n",
842
+ " <th></th>\n",
843
+ " <th>id</th>\n",
844
+ " <th>sii</th>\n",
845
+ " </tr>\n",
846
+ " </thead>\n",
847
+ " <tbody>\n",
848
+ " <tr>\n",
849
+ " <th>0</th>\n",
850
+ " <td>00008ff9</td>\n",
851
+ " <td>1</td>\n",
852
+ " </tr>\n",
853
+ " <tr>\n",
854
+ " <th>1</th>\n",
855
+ " <td>000fd460</td>\n",
856
+ " <td>0</td>\n",
857
+ " </tr>\n",
858
+ " <tr>\n",
859
+ " <th>2</th>\n",
860
+ " <td>00105258</td>\n",
861
+ " <td>1</td>\n",
862
+ " </tr>\n",
863
+ " <tr>\n",
864
+ " <th>3</th>\n",
865
+ " <td>00115b9f</td>\n",
866
+ " <td>0</td>\n",
867
+ " </tr>\n",
868
+ " <tr>\n",
869
+ " <th>4</th>\n",
870
+ " <td>0016bb22</td>\n",
871
+ " <td>1</td>\n",
872
+ " </tr>\n",
873
+ " <tr>\n",
874
+ " <th>5</th>\n",
875
+ " <td>001f3379</td>\n",
876
+ " <td>1</td>\n",
877
+ " </tr>\n",
878
+ " <tr>\n",
879
+ " <th>6</th>\n",
880
+ " <td>0038ba98</td>\n",
881
+ " <td>1</td>\n",
882
+ " </tr>\n",
883
+ " <tr>\n",
884
+ " <th>7</th>\n",
885
+ " <td>0068a485</td>\n",
886
+ " <td>0</td>\n",
887
+ " </tr>\n",
888
+ " <tr>\n",
889
+ " <th>8</th>\n",
890
+ " <td>0069fbed</td>\n",
891
+ " <td>1</td>\n",
892
+ " </tr>\n",
893
+ " <tr>\n",
894
+ " <th>9</th>\n",
895
+ " <td>0083e397</td>\n",
896
+ " <td>0</td>\n",
897
+ " </tr>\n",
898
+ " <tr>\n",
899
+ " <th>10</th>\n",
900
+ " <td>0087dd65</td>\n",
901
+ " <td>0</td>\n",
902
+ " </tr>\n",
903
+ " <tr>\n",
904
+ " <th>11</th>\n",
905
+ " <td>00abe655</td>\n",
906
+ " <td>0</td>\n",
907
+ " </tr>\n",
908
+ " <tr>\n",
909
+ " <th>12</th>\n",
910
+ " <td>00ae59c9</td>\n",
911
+ " <td>1</td>\n",
912
+ " </tr>\n",
913
+ " <tr>\n",
914
+ " <th>13</th>\n",
915
+ " <td>00af6387</td>\n",
916
+ " <td>1</td>\n",
917
+ " </tr>\n",
918
+ " <tr>\n",
919
+ " <th>14</th>\n",
920
+ " <td>00bd4359</td>\n",
921
+ " <td>1</td>\n",
922
+ " </tr>\n",
923
+ " <tr>\n",
924
+ " <th>15</th>\n",
925
+ " <td>00c0cd71</td>\n",
926
+ " <td>1</td>\n",
927
+ " </tr>\n",
928
+ " <tr>\n",
929
+ " <th>16</th>\n",
930
+ " <td>00d56d4b</td>\n",
931
+ " <td>0</td>\n",
932
+ " </tr>\n",
933
+ " <tr>\n",
934
+ " <th>17</th>\n",
935
+ " <td>00d9913d</td>\n",
936
+ " <td>1</td>\n",
937
+ " </tr>\n",
938
+ " <tr>\n",
939
+ " <th>18</th>\n",
940
+ " <td>00e6167c</td>\n",
941
+ " <td>0</td>\n",
942
+ " </tr>\n",
943
+ " <tr>\n",
944
+ " <th>19</th>\n",
945
+ " <td>00ebc35d</td>\n",
946
+ " <td>1</td>\n",
947
+ " </tr>\n",
948
+ " </tbody>\n",
949
+ "</table>\n",
950
+ "</div>"
951
+ ],
952
+ "text/plain": [
953
+ " id sii\n",
954
+ "0 00008ff9 1\n",
955
+ "1 000fd460 0\n",
956
+ "2 00105258 1\n",
957
+ "3 00115b9f 0\n",
958
+ "4 0016bb22 1\n",
959
+ "5 001f3379 1\n",
960
+ "6 0038ba98 1\n",
961
+ "7 0068a485 0\n",
962
+ "8 0069fbed 1\n",
963
+ "9 0083e397 0\n",
964
+ "10 0087dd65 0\n",
965
+ "11 00abe655 0\n",
966
+ "12 00ae59c9 1\n",
967
+ "13 00af6387 1\n",
968
+ "14 00bd4359 1\n",
969
+ "15 00c0cd71 1\n",
970
+ "16 00d56d4b 0\n",
971
+ "17 00d9913d 1\n",
972
+ "18 00e6167c 0\n",
973
+ "19 00ebc35d 1"
974
+ ]
975
+ },
976
+ "execution_count": 11,
977
+ "metadata": {},
978
+ "output_type": "execute_result"
979
+ }
980
+ ],
981
+ "source": [
982
+ "Submission1 = TrainingModel(voting_model, test)\n",
983
+ "\n",
984
+ "Submission1"
985
+ ]
986
+ },
987
+ {
988
+ "cell_type": "markdown",
989
+ "id": "20076c4d",
990
+ "metadata": {
991
+ "papermill": {
992
+ "duration": 0.018445,
993
+ "end_time": "2024-12-22T02:24:52.736989",
994
+ "exception": false,
995
+ "start_time": "2024-12-22T02:24:52.718544",
996
+ "status": "completed"
997
+ },
998
+ "tags": []
999
+ },
1000
+ "source": [
1001
+ "# Submission 2"
1002
+ ]
1003
+ },
1004
+ {
1005
+ "cell_type": "code",
1006
+ "execution_count": 12,
1007
+ "id": "6e8a8410",
1008
+ "metadata": {
1009
+ "execution": {
1010
+ "iopub.execute_input": "2024-12-22T02:24:52.774560Z",
1011
+ "iopub.status.busy": "2024-12-22T02:24:52.774283Z",
1012
+ "iopub.status.idle": "2024-12-22T02:26:00.818617Z",
1013
+ "shell.execute_reply": "2024-12-22T02:26:00.817729Z"
1014
+ },
1015
+ "papermill": {
1016
+ "duration": 68.064588,
1017
+ "end_time": "2024-12-22T02:26:00.819911",
1018
+ "exception": false,
1019
+ "start_time": "2024-12-22T02:24:52.755323",
1020
+ "status": "completed"
1021
+ },
1022
+ "tags": []
1023
+ },
1024
+ "outputs": [
1025
+ {
1026
+ "name": "stderr",
1027
+ "output_type": "stream",
1028
+ "text": [
1029
+ "100%|██████████| 996/996 [01:07<00:00, 14.70it/s]\n",
1030
+ "100%|██████████| 2/2 [00:00<00:00, 12.87it/s]\n"
1031
+ ]
1032
+ }
1033
+ ],
1034
+ "source": [
1035
+ "train = pd.read_csv('/kaggle/input/child-mind-institute-problematic-internet-use/train.csv')\n",
1036
+ "test = pd.read_csv('/kaggle/input/child-mind-institute-problematic-internet-use/test.csv')\n",
1037
+ "sample = pd.read_csv('/kaggle/input/child-mind-institute-problematic-internet-use/sample_submission.csv')\n",
1038
+ "train_pq = load_parquet_file(\"/kaggle/input/child-mind-institute-problematic-internet-use/series_train.parquet\")\n",
1039
+ "test_pq = load_parquet_file(\"/kaggle/input/child-mind-institute-problematic-internet-use/series_test.parquet\")\n",
1040
+ "\n",
1041
+ "pq_cols = train_pq.columns.tolist()\n",
1042
+ "pq_cols.remove(\"id\")\n",
1043
+ "\n",
1044
+ "train = pd.merge(train, train_pq, how=\"left\", on='id')\n",
1045
+ "test = pd.merge(test, test_pq, how=\"left\", on='id')\n",
1046
+ "\n",
1047
+ "train = train.drop('id', axis=1)\n",
1048
+ "test = test.drop('id', axis=1) "
1049
+ ]
1050
+ },
1051
+ {
1052
+ "cell_type": "code",
1053
+ "execution_count": 13,
1054
+ "id": "5d148efe",
1055
+ "metadata": {
1056
+ "execution": {
1057
+ "iopub.execute_input": "2024-12-22T02:26:00.887973Z",
1058
+ "iopub.status.busy": "2024-12-22T02:26:00.887683Z",
1059
+ "iopub.status.idle": "2024-12-22T02:26:00.896872Z",
1060
+ "shell.execute_reply": "2024-12-22T02:26:00.896194Z"
1061
+ },
1062
+ "papermill": {
1063
+ "duration": 0.044616,
1064
+ "end_time": "2024-12-22T02:26:00.898182",
1065
+ "exception": false,
1066
+ "start_time": "2024-12-22T02:26:00.853566",
1067
+ "status": "completed"
1068
+ },
1069
+ "tags": []
1070
+ },
1071
+ "outputs": [],
1072
+ "source": [
1073
+ "trainingCols = ['Basic_Demos-Enroll_Season', 'Basic_Demos-Age', 'Basic_Demos-Sex',\n",
1074
+ " 'CGAS-Season', 'CGAS-CGAS_Score', 'Physical-Season', 'Physical-BMI',\n",
1075
+ " 'Physical-Height', 'Physical-Weight', 'Physical-Waist_Circumference',\n",
1076
+ " 'Physical-Diastolic_BP', 'Physical-HeartRate', 'Physical-Systolic_BP',\n",
1077
+ " 'Fitness_Endurance-Season', 'Fitness_Endurance-Max_Stage',\n",
1078
+ " 'Fitness_Endurance-Time_Mins', 'Fitness_Endurance-Time_Sec',\n",
1079
+ " 'FGC-Season', 'FGC-FGC_CU', 'FGC-FGC_CU_Zone', 'FGC-FGC_GSND',\n",
1080
+ " 'FGC-FGC_GSND_Zone', 'FGC-FGC_GSD', 'FGC-FGC_GSD_Zone', 'FGC-FGC_PU',\n",
1081
+ " 'FGC-FGC_PU_Zone', 'FGC-FGC_SRL', 'FGC-FGC_SRL_Zone', 'FGC-FGC_SRR',\n",
1082
+ " 'FGC-FGC_SRR_Zone', 'FGC-FGC_TL', 'FGC-FGC_TL_Zone', 'BIA-Season',\n",
1083
+ " 'BIA-BIA_Activity_Level_num', 'BIA-BIA_BMC', 'BIA-BIA_BMI',\n",
1084
+ " 'BIA-BIA_BMR', 'BIA-BIA_DEE', 'BIA-BIA_ECW', 'BIA-BIA_FFM',\n",
1085
+ " 'BIA-BIA_FFMI', 'BIA-BIA_FMI', 'BIA-BIA_Fat', 'BIA-BIA_Frame_num',\n",
1086
+ " 'BIA-BIA_ICW', 'BIA-BIA_LDM', 'BIA-BIA_LST', 'BIA-BIA_SMM',\n",
1087
+ " 'BIA-BIA_TBW', 'PAQ_A-Season', 'PAQ_A-PAQ_A_Total', 'PAQ_C-Season',\n",
1088
+ " 'PAQ_C-PAQ_C_Total', 'SDS-Season', 'SDS-SDS_Total_Raw',\n",
1089
+ " 'SDS-SDS_Total_T', 'PreInt_EduHx-Season',\n",
1090
+ " 'PreInt_EduHx-computerinternet_hoursday', 'sii']\n",
1091
+ "\n",
1092
+ "trainingCols += pq_cols\n",
1093
+ "train = train[trainingCols]\n",
1094
+ "train = train.dropna(subset='sii')"
1095
+ ]
1096
+ },
1097
+ {
1098
+ "cell_type": "markdown",
1099
+ "id": "e4f4cffd",
1100
+ "metadata": {
1101
+ "papermill": {
1102
+ "duration": 0.032505,
1103
+ "end_time": "2024-12-22T02:26:00.964002",
1104
+ "exception": false,
1105
+ "start_time": "2024-12-22T02:26:00.931497",
1106
+ "status": "completed"
1107
+ },
1108
+ "tags": []
1109
+ },
1110
+ "source": [
1111
+ "Handle the categorical columns"
1112
+ ]
1113
+ },
1114
+ {
1115
+ "cell_type": "code",
1116
+ "execution_count": 14,
1117
+ "id": "3f4acdf8",
1118
+ "metadata": {
1119
+ "execution": {
1120
+ "iopub.execute_input": "2024-12-22T02:26:01.030711Z",
1121
+ "iopub.status.busy": "2024-12-22T02:26:01.030343Z",
1122
+ "iopub.status.idle": "2024-12-22T02:26:01.084012Z",
1123
+ "shell.execute_reply": "2024-12-22T02:26:01.083264Z"
1124
+ },
1125
+ "papermill": {
1126
+ "duration": 0.088597,
1127
+ "end_time": "2024-12-22T02:26:01.085367",
1128
+ "exception": false,
1129
+ "start_time": "2024-12-22T02:26:00.996770",
1130
+ "status": "completed"
1131
+ },
1132
+ "tags": []
1133
+ },
1134
+ "outputs": [],
1135
+ "source": [
1136
+ "categoryFeatures = ['Basic_Demos-Enroll_Season', 'CGAS-Season', 'Physical-Season', \n",
1137
+ " 'Fitness_Endurance-Season', 'FGC-Season', 'BIA-Season', \n",
1138
+ " 'PAQ_A-Season', 'PAQ_C-Season', 'SDS-Season', 'PreInt_EduHx-Season']\n",
1139
+ "\n",
1140
+ "def update(df):\n",
1141
+ " global categoryFeatures\n",
1142
+ " for c in categoryFeatures: \n",
1143
+ " df[c] = df[c].fillna('Missing')\n",
1144
+ " df[c] = df[c].astype('category')\n",
1145
+ " return df\n",
1146
+ " \n",
1147
+ "train = update(train)\n",
1148
+ "test = update(test)\n",
1149
+ "\n",
1150
+ "# Helper that maps each distinct value of a column to an integer code\n",
1151
+ "def create_mapping(column, dataset):\n",
1152
+ " unique_values = dataset[column].unique()\n",
1153
+ " return {value: idx for idx, value in enumerate(unique_values)}\n",
1154
+ "\n",
1155
+ "for col in categoryFeatures:\n",
1155
+ "    # Build one mapping from both splits so train and test share the same integer codes\n",
1156
+ "    mapping = create_mapping(col, pd.concat([train[[col]], test[[col]]]))\n",
1157
+ "    \n",
1158
+ "    train[col] = train[col].astype(object).map(mapping).astype(int)\n",
1159
+ "    test[col] = test[col].astype(object).map(mapping).astype(int)"
1161
+ ]
1162
+ },
1163
+ {
1164
+ "cell_type": "code",
1165
+ "execution_count": 15,
1166
+ "id": "10b08a36",
1167
+ "metadata": {
1168
+ "execution": {
1169
+ "iopub.execute_input": "2024-12-22T02:26:01.151607Z",
1170
+ "iopub.status.busy": "2024-12-22T02:26:01.151297Z",
1171
+ "iopub.status.idle": "2024-12-22T02:26:14.923765Z",
1172
+ "shell.execute_reply": "2024-12-22T02:26:14.922695Z"
1173
+ },
1174
+ "papermill": {
1175
+ "duration": 13.807665,
1176
+ "end_time": "2024-12-22T02:26:14.925912",
1177
+ "exception": false,
1178
+ "start_time": "2024-12-22T02:26:01.118247",
1179
+ "status": "completed"
1180
+ },
1181
+ "tags": []
1182
+ },
1183
+ "outputs": [
1184
+ {
1185
+ "name": "stderr",
1186
+ "output_type": "stream",
1187
+ "text": [
1188
+ "Training Folds: 100%|██████████| 5/5 [00:13<00:00, 2.70s/it]"
1189
+ ]
1190
+ },
1191
+ {
1192
+ "name": "stdout",
1193
+ "output_type": "stream",
1194
+ "text": [
1195
+ "Mean train QWK --> 0.7259\n",
1196
+ "Mean test QWK ---> 0.3804\n"
1197
+ ]
1198
+ },
1199
+ {
1200
+ "name": "stderr",
1201
+ "output_type": "stream",
1202
+ "text": [
1203
+ "\n"
1204
+ ]
1205
+ },
1206
+ {
1207
+ "name": "stdout",
1208
+ "output_type": "stream",
1209
+ "text": [
1210
+ "----> || Optimized QWK score :: \u001b[36m\u001b[1m 0.464\u001b[0m\n"
1211
+ ]
1212
+ },
1213
+ {
1214
+ "data": {
1215
+ "text/html": [
1216
+ "<div>\n",
1217
+ "<style scoped>\n",
1218
+ " .dataframe tbody tr th:only-of-type {\n",
1219
+ " vertical-align: middle;\n",
1220
+ " }\n",
1221
+ "\n",
1222
+ " .dataframe tbody tr th {\n",
1223
+ " vertical-align: top;\n",
1224
+ " }\n",
1225
+ "\n",
1226
+ " .dataframe thead th {\n",
1227
+ " text-align: right;\n",
1228
+ " }\n",
1229
+ "</style>\n",
1230
+ "<table border=\"1\" class=\"dataframe\">\n",
1231
+ " <thead>\n",
1232
+ " <tr style=\"text-align: right;\">\n",
1233
+ " <th></th>\n",
1234
+ " <th>id</th>\n",
1235
+ " <th>sii</th>\n",
1236
+ " </tr>\n",
1237
+ " </thead>\n",
1238
+ " <tbody>\n",
1239
+ " <tr>\n",
1240
+ " <th>0</th>\n",
1241
+ " <td>00008ff9</td>\n",
1242
+ " <td>1</td>\n",
1243
+ " </tr>\n",
1244
+ " <tr>\n",
1245
+ " <th>1</th>\n",
1246
+ " <td>000fd460</td>\n",
1247
+ " <td>0</td>\n",
1248
+ " </tr>\n",
1249
+ " <tr>\n",
1250
+ " <th>2</th>\n",
1251
+ " <td>00105258</td>\n",
1252
+ " <td>0</td>\n",
1253
+ " </tr>\n",
1254
+ " <tr>\n",
1255
+ " <th>3</th>\n",
1256
+ " <td>00115b9f</td>\n",
1257
+ " <td>0</td>\n",
1258
+ " </tr>\n",
1259
+ " <tr>\n",
1260
+ " <th>4</th>\n",
1261
+ " <td>0016bb22</td>\n",
1262
+ " <td>0</td>\n",
1263
+ " </tr>\n",
1264
+ " <tr>\n",
1265
+ " <th>5</th>\n",
1266
+ " <td>001f3379</td>\n",
1267
+ " <td>1</td>\n",
1268
+ " </tr>\n",
1269
+ " <tr>\n",
1270
+ " <th>6</th>\n",
1271
+ " <td>0038ba98</td>\n",
1272
+ " <td>0</td>\n",
1273
+ " </tr>\n",
1274
+ " <tr>\n",
1275
+ " <th>7</th>\n",
1276
+ " <td>0068a485</td>\n",
1277
+ " <td>0</td>\n",
1278
+ " </tr>\n",
1279
+ " <tr>\n",
1280
+ " <th>8</th>\n",
1281
+ " <td>0069fbed</td>\n",
1282
+ " <td>1</td>\n",
1283
+ " </tr>\n",
1284
+ " <tr>\n",
1285
+ " <th>9</th>\n",
1286
+ " <td>0083e397</td>\n",
1287
+ " <td>0</td>\n",
1288
+ " </tr>\n",
1289
+ " <tr>\n",
1290
+ " <th>10</th>\n",
1291
+ " <td>0087dd65</td>\n",
1292
+ " <td>0</td>\n",
1293
+ " </tr>\n",
1294
+ " <tr>\n",
1295
+ " <th>11</th>\n",
1296
+ " <td>00abe655</td>\n",
1297
+ " <td>0</td>\n",
1298
+ " </tr>\n",
1299
+ " <tr>\n",
1300
+ " <th>12</th>\n",
1301
+ " <td>00ae59c9</td>\n",
1302
+ " <td>1</td>\n",
1303
+ " </tr>\n",
1304
+ " <tr>\n",
1305
+ " <th>13</th>\n",
1306
+ " <td>00af6387</td>\n",
1307
+ " <td>1</td>\n",
1308
+ " </tr>\n",
1309
+ " <tr>\n",
1310
+ " <th>14</th>\n",
1311
+ " <td>00bd4359</td>\n",
1312
+ " <td>1</td>\n",
1313
+ " </tr>\n",
1314
+ " <tr>\n",
1315
+ " <th>15</th>\n",
1316
+ " <td>00c0cd71</td>\n",
1317
+ " <td>1</td>\n",
1318
+ " </tr>\n",
1319
+ " <tr>\n",
1320
+ " <th>16</th>\n",
1321
+ " <td>00d56d4b</td>\n",
1322
+ " <td>0</td>\n",
1323
+ " </tr>\n",
1324
+ " <tr>\n",
1325
+ " <th>17</th>\n",
1326
+ " <td>00d9913d</td>\n",
1327
+ " <td>0</td>\n",
1328
+ " </tr>\n",
1329
+ " <tr>\n",
1330
+ " <th>18</th>\n",
1331
+ " <td>00e6167c</td>\n",
1332
+ " <td>0</td>\n",
1333
+ " </tr>\n",
1334
+ " <tr>\n",
1335
+ " <th>19</th>\n",
1336
+ " <td>00ebc35d</td>\n",
1337
+ " <td>0</td>\n",
1338
+ " </tr>\n",
1339
+ " </tbody>\n",
1340
+ "</table>\n",
1341
+ "</div>"
1342
+ ],
1343
+ "text/plain": [
1344
+ " id sii\n",
1345
+ "0 00008ff9 1\n",
1346
+ "1 000fd460 0\n",
1347
+ "2 00105258 0\n",
1348
+ "3 00115b9f 0\n",
1349
+ "4 0016bb22 0\n",
1350
+ "5 001f3379 1\n",
1351
+ "6 0038ba98 0\n",
1352
+ "7 0068a485 0\n",
1353
+ "8 0069fbed 1\n",
1354
+ "9 0083e397 0\n",
1355
+ "10 0087dd65 0\n",
1356
+ "11 00abe655 0\n",
1357
+ "12 00ae59c9 1\n",
1358
+ "13 00af6387 1\n",
1359
+ "14 00bd4359 1\n",
1360
+ "15 00c0cd71 1\n",
1361
+ "16 00d56d4b 0\n",
1362
+ "17 00d9913d 0\n",
1363
+ "18 00e6167c 0\n",
1364
+ "19 00ebc35d 0"
1365
+ ]
1366
+ },
1367
+ "execution_count": 15,
1368
+ "metadata": {},
1369
+ "output_type": "execute_result"
1370
+ }
1371
+ ],
1372
+ "source": [
1373
+ "Submission2 = TrainingModel(voting_model, test)\n",
1374
+ "\n",
1375
+ "Submission2"
1376
+ ]
1377
+ },
1378
+ {
1379
+ "cell_type": "markdown",
1380
+ "id": "f34b237c",
1381
+ "metadata": {
1382
+ "papermill": {
1383
+ "duration": 0.033669,
1384
+ "end_time": "2024-12-22T02:26:15.009998",
1385
+ "exception": false,
1386
+ "start_time": "2024-12-22T02:26:14.976329",
1387
+ "status": "completed"
1388
+ },
1389
+ "tags": []
1390
+ },
1391
+ "source": [
1392
+ "# Submission 3"
1393
+ ]
1394
+ },
1395
+ {
1396
+ "cell_type": "code",
1397
+ "execution_count": 16,
1398
+ "id": "a57cc9b5",
1399
+ "metadata": {
1400
+ "execution": {
1401
+ "iopub.execute_input": "2024-12-22T02:26:15.078216Z",
1402
+ "iopub.status.busy": "2024-12-22T02:26:15.077888Z",
1403
+ "iopub.status.idle": "2024-12-22T02:28:14.547425Z",
1404
+ "shell.execute_reply": "2024-12-22T02:28:14.546584Z"
1405
+ },
1406
+ "papermill": {
1407
+ "duration": 119.505453,
1408
+ "end_time": "2024-12-22T02:28:14.548971",
1409
+ "exception": false,
1410
+ "start_time": "2024-12-22T02:26:15.043518",
1411
+ "status": "completed"
1412
+ },
1413
+ "tags": []
1414
+ },
1415
+ "outputs": [
1416
+ {
1417
+ "name": "stderr",
1418
+ "output_type": "stream",
1419
+ "text": [
1420
+ "Training Folds: 100%|██████████| 5/5 [01:59<00:00, 23.85s/it]"
1421
+ ]
1422
+ },
1423
+ {
1424
+ "name": "stdout",
1425
+ "output_type": "stream",
1426
+ "text": [
1427
+ "Mean train QWK --> 0.9175\n",
1428
+ "Mean test QWK ---> 0.3803\n"
1429
+ ]
1430
+ },
1431
+ {
1432
+ "name": "stderr",
1433
+ "output_type": "stream",
1434
+ "text": [
1435
+ "\n"
1436
+ ]
1437
+ },
1438
+ {
1439
+ "name": "stdout",
1440
+ "output_type": "stream",
1441
+ "text": [
1442
+ "----> || Optimized QWK score :: \u001b[36m\u001b[1m 0.450\u001b[0m\n"
1443
+ ]
1444
+ },
1445
+ {
1446
+ "data": {
1447
+ "text/html": [
1448
+ "<div>\n",
1449
+ "<style scoped>\n",
1450
+ " .dataframe tbody tr th:only-of-type {\n",
1451
+ " vertical-align: middle;\n",
1452
+ " }\n",
1453
+ "\n",
1454
+ " .dataframe tbody tr th {\n",
1455
+ " vertical-align: top;\n",
1456
+ " }\n",
1457
+ "\n",
1458
+ " .dataframe thead th {\n",
1459
+ " text-align: right;\n",
1460
+ " }\n",
1461
+ "</style>\n",
1462
+ "<table border=\"1\" class=\"dataframe\">\n",
1463
+ " <thead>\n",
1464
+ " <tr style=\"text-align: right;\">\n",
1465
+ " <th></th>\n",
1466
+ " <th>id</th>\n",
1467
+ " <th>sii</th>\n",
1468
+ " </tr>\n",
1469
+ " </thead>\n",
1470
+ " <tbody>\n",
1471
+ " <tr>\n",
1472
+ " <th>0</th>\n",
1473
+ " <td>00008ff9</td>\n",
1474
+ " <td>2</td>\n",
1475
+ " </tr>\n",
1476
+ " <tr>\n",
1477
+ " <th>1</th>\n",
1478
+ " <td>000fd460</td>\n",
1479
+ " <td>0</td>\n",
1480
+ " </tr>\n",
1481
+ " <tr>\n",
1482
+ " <th>2</th>\n",
1483
+ " <td>00105258</td>\n",
1484
+ " <td>0</td>\n",
1485
+ " </tr>\n",
1486
+ " <tr>\n",
1487
+ " <th>3</th>\n",
1488
+ " <td>00115b9f</td>\n",
1489
+ " <td>0</td>\n",
1490
+ " </tr>\n",
1491
+ " <tr>\n",
1492
+ " <th>4</th>\n",
1493
+ " <td>0016bb22</td>\n",
1494
+ " <td>1</td>\n",
1495
+ " </tr>\n",
1496
+ " <tr>\n",
1497
+ " <th>5</th>\n",
1498
+ " <td>001f3379</td>\n",
1499
+ " <td>1</td>\n",
1500
+ " </tr>\n",
1501
+ " <tr>\n",
1502
+ " <th>6</th>\n",
1503
+ " <td>0038ba98</td>\n",
1504
+ " <td>0</td>\n",
1505
+ " </tr>\n",
1506
+ " <tr>\n",
1507
+ " <th>7</th>\n",
1508
+ " <td>0068a485</td>\n",
1509
+ " <td>0</td>\n",
1510
+ " </tr>\n",
1511
+ " <tr>\n",
1512
+ " <th>8</th>\n",
1513
+ " <td>0069fbed</td>\n",
1514
+ " <td>2</td>\n",
1515
+ " </tr>\n",
1516
+ " <tr>\n",
1517
+ " <th>9</th>\n",
1518
+ " <td>0083e397</td>\n",
1519
+ " <td>0</td>\n",
1520
+ " </tr>\n",
1521
+ " <tr>\n",
1522
+ " <th>10</th>\n",
1523
+ " <td>0087dd65</td>\n",
1524
+ " <td>1</td>\n",
1525
+ " </tr>\n",
1526
+ " <tr>\n",
1527
+ " <th>11</th>\n",
1528
+ " <td>00abe655</td>\n",
1529
+ " <td>0</td>\n",
1530
+ " </tr>\n",
1531
+ " <tr>\n",
1532
+ " <th>12</th>\n",
1533
+ " <td>00ae59c9</td>\n",
1534
+ " <td>2</td>\n",
1535
+ " </tr>\n",
1536
+ " <tr>\n",
1537
+ " <th>13</th>\n",
1538
+ " <td>00af6387</td>\n",
1539
+ " <td>1</td>\n",
1540
+ " </tr>\n",
1541
+ " <tr>\n",
1542
+ " <th>14</th>\n",
1543
+ " <td>00bd4359</td>\n",
1544
+ " <td>2</td>\n",
1545
+ " </tr>\n",
1546
+ " <tr>\n",
1547
+ " <th>15</th>\n",
1548
+ " <td>00c0cd71</td>\n",
1549
+ " <td>2</td>\n",
1550
+ " </tr>\n",
1551
+ " <tr>\n",
1552
+ " <th>16</th>\n",
1553
+ " <td>00d56d4b</td>\n",
1554
+ " <td>0</td>\n",
1555
+ " </tr>\n",
1556
+ " <tr>\n",
1557
+ " <th>17</th>\n",
1558
+ " <td>00d9913d</td>\n",
1559
+ " <td>0</td>\n",
1560
+ " </tr>\n",
1561
+ " <tr>\n",
1562
+ " <th>18</th>\n",
1563
+ " <td>00e6167c</td>\n",
1564
+ " <td>0</td>\n",
1565
+ " </tr>\n",
1566
+ " <tr>\n",
1567
+ " <th>19</th>\n",
1568
+ " <td>00ebc35d</td>\n",
1569
+ " <td>1</td>\n",
1570
+ " </tr>\n",
1571
+ " </tbody>\n",
1572
+ "</table>\n",
1573
+ "</div>"
1574
+ ],
1575
+ "text/plain": [
1576
+ " id sii\n",
1577
+ "0 00008ff9 2\n",
1578
+ "1 000fd460 0\n",
1579
+ "2 00105258 0\n",
1580
+ "3 00115b9f 0\n",
1581
+ "4 0016bb22 1\n",
1582
+ "5 001f3379 1\n",
1583
+ "6 0038ba98 0\n",
1584
+ "7 0068a485 0\n",
1585
+ "8 0069fbed 2\n",
1586
+ "9 0083e397 0\n",
1587
+ "10 0087dd65 1\n",
1588
+ "11 00abe655 0\n",
1589
+ "12 00ae59c9 2\n",
1590
+ "13 00af6387 1\n",
1591
+ "14 00bd4359 2\n",
1592
+ "15 00c0cd71 2\n",
1593
+ "16 00d56d4b 0\n",
1594
+ "17 00d9913d 0\n",
1595
+ "18 00e6167c 0\n",
1596
+ "19 00ebc35d 1"
1597
+ ]
1598
+ },
1599
+ "execution_count": 16,
1600
+ "metadata": {},
1601
+ "output_type": "execute_result"
1602
+ }
1603
+ ],
1604
+ "source": [
1605
+ "imputer = SimpleImputer(strategy='median')\n",
1606
+ "\n",
1607
+ "ensemble = VotingRegressor(estimators=[\n",
1608
+ " ('lgb', Pipeline(steps=[('imputer', imputer), ('regressor', LGBMRegressor(random_state=SEED))])),\n",
1609
+ " ('xgb', Pipeline(steps=[('imputer', imputer), ('regressor', XGBRegressor(random_state=SEED))])),\n",
1610
+ " ('cat', Pipeline(steps=[('imputer', imputer), ('regressor', CatBoostRegressor(random_state=SEED, silent=True))])),\n",
1611
+ " ('rf', Pipeline(steps=[('imputer', imputer), ('regressor', RandomForestRegressor(random_state=SEED))])),\n",
1612
+ " ('gb', Pipeline(steps=[('imputer', imputer), ('regressor', GradientBoostingRegressor(random_state=SEED))]))\n",
1613
+ "])\n",
1614
+ "\n",
1615
+ "Submission3 = TrainingModel(ensemble, test)\n",
1616
+ "\n",
1617
+ "Submission3"
1618
+ ]
1619
+ },
1620
+ {
1621
+ "cell_type": "code",
1622
+ "execution_count": 17,
1623
+ "id": "03321cf9",
1624
+ "metadata": {
1625
+ "execution": {
1626
+ "iopub.execute_input": "2024-12-22T02:28:14.620213Z",
1627
+ "iopub.status.busy": "2024-12-22T02:28:14.619956Z",
1628
+ "iopub.status.idle": "2024-12-22T02:28:14.636832Z",
1629
+ "shell.execute_reply": "2024-12-22T02:28:14.635901Z"
1630
+ },
1631
+ "papermill": {
1632
+ "duration": 0.052844,
1633
+ "end_time": "2024-12-22T02:28:14.638109",
1634
+ "exception": false,
1635
+ "start_time": "2024-12-22T02:28:14.585265",
1636
+ "status": "completed"
1637
+ },
1638
+ "tags": []
1639
+ },
1640
+ "outputs": [
1641
+ {
1642
+ "name": "stdout",
1643
+ "output_type": "stream",
1644
+ "text": [
1645
+ "Majority voting completed and saved to 'Final_Submission.csv'\n"
1646
+ ]
1647
+ }
1648
+ ],
1649
+ "source": [
1650
+ "sub1 = Submission1\n",
1651
+ "sub2 = Submission2\n",
1652
+ "sub3 = Submission3\n",
1653
+ "\n",
1654
+ "sub1 = sub1.sort_values(by='id').reset_index(drop=True)\n",
1655
+ "sub2 = sub2.sort_values(by='id').reset_index(drop=True)\n",
1656
+ "sub3 = sub3.sort_values(by='id').reset_index(drop=True)\n",
1657
+ "\n",
1658
+ "combined = pd.DataFrame({\n",
1659
+ " 'id': sub1['id'],\n",
1660
+ " 'sii_1': sub1['sii'],\n",
1661
+ " 'sii_2': sub2['sii'],\n",
1662
+ " 'sii_3': sub3['sii']\n",
1663
+ "})\n",
1664
+ "\n",
1665
+ "def majority_vote(row):\n",
1666
+ " return row.mode()[0]\n",
1667
+ "\n",
1668
+ "combined['final_sii'] = combined[['sii_1', 'sii_2', 'sii_3']].apply(majority_vote, axis=1)\n",
1669
+ "\n",
1670
+ "final_submission = combined[['id', 'final_sii']].rename(columns={'final_sii': 'sii'})\n",
1671
+ "\n",
1672
+ "final_submission.to_csv('submission.csv', index=False)\n",
1673
+ "\n",
1674
+ "print(\"Majority voting completed and saved to 'submission.csv'\")"
1675
+ ]
1676
+ },
1677
+ {
1678
+ "cell_type": "code",
1679
+ "execution_count": 18,
1680
+ "id": "d403b134",
1681
+ "metadata": {
1682
+ "execution": {
1683
+ "iopub.execute_input": "2024-12-22T02:28:14.707888Z",
1684
+ "iopub.status.busy": "2024-12-22T02:28:14.707573Z",
1685
+ "iopub.status.idle": "2024-12-22T02:28:14.715425Z",
1686
+ "shell.execute_reply": "2024-12-22T02:28:14.714480Z"
1687
+ },
1688
+ "papermill": {
1689
+ "duration": 0.043866,
1690
+ "end_time": "2024-12-22T02:28:14.716800",
1691
+ "exception": false,
1692
+ "start_time": "2024-12-22T02:28:14.672934",
1693
+ "status": "completed"
1694
+ },
1695
+ "tags": []
1696
+ },
1697
+ "outputs": [
1698
+ {
1699
+ "data": {
1700
+ "text/html": [
1701
+ "<div>\n",
1702
+ "<style scoped>\n",
1703
+ " .dataframe tbody tr th:only-of-type {\n",
1704
+ " vertical-align: middle;\n",
1705
+ " }\n",
1706
+ "\n",
1707
+ " .dataframe tbody tr th {\n",
1708
+ " vertical-align: top;\n",
1709
+ " }\n",
1710
+ "\n",
1711
+ " .dataframe thead th {\n",
1712
+ " text-align: right;\n",
1713
+ " }\n",
1714
+ "</style>\n",
1715
+ "<table border=\"1\" class=\"dataframe\">\n",
1716
+ " <thead>\n",
1717
+ " <tr style=\"text-align: right;\">\n",
1718
+ " <th></th>\n",
1719
+ " <th>id</th>\n",
1720
+ " <th>sii</th>\n",
1721
+ " </tr>\n",
1722
+ " </thead>\n",
1723
+ " <tbody>\n",
1724
+ " <tr>\n",
1725
+ " <th>0</th>\n",
1726
+ " <td>00008ff9</td>\n",
1727
+ " <td>1</td>\n",
1728
+ " </tr>\n",
1729
+ " <tr>\n",
1730
+ " <th>1</th>\n",
1731
+ " <td>000fd460</td>\n",
1732
+ " <td>0</td>\n",
1733
+ " </tr>\n",
1734
+ " <tr>\n",
1735
+ " <th>2</th>\n",
1736
+ " <td>00105258</td>\n",
1737
+ " <td>0</td>\n",
1738
+ " </tr>\n",
1739
+ " <tr>\n",
1740
+ " <th>3</th>\n",
1741
+ " <td>00115b9f</td>\n",
1742
+ " <td>0</td>\n",
1743
+ " </tr>\n",
1744
+ " <tr>\n",
1745
+ " <th>4</th>\n",
1746
+ " <td>0016bb22</td>\n",
1747
+ " <td>1</td>\n",
1748
+ " </tr>\n",
1749
+ " <tr>\n",
1750
+ " <th>5</th>\n",
1751
+ " <td>001f3379</td>\n",
1752
+ " <td>1</td>\n",
1753
+ " </tr>\n",
1754
+ " <tr>\n",
1755
+ " <th>6</th>\n",
1756
+ " <td>0038ba98</td>\n",
1757
+ " <td>0</td>\n",
1758
+ " </tr>\n",
1759
+ " <tr>\n",
1760
+ " <th>7</th>\n",
1761
+ " <td>0068a485</td>\n",
1762
+ " <td>0</td>\n",
1763
+ " </tr>\n",
1764
+ " <tr>\n",
1765
+ " <th>8</th>\n",
1766
+ " <td>0069fbed</td>\n",
1767
+ " <td>1</td>\n",
1768
+ " </tr>\n",
1769
+ " <tr>\n",
1770
+ " <th>9</th>\n",
1771
+ " <td>0083e397</td>\n",
1772
+ " <td>0</td>\n",
1773
+ " </tr>\n",
1774
+ " <tr>\n",
1775
+ " <th>10</th>\n",
1776
+ " <td>0087dd65</td>\n",
1777
+ " <td>0</td>\n",
1778
+ " </tr>\n",
1779
+ " <tr>\n",
1780
+ " <th>11</th>\n",
1781
+ " <td>00abe655</td>\n",
1782
+ " <td>0</td>\n",
1783
+ " </tr>\n",
1784
+ " <tr>\n",
1785
+ " <th>12</th>\n",
1786
+ " <td>00ae59c9</td>\n",
1787
+ " <td>1</td>\n",
1788
+ " </tr>\n",
1789
+ " <tr>\n",
1790
+ " <th>13</th>\n",
1791
+ " <td>00af6387</td>\n",
1792
+ " <td>1</td>\n",
1793
+ " </tr>\n",
1794
+ " <tr>\n",
1795
+ " <th>14</th>\n",
1796
+ " <td>00bd4359</td>\n",
1797
+ " <td>1</td>\n",
1798
+ " </tr>\n",
1799
+ " <tr>\n",
1800
+ " <th>15</th>\n",
1801
+ " <td>00c0cd71</td>\n",
1802
+ " <td>1</td>\n",
1803
+ " </tr>\n",
1804
+ " <tr>\n",
1805
+ " <th>16</th>\n",
1806
+ " <td>00d56d4b</td>\n",
1807
+ " <td>0</td>\n",
1808
+ " </tr>\n",
1809
+ " <tr>\n",
1810
+ " <th>17</th>\n",
1811
+ " <td>00d9913d</td>\n",
1812
+ " <td>0</td>\n",
1813
+ " </tr>\n",
1814
+ " <tr>\n",
1815
+ " <th>18</th>\n",
1816
+ " <td>00e6167c</td>\n",
1817
+ " <td>0</td>\n",
1818
+ " </tr>\n",
1819
+ " <tr>\n",
1820
+ " <th>19</th>\n",
1821
+ " <td>00ebc35d</td>\n",
1822
+ " <td>1</td>\n",
1823
+ " </tr>\n",
1824
+ " </tbody>\n",
1825
+ "</table>\n",
1826
+ "</div>"
1827
+ ],
1828
+ "text/plain": [
1829
+ " id sii\n",
1830
+ "0 00008ff9 1\n",
1831
+ "1 000fd460 0\n",
1832
+ "2 00105258 0\n",
1833
+ "3 00115b9f 0\n",
1834
+ "4 0016bb22 1\n",
1835
+ "5 001f3379 1\n",
1836
+ "6 0038ba98 0\n",
1837
+ "7 0068a485 0\n",
1838
+ "8 0069fbed 1\n",
1839
+ "9 0083e397 0\n",
1840
+ "10 0087dd65 0\n",
1841
+ "11 00abe655 0\n",
1842
+ "12 00ae59c9 1\n",
1843
+ "13 00af6387 1\n",
1844
+ "14 00bd4359 1\n",
1845
+ "15 00c0cd71 1\n",
1846
+ "16 00d56d4b 0\n",
1847
+ "17 00d9913d 0\n",
1848
+ "18 00e6167c 0\n",
1849
+ "19 00ebc35d 1"
1850
+ ]
1851
+ },
1852
+ "execution_count": 18,
1853
+ "metadata": {},
1854
+ "output_type": "execute_result"
1855
+ }
1856
+ ],
1857
+ "source": [
1858
+ "final_submission"
1859
+ ]
1860
+ }
1861
+ ],
1862
+ "metadata": {
1863
+ "kaggle": {
1864
+ "accelerator": "gpu",
1865
+ "dataSources": [
1866
+ {
1867
+ "databundleVersionId": 9643020,
1868
+ "sourceId": 81933,
1869
+ "sourceType": "competition"
1870
+ }
1871
+ ],
1872
+ "dockerImageVersionId": 30823,
1873
+ "isGpuEnabled": true,
1874
+ "isInternetEnabled": false,
1875
+ "language": "python",
1876
+ "sourceType": "notebook"
1877
+ },
1878
+ "kernelspec": {
1879
+ "display_name": "Python 3",
1880
+ "language": "python",
1881
+ "name": "python3"
1882
+ },
1883
+ "language_info": {
1884
+ "codemirror_mode": {
1885
+ "name": "ipython",
1886
+ "version": 3
1887
+ },
1888
+ "file_extension": ".py",
1889
+ "mimetype": "text/x-python",
1890
+ "name": "python",
1891
+ "nbconvert_exporter": "python",
1892
+ "pygments_lexer": "ipython3",
1893
+ "version": "3.10.12"
1894
+ },
1895
+ "papermill": {
1896
+ "default_parameters": {},
1897
+ "duration": 346.01461,
1898
+ "end_time": "2024-12-22T02:28:17.567852",
1899
+ "environment_variables": {},
1900
+ "exception": null,
1901
+ "input_path": "__notebook__.ipynb",
1902
+ "output_path": "__notebook__.ipynb",
1903
+ "parameters": {},
1904
+ "start_time": "2024-12-22T02:22:31.553242",
1905
+ "version": "2.6.0"
1906
+ }
1907
+ },
1908
+ "nbformat": 4,
1909
+ "nbformat_minor": 5
1910
+ }