Commit c5fa54c by sdtemple · 1 Parent(s): 40e0ac2

third iterate models

README.md CHANGED
@@ -26,10 +26,13 @@ The 24 features are of the characteristics:
  1. Sound frequency percentiles
  - https://www.tandfonline.com/doi/full/10.1080/09524622.2020.1730241
  - quantiles = [0.8,0.9,0.925,0.95,0.975,0.99,]
- 2. 13 Mel spectrogram cepstral coefficients
- - Summary statistics of zero crossing rates in 1-second segments
  - Mean, standard deviations, min, max
- Zero crossing rate of the entire clip


  These features summarize an entire clip, irrespective of position in waveform or spectrogram, and technically, the clip does not have to be 5 seconds long.
@@ -74,13 +77,13 @@ First model iterate
  1. Fit decision tree-based classifiers to Freefield and Warblrb10k
  - About 3/4 of the Warblrb10k clips contain a bird
  - About 1/4 of the Freefield clips do not contain a bird
- - I prioritized simpler models, as it is very easy to overfit to the prediction dataset
  2. Grid search with 25% test and 75% training splits (averaging over 5 randomizations)
- - RandomForestClassifier, GradientBoostingClassifier, XGBoostClassifier
  - n_estimators: [10, 20, 50,]
  - max_depth: [5, 10, 20,]
  - I saved the results in the following file
- 3. I chose to use the XGBoostClassifier with n_estimators=20 and max_depth=5
  - This simpler model does not have too large a gap between training and test metrics
  - The test accuracy is
  - The test precision is
@@ -90,24 +93,34 @@ First model iterate
  Second model iterate
  ---

- 1. Fit XGBoost classifier to all heretofore mentioned data
  2. Use first model iterate to predict "hasbird" in Birdclef data
  - Apply zero padding to the Birdclef data if the final clip is longer than 2 seconds
  - Subset Birdclef data to those with
  - Predicted presence > 0.75, or
  - Audio file duration <= 15 seconds, or
  - Amphibia, Insecta, Mammalia as 0 in 2025 data
- 3. Grid search with 25% test and 75% training splits (averaging over 10 randomizations)
- n_estimators: [10, 20, 50,]
- max_depth: [5, 10, 20,]

  Third model iterate
  ---

- 1. Fit XGBoostClassifier to all heretofore mentioned data
  2. Use the second model iterate to predict "hasbird" in Birdclef data
- - Subset Birdclef data to those with
- - Predicted presence >
  - Amphibia, Insecta, Mammalia as 0 in 2025 data

  Non-2025 model
 
  1. Sound frequency percentiles
  - https://www.tandfonline.com/doi/full/10.1080/09524622.2020.1730241
  - quantiles = [0.8,0.9,0.925,0.95,0.975,0.99,]
+ 2. Thirteen Mel-frequency cepstral coefficients (MFCCs)
+ - Averaged over axis 1 (columns)
+ - n_fft=2048, hop_length=512
+ 3. Summary statistics of zero crossing rates in 1-second segments
  - Mean, standard deviations, min, max
+ - Zero crossing rate of the entire clip
+ - Threshold of 0.02
 
 
  These features summarize an entire clip, irrespective of position in waveform or spectrogram, and technically, the clip does not have to be 5 seconds long.
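
As a rough illustration of the zero crossing features in item 3, the sketch below computes the rate in 1-second segments and summarizes it, using only the standard library. How the 0.02 threshold is applied is an assumption here (samples with magnitude at or below it are treated as zero before sign changes are counted), and the function and key names are hypothetical; the project's own feature code may differ.

```python
from statistics import mean, pstdev

def zero_crossing_rate(samples, threshold=0.02):
    """Fraction of adjacent sample pairs whose sign differs.

    Assumption: values with |x| <= threshold are clipped to 0 first,
    so low-level noise around zero does not count as a crossing.
    """
    if len(samples) < 2:
        return 0.0
    clipped = [0.0 if abs(x) <= threshold else x for x in samples]
    signs = [x >= 0 for x in clipped]  # zero is treated as non-negative
    crossings = sum(1 for a, b in zip(signs, signs[1:]) if a != b)
    return crossings / (len(samples) - 1)

def zcr_features(samples, sample_rate, threshold=0.02):
    """Mean, std, min, max of per-second ZCR, plus the whole-clip ZCR."""
    segments = [samples[i:i + sample_rate]
                for i in range(0, len(samples), sample_rate)]
    # skip a trailing fragment too short to contain a sample pair
    rates = [zero_crossing_rate(s, threshold) for s in segments if len(s) >= 2]
    return {
        "zcr_mean": mean(rates),
        "zcr_std": pstdev(rates),
        "zcr_min": min(rates),
        "zcr_max": max(rates),
        "zcr_clip": zero_crossing_rate(samples, threshold),
    }

# toy clip: 2 "seconds" at 8 samples per second
clip = [0.5, -0.5] * 4 + [0.5] * 8
features = zcr_features(clip, sample_rate=8)
```

Because the summary only depends on the list of per-segment rates, the same function works for clips that are shorter or longer than 5 seconds.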
 
  1. Fit decision tree-based classifiers to Freefield and Warblrb10k
  - About 3/4 of the Warblrb10k clips contain a bird
  - About 1/4 of the Freefield clips do not contain a bird
+ - No data augmentation
  2. Grid search with 25% test and 75% training splits (averaging over 5 randomizations)
+ - `RandomForestClassifier`, `GradientBoostingClassifier`, `XGBClassifier`
  - n_estimators: [10, 20, 50,]
  - max_depth: [5, 10, 20,]
  - I saved the results in the following file
+ 3. I chose to use the `XGBClassifier` with n_estimators=20 and max_depth=5
  - This simpler model does not have too large a gap between training and test metrics
  - The test accuracy is
  - The test precision is
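
The grid search in step 2 can be sketched without any model library: expand the parameter grid into combinations, score each combination on several random splits, and average the scores. The `score_fn` callback below is a hypothetical stand-in for fitting a classifier and returning a test metric, and `toy_score` is a placeholder rather than a real model fit; neither is taken from the repository.

```python
import itertools
import random
from statistics import mean

def expand_grid(grid):
    """Turn {'a': [1, 2], 'b': [3]} into [{'a': 1, 'b': 3}, {'a': 2, 'b': 3}]."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in itertools.product(*grid.values())]

def averaged_grid_search(grid, score_fn, num_splits=5, seed=0):
    """Average score_fn(params, split_seed) over num_splits random splits."""
    rng = random.Random(seed)
    # one seed per train/test randomization, shared by all combinations
    split_seeds = [rng.randrange(2**16) for _ in range(num_splits)]
    results = []
    for params in expand_grid(grid):
        scores = [score_fn(params, s) for s in split_seeds]
        results.append({**params, "mean_score": mean(scores)})
    return results

grid = {"n_estimators": [10, 20, 50], "max_depth": [5, 10, 20]}

def toy_score(params, split_seed):
    # hypothetical placeholder metric, NOT a real model fit
    return 1.0 / (params["n_estimators"] + params["max_depth"])

table = averaged_grid_search(grid, toy_score)
```

Sharing the split seeds across combinations means every parameter setting is scored on the same train/test partitions, which makes the averaged scores directly comparable.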
 
  Second model iterate
  ---

+ 1. Fit `XGBClassifier` to all heretofore mentioned data
  2. Use first model iterate to predict "hasbird" in Birdclef data
  - Apply zero padding to the Birdclef data if the final clip is longer than 2 seconds
  - Subset Birdclef data to those with
  - Predicted presence > 0.75, or
  - Audio file duration <= 15 seconds, or
  - Amphibia, Insecta, Mammalia as 0 in 2025 data
+ 3. Five augmented instances for each file
+ - Use OneOf in audiomentations
+ - AddGaussianNoise(min_amplitude=0.001, max_amplitude=0.015, p=1.)
+ - AddGaussianSNR(min_snr_db=5.0, max_snr_db=40.0, p=1.)
+ - AddColorNoise(min_snr_db=5.0, max_snr_db=40.0, n_fft=128, p=1.)
+ 4. Grid search with 25% test and 75% training splits (averaging over 10 randomizations)
+ - n_estimators: [10, 20, 50,]
+ - max_depth: [5, 10, 20,]
+ 5. I chose the `XGBClassifier` with n_estimators=50 and max_depth=10
+ - The test accuracy is
+ - The test precision is
+ - The test recall is
+ - The test AUROC is
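
The augmentation in step 3 picks one noise transform at random for each of five copies of a clip. The sketch below imitates that idea with the standard library only; it is not the audiomentations API, the function names are hypothetical, and the colored-noise transform is left out for brevity.

```python
import math
import random

def add_gaussian_noise(samples, rng, min_amplitude=0.001, max_amplitude=0.015):
    """Add zero-mean Gaussian noise with a randomly drawn standard deviation."""
    sigma = rng.uniform(min_amplitude, max_amplitude)
    return [x + rng.gauss(0.0, sigma) for x in samples]

def add_gaussian_snr(samples, rng, min_snr_db=5.0, max_snr_db=40.0):
    """Add Gaussian noise scaled to a randomly drawn signal-to-noise ratio."""
    snr_db = rng.uniform(min_snr_db, max_snr_db)
    signal_rms = math.sqrt(sum(x * x for x in samples) / len(samples))
    noise_rms = signal_rms / (10 ** (snr_db / 20))
    return [x + rng.gauss(0.0, noise_rms) for x in samples]

def one_of(transforms, samples, rng):
    """Apply exactly one transform, chosen uniformly at random."""
    return rng.choice(transforms)(samples, rng)

def augment(samples, copies=5, seed=0):
    """Return `copies` independently augmented versions of one clip."""
    rng = random.Random(seed)
    transforms = [add_gaussian_noise, add_gaussian_snr]
    return [one_of(transforms, samples, rng) for _ in range(copies)]

# toy clip: 100 samples of a sine wave
clip = [math.sin(2 * math.pi * 5 * t / 100) for t in range(100)]
augmented = augment(clip)
```

Each augmented copy keeps the original length, so the 24 summary features can be recomputed on it without any other change to the pipeline.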
 
  Third model iterate
  ---

+ 1. Fit `XGBClassifier` to all heretofore mentioned data
  2. Use the second model iterate to predict "hasbird" in Birdclef data
+ 3. Subset Birdclef data to those with
+ - Predicted presence > 0.90
  - Amphibia, Insecta, Mammalia as 0 in 2025 data
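
One possible reading of the subsetting step is sketched below. The record layout and the key names `class_name` and `predicted_presence` are hypothetical, and treating the two criteria as alternatives (non-bird classes labelled 0, confidently predicted clips labelled 1) is an assumption carried over from the "or" wording of the second iterate.

```python
NON_BIRD_CLASSES = {"Amphibia", "Insecta", "Mammalia"}

def build_training_subset(records, threshold=0.90):
    """Return (record, label) pairs for the next training round.

    Assumptions: each record is a dict with hypothetical keys
    'class_name' and 'predicted_presence'. Non-bird classes are kept
    with label 0; other clips are kept with label 1 only when the
    previous model's predicted presence exceeds the threshold.
    """
    subset = []
    for rec in records:
        if rec["class_name"] in NON_BIRD_CLASSES:
            subset.append((rec, 0))
        elif rec["predicted_presence"] > threshold:
            subset.append((rec, 1))
    return subset

# made-up records for illustration only
records = [
    {"class_name": "Aves", "predicted_presence": 0.97},
    {"class_name": "Aves", "predicted_presence": 0.60},
    {"class_name": "Insecta", "predicted_presence": 0.95},
    {"class_name": "Amphibia", "predicted_presence": 0.10},
]
subset = build_training_subset(records)
labels = [label for _, label in subset]
```

Raising the threshold from 0.75 to 0.90 keeps fewer positive clips, trading training set size for cleaner labels.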
 
  Non-2025 model
fit_model.ipynb ADDED
@@ -0,0 +1,147 @@
+ {
+ "cells": [
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "2b5565b0-0eda-4961-9065-b3e56d683baa",
+ "metadata": {
+ "editable": true,
+ "execution": {
+ "iopub.execute_input": "2025-08-21T22:03:54.063042Z",
+ "iopub.status.busy": "2025-08-21T22:03:54.062823Z",
+ "iopub.status.idle": "2025-08-21T22:03:54.067443Z",
+ "shell.execute_reply": "2025-08-21T22:03:54.066939Z"
+ },
+ "papermill": {
+ "duration": 0.013166,
+ "end_time": "2025-08-21T22:03:54.068798",
+ "exception": false,
+ "start_time": "2025-08-21T22:03:54.055632",
+ "status": "completed"
+ },
+ "slideshow": {
+ "slide_type": ""
+ },
+ "tags": [
+ "parameters"
+ ]
+ },
+ "outputs": [],
+ "source": [
+ "input_file = None\n",
+ "output_file = None\n",
+ "n_estimators = None\n",
+ "max_depth = None\n",
+ "random_state = None"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "3c36001f-b354-4457-95b4-01953533dbaa",
+ "metadata": {
+ "execution": {
+ "iopub.execute_input": "2025-08-21T22:03:54.091554Z",
+ "iopub.status.busy": "2025-08-21T22:03:54.091171Z",
+ "iopub.status.idle": "2025-08-21T22:03:57.814895Z",
+ "shell.execute_reply": "2025-08-21T22:03:57.814474Z"
+ },
+ "papermill": {
+ "duration": 3.728978,
+ "end_time": "2025-08-21T22:03:57.816040",
+ "exception": false,
+ "start_time": "2025-08-21T22:03:54.087062",
+ "status": "completed"
+ },
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "import pandas as pd\n",
+ "import numpy as np\n",
+ "import xgboost as xgb\n",
+ "import sklearn"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "82f8195d-236a-4288-89fc-4952b377f0cc",
+ "metadata": {
+ "execution": {
+ "iopub.execute_input": "2025-08-21T22:03:57.825272Z",
+ "iopub.status.busy": "2025-08-21T22:03:57.824333Z",
+ "iopub.status.idle": "2025-08-21T22:04:08.809191Z",
+ "shell.execute_reply": "2025-08-21T22:04:08.808791Z"
+ },
+ "papermill": {
+ "duration": 10.990556,
+ "end_time": "2025-08-21T22:04:08.810302",
+ "exception": false,
+ "start_time": "2025-08-21T22:03:57.819746",
+ "status": "completed"
+ },
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "# load data\n",
+ "combined = pd.read_csv(input_file,low_memory=False)\n",
+ "X = combined[[f'feature_{i}' for i in range(24)]].to_numpy()\n",
+ "y = combined['hasbird'].to_numpy()\n",
+ "del combined"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "466457c2",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# define, fit, and save model\n",
+ "model = xgb.XGBClassifier(n_estimators=int(n_estimators),\n",
+ "                          max_depth=int(max_depth),\n",
+ "                          random_state=int(random_state))\n",
+ "model.fit(X, y)\n",
+ "model.save_model(output_file)"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Birdclef",
+ "language": "python",
+ "name": "birdclef"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.12.11"
+ },
+ "papermill": {
+ "default_parameters": {},
+ "duration": 3943.115939,
+ "end_time": "2025-08-21T23:09:35.210970",
+ "environment_variables": {},
+ "exception": null,
+ "input_path": "fit_model.ipynb",
+ "output_path": "ran/fit_model.ipynb",
+ "parameters": {
+ "input_file": "xgb_rnd3_next.csv",
+ "output_file": "third_model_results.csv"
+ },
+ "start_time": "2025-08-21T22:03:52.095031",
+ "version": "2.6.0"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+ }
fit_models.ipynb ADDED
@@ -0,0 +1,1045 @@
+ {
+ "cells": [
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "id": "2b5565b0-0eda-4961-9065-b3e56d683baa",
+ "metadata": {
+ "editable": true,
+ "execution": {
+ "iopub.execute_input": "2025-08-21T22:03:54.063042Z",
+ "iopub.status.busy": "2025-08-21T22:03:54.062823Z",
+ "iopub.status.idle": "2025-08-21T22:03:54.067443Z",
+ "shell.execute_reply": "2025-08-21T22:03:54.066939Z"
+ },
+ "papermill": {
+ "duration": 0.013166,
+ "end_time": "2025-08-21T22:03:54.068798",
+ "exception": false,
+ "start_time": "2025-08-21T22:03:54.055632",
+ "status": "completed"
+ },
+ "slideshow": {
+ "slide_type": ""
+ },
+ "tags": [
+ "parameters"
+ ]
+ },
+ "outputs": [],
+ "source": [
+ "input_file = None\n",
+ "output_file = None"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "id": "2af28fa1",
+ "metadata": {
+ "execution": {
+ "iopub.execute_input": "2025-08-21T22:03:54.076410Z",
+ "iopub.status.busy": "2025-08-21T22:03:54.076201Z",
+ "iopub.status.idle": "2025-08-21T22:03:54.082455Z",
+ "shell.execute_reply": "2025-08-21T22:03:54.082034Z"
+ },
+ "papermill": {
+ "duration": 0.011177,
+ "end_time": "2025-08-21T22:03:54.083645",
+ "exception": false,
+ "start_time": "2025-08-21T22:03:54.072468",
+ "status": "completed"
+ },
+ "tags": [
+ "injected-parameters"
+ ]
+ },
+ "outputs": [],
+ "source": [
+ "# Parameters\n",
+ "input_file = \"xgb_rnd3_next.csv\"\n",
+ "output_file = \"third_model_results.csv\"\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "id": "3c36001f-b354-4457-95b4-01953533dbaa",
+ "metadata": {
+ "execution": {
+ "iopub.execute_input": "2025-08-21T22:03:54.091554Z",
+ "iopub.status.busy": "2025-08-21T22:03:54.091171Z",
+ "iopub.status.idle": "2025-08-21T22:03:57.814895Z",
+ "shell.execute_reply": "2025-08-21T22:03:57.814474Z"
+ },
+ "papermill": {
+ "duration": 3.728978,
+ "end_time": "2025-08-21T22:03:57.816040",
+ "exception": false,
+ "start_time": "2025-08-21T22:03:54.087062",
+ "status": "completed"
+ },
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "import pandas as pd\n",
+ "import os\n",
+ "import numpy as np\n",
+ "import matplotlib.pyplot as plt\n",
+ "import itertools\n",
+ "import copy\n",
+ "\n",
+ "import xgboost as xgb\n",
+ "import sklearn\n",
+ "from sklearn.model_selection import train_test_split, GridSearchCV\n",
+ "from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_curve, auc, precision_recall_curve"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "id": "82f8195d-236a-4288-89fc-4952b377f0cc",
+ "metadata": {
+ "execution": {
+ "iopub.execute_input": "2025-08-21T22:03:57.825272Z",
+ "iopub.status.busy": "2025-08-21T22:03:57.824333Z",
+ "iopub.status.idle": "2025-08-21T22:04:08.809191Z",
+ "shell.execute_reply": "2025-08-21T22:04:08.808791Z"
+ },
+ "papermill": {
+ "duration": 10.990556,
+ "end_time": "2025-08-21T22:04:08.810302",
+ "exception": false,
+ "start_time": "2025-08-21T22:03:57.819746",
+ "status": "completed"
+ },
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "# combined = pd.read_csv('xgb_rnd3_next.csv',low_memory=False)\n",
+ "combined = pd.read_csv(input_file,low_memory=False)\n",
+ "X = combined[[f'feature_{i}' for i in range(24)]].to_numpy()\n",
+ "y = combined['hasbird'].to_numpy()\n",
+ "del combined"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "id": "235a0d9e-5476-43e7-ada3-96f9fe1254ff",
+ "metadata": {
+ "execution": {
+ "iopub.execute_input": "2025-08-21T22:04:08.819492Z",
+ "iopub.status.busy": "2025-08-21T22:04:08.818725Z",
+ "iopub.status.idle": "2025-08-21T22:04:08.822610Z",
+ "shell.execute_reply": "2025-08-21T22:04:08.822231Z"
+ },
+ "papermill": {
+ "duration": 0.009105,
+ "end_time": "2025-08-21T22:04:08.823514",
+ "exception": false,
+ "start_time": "2025-08-21T22:04:08.814409",
+ "status": "completed"
+ },
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "grid_params = {\n",
+ "    'model': [\n",
+ "        xgb.XGBClassifier,\n",
+ "    ],\n",
+ "    'n_estimators': [10, 20, 50,],\n",
+ "    'max_depth': [5, 10, 20,],\n",
+ "    # 'n_estimators': [5,],\n",
+ "    # 'max_depth': [2, 5,],\n",
+ "}\n",
+ "param_combos = list(itertools.product(*grid_params.values()))\n",
+ "param_combos = [{k: v for k, v in zip(grid_params.keys(), combination)} for combination in param_combos]"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "id": "7aeeb04b-9884-4780-b17b-882ac06316dc",
+ "metadata": {
+ "execution": {
+ "iopub.execute_input": "2025-08-21T22:04:08.831773Z",
+ "iopub.status.busy": "2025-08-21T22:04:08.831007Z",
+ "iopub.status.idle": "2025-08-21T23:09:34.542148Z",
+ "shell.execute_reply": "2025-08-21T23:09:34.541580Z"
+ },
+ "papermill": {
+ "duration": 3925.716838,
+ "end_time": "2025-08-21T23:09:34.543697",
+ "exception": false,
+ "start_time": "2025-08-21T22:04:08.826859",
+ "status": "completed"
+ },
+ "tags": []
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Fit 0\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Fit 1\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Fit 2\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Fit 3\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Fit 4\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Fit 5\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Fit 6\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Fit 7\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Fit 8\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Fit 9\n"
+ ]
+ },
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>model</th>\n",
+ " <th>n_estimators</th>\n",
+ " <th>max_depth</th>\n",
+ " <th>random_state</th>\n",
+ " <th>test_size</th>\n",
+ " <th>random_state_split</th>\n",
+ " <th>test_accuracy</th>\n",
+ " <th>test_precision</th>\n",
+ " <th>test_recall</th>\n",
+ " <th>test_f1</th>\n",
+ " <th>...</th>\n",
+ " <th>var14</th>\n",
+ " <th>var15</th>\n",
+ " <th>var16</th>\n",
+ " <th>var17</th>\n",
+ " <th>var18</th>\n",
+ " <th>var19</th>\n",
+ " <th>var20</th>\n",
+ " <th>var21</th>\n",
+ " <th>var22</th>\n",
+ " <th>var23</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>0</th>\n",
+ " <td>XGBClassifier</td>\n",
+ " <td>10</td>\n",
+ " <td>5</td>\n",
+ " <td>62618</td>\n",
+ " <td>0.25</td>\n",
+ " <td>8564</td>\n",
+ " <td>0.919485</td>\n",
+ " <td>0.979252</td>\n",
+ " <td>0.930664</td>\n",
+ " <td>0.954340</td>\n",
+ " <td>...</td>\n",
+ " <td>0.016507</td>\n",
+ " <td>0.049293</td>\n",
+ " <td>0.008913</td>\n",
+ " <td>0.008224</td>\n",
+ " <td>0.007370</td>\n",
+ " <td>0.017610</td>\n",
+ " <td>0.001872</td>\n",
+ " <td>0.015215</td>\n",
+ " <td>0.010002</td>\n",
+ " <td>0.011387</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>1</th>\n",
+ " <td>XGBClassifier</td>\n",
+ " <td>10</td>\n",
+ " <td>5</td>\n",
+ " <td>38092</td>\n",
+ " <td>0.25</td>\n",
+ " <td>29471</td>\n",
+ " <td>0.916542</td>\n",
+ " <td>0.979042</td>\n",
+ " <td>0.927811</td>\n",
+ " <td>0.952738</td>\n",
+ " <td>...</td>\n",
+ " <td>0.014524</td>\n",
+ " <td>0.041489</td>\n",
+ " <td>0.009664</td>\n",
+ " <td>0.007179</td>\n",
+ " <td>0.010653</td>\n",
+ " <td>0.015027</td>\n",
+ " <td>0.000000</td>\n",
+ " <td>0.012734</td>\n",
+ " <td>0.009035</td>\n",
+ " <td>0.009792</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>2</th>\n",
+ " <td>XGBClassifier</td>\n",
+ " <td>10</td>\n",
+ " <td>5</td>\n",
+ " <td>53379</td>\n",
+ " <td>0.25</td>\n",
+ " <td>53105</td>\n",
+ " <td>0.920031</td>\n",
+ " <td>0.977984</td>\n",
+ " <td>0.932223</td>\n",
+ " <td>0.954555</td>\n",
+ " <td>...</td>\n",
+ " <td>0.029407</td>\n",
+ " <td>0.042983</td>\n",
+ " <td>0.011433</td>\n",
+ " <td>0.007371</td>\n",
+ " <td>0.007829</td>\n",
+ " <td>0.019937</td>\n",
+ " <td>0.006534</td>\n",
+ " <td>0.014162</td>\n",
+ " <td>0.008016</td>\n",
+ " <td>0.010182</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>3</th>\n",
+ " <td>XGBClassifier</td>\n",
+ " <td>10</td>\n",
+ " <td>5</td>\n",
+ " <td>53990</td>\n",
+ " <td>0.25</td>\n",
+ " <td>1020</td>\n",
+ " <td>0.920008</td>\n",
+ " <td>0.977727</td>\n",
+ " <td>0.932454</td>\n",
+ " <td>0.954554</td>\n",
+ " <td>...</td>\n",
+ " <td>0.017415</td>\n",
+ " <td>0.040748</td>\n",
+ " <td>0.009362</td>\n",
+ " <td>0.009092</td>\n",
+ " <td>0.004017</td>\n",
+ " <td>0.019394</td>\n",
+ " <td>0.000000</td>\n",
+ " <td>0.013799</td>\n",
+ " <td>0.011235</td>\n",
+ " <td>0.011108</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>4</th>\n",
+ " <td>XGBClassifier</td>\n",
+ " <td>10</td>\n",
+ " <td>5</td>\n",
+ " <td>20157</td>\n",
+ " <td>0.25</td>\n",
+ " <td>8650</td>\n",
+ " <td>0.917305</td>\n",
+ " <td>0.978153</td>\n",
+ " <td>0.929241</td>\n",
+ " <td>0.953070</td>\n",
+ " <td>...</td>\n",
+ " <td>0.019698</td>\n",
+ " <td>0.052578</td>\n",
+ " <td>0.011287</td>\n",
+ " <td>0.006563</td>\n",
+ " <td>0.005683</td>\n",
+ " <td>0.013883</td>\n",
+ " <td>0.008399</td>\n",
+ " <td>0.010119</td>\n",
+ " <td>0.008999</td>\n",
+ " <td>0.007195</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>5</th>\n",
+ " <td>XGBClassifier</td>\n",
+ " <td>10</td>\n",
+ " <td>5</td>\n",
+ " <td>29087</td>\n",
+ " <td>0.25</td>\n",
+ " <td>15247</td>\n",
+ " <td>0.920376</td>\n",
+ " <td>0.978004</td>\n",
+ " <td>0.932595</td>\n",
+ " <td>0.954760</td>\n",
+ " <td>...</td>\n",
+ " <td>0.014503</td>\n",
+ " <td>0.042829</td>\n",
+ " <td>0.009559</td>\n",
+ " <td>0.006757</td>\n",
+ " <td>0.010323</td>\n",
+ " <td>0.013950</td>\n",
+ " <td>0.007816</td>\n",
+ " <td>0.010859</td>\n",
+ " <td>0.013720</td>\n",
+ " <td>0.011215</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>6</th>\n",
+ " <td>XGBClassifier</td>\n",
+ " <td>10</td>\n",
+ " <td>5</td>\n",
+ " <td>63289</td>\n",
+ " <td>0.25</td>\n",
+ " <td>37405</td>\n",
+ " <td>0.918176</td>\n",
+ " <td>0.977817</td>\n",
+ " <td>0.930484</td>\n",
+ " <td>0.953563</td>\n",
+ " <td>...</td>\n",
+ " <td>0.016545</td>\n",
+ " <td>0.049997</td>\n",
+ " <td>0.009340</td>\n",
+ " <td>0.006492</td>\n",
+ " <td>0.005942</td>\n",
+ " <td>0.013955</td>\n",
+ " <td>0.005050</td>\n",
+ " <td>0.008405</td>\n",
+ " <td>0.009122</td>\n",
+ " <td>0.010000</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "<p>7 rows × 42 columns</p>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " model n_estimators max_depth random_state test_size \\\n",
+ "0 XGBClassifier 10 5 62618 0.25 \n",
+ "1 XGBClassifier 10 5 38092 0.25 \n",
+ "2 XGBClassifier 10 5 53379 0.25 \n",
+ "3 XGBClassifier 10 5 53990 0.25 \n",
+ "4 XGBClassifier 10 5 20157 0.25 \n",
+ "5 XGBClassifier 10 5 29087 0.25 \n",
+ "6 XGBClassifier 10 5 63289 0.25 \n",
+ "\n",
+ " random_state_split test_accuracy test_precision test_recall test_f1 \\\n",
+ "0 8564 0.919485 0.979252 0.930664 0.954340 \n",
+ "1 29471 0.916542 0.979042 0.927811 0.952738 \n",
+ "2 53105 0.920031 0.977984 0.932223 0.954555 \n",
+ "3 1020 0.920008 0.977727 0.932454 0.954554 \n",
+ "4 8650 0.917305 0.978153 0.929241 0.953070 \n",
+ "5 15247 0.920376 0.978004 0.932595 0.954760 \n",
+ "6 37405 0.918176 0.977817 0.930484 0.953563 \n",
+ "\n",
+ " ... var14 var15 var16 var17 var18 var19 var20 \\\n",
+ "0 ... 0.016507 0.049293 0.008913 0.008224 0.007370 0.017610 0.001872 \n",
+ "1 ... 0.014524 0.041489 0.009664 0.007179 0.010653 0.015027 0.000000 \n",
+ "2 ... 0.029407 0.042983 0.011433 0.007371 0.007829 0.019937 0.006534 \n",
+ "3 ... 0.017415 0.040748 0.009362 0.009092 0.004017 0.019394 0.000000 \n",
+ "4 ... 0.019698 0.052578 0.011287 0.006563 0.005683 0.013883 0.008399 \n",
+ "5 ... 0.014503 0.042829 0.009559 0.006757 0.010323 0.013950 0.007816 \n",
+ "6 ... 0.016545 0.049997 0.009340 0.006492 0.005942 0.013955 0.005050 \n",
+ "\n",
+ " var21 var22 var23 \n",
+ "0 0.015215 0.010002 0.011387 \n",
+ "1 0.012734 0.009035 0.009792 \n",
+ "2 0.014162 0.008016 0.010182 \n",
+ "3 0.013799 0.011235 0.011108 \n",
+ "4 0.010119 0.008999 0.007195 \n",
+ "5 0.010859 0.013720 0.011215 \n",
+ "6 0.008405 0.009122 0.010000 \n",
+ "\n",
+ "[7 rows x 42 columns]"
+ ]
+ },
+ "execution_count": 6,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# how many duplicated runs of the models\n",
+ "# for averaging results\n",
+ "num_fits = 10\n",
+ "test_size = 0.25\n",
+ "\n",
+ "param_dicts = []\n",
+ "report_dicts = []\n",
+ "fi_dicts = []\n",
+ "itr = 0\n",
+ "\n",
+ "for i in range(num_fits):\n",
+ "\n",
+ "    # randomize the train test split\n",
+ "    rsp = np.random.randint(0,2**16)\n",
+ "    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=rsp)\n",
+ "\n",
+ "    print(f\"Fit {i}\")\n",
+ "    \n",
+ "    for _ in param_combos:\n",
+ "        \n",
+ "        itr += 1\n",
+ "        \n",
+ "        # manipulate the parameter combinations\n",
+ "        rs = np.random.randint(0,2**16)\n",
+ "        d = {k:v for k,v in _.items() if k!='model'}\n",
+ "        d['random_state'] = rs\n",
+ "        \n",
+ "        # fit the model\n",
+ "        model = _['model'](**d)\n",
+ "        model.fit(X_train, y_train)\n",
+ "        y_test_pred = model.predict(X_test)\n",
+ "        y_test_pred_proba = model.predict_proba(X_test)\n",
+ "        y_train_pred = model.predict(X_train)\n",
+ "        y_train_pred_proba = model.predict_proba(X_train)\n",
+ "        \n",
+ "        # parameters dictionary\n",
+ "        param_dict = copy.deepcopy(d)\n",
+ "        if type(model) == type(xgb.XGBClassifier()):\n",
+ "            param_dict['model'] = 'XGBClassifier'\n",
+ "        elif type(model) == type(sklearn.ensemble.RandomForestClassifier()):\n",
+ "            param_dict['model'] = 'RandomForestClassifier'\n",
+ "        elif type(model) == type(sklearn.ensemble.GradientBoostingClassifier()):\n",
+ "            param_dict['model'] = 'GradientBoostingClassifier'\n",
+ "        param_dict['unique_id'] = itr\n",
+ "        param_dict['test_size'] = test_size\n",
+ "        param_dict['random_state_split'] = rsp\n",
+ "        param_dicts.append(param_dict)\n",
+ "        \n",
+ "        # report dictionary to compute and save (sklearn metrics take y_true first, then y_pred)\n",
+ "        report_dict = {\n",
+ "            'unique_id' : itr,\n",
+ "            'test_accuracy' : sklearn.metrics.accuracy_score(y_test, y_test_pred),\n",
+ "            'test_precision' : sklearn.metrics.precision_score(y_test, y_test_pred),\n",
+ "            'test_recall' : sklearn.metrics.recall_score(y_test, y_test_pred),\n",
+ "            'test_f1' : sklearn.metrics.f1_score(y_test, y_test_pred),\n",
+ "            'test_auroc' : sklearn.metrics.roc_auc_score(y_test, y_test_pred_proba[:,1]),\n",
+ "            'test_log_loss' : sklearn.metrics.log_loss(y_test, y_test_pred_proba[:,1]),\n",
+ "            'train_accuracy' : sklearn.metrics.accuracy_score(y_train, y_train_pred),\n",
+ "            'train_precision' : sklearn.metrics.precision_score(y_train, y_train_pred),\n",
+ "            'train_recall' : sklearn.metrics.recall_score(y_train, y_train_pred),\n",
+ "            'train_f1' : sklearn.metrics.f1_score(y_train, y_train_pred),\n",
+ "            'train_auroc' : sklearn.metrics.roc_auc_score(y_train, y_train_pred_proba[:,1]),\n",
+ "            'train_log_loss' : sklearn.metrics.log_loss(y_train, y_train_pred_proba[:,1]),\n",
+ "        }\n",
+ "        report_dicts.append(report_dict)\n",
+ "\n",
+ "        # record feature importances\n",
+ "        fi = model.feature_importances_\n",
+ "        fi_dict = {'var'+str(i) : float(j) for i,j in enumerate(model.feature_importances_)}\n",
+ "        fi_dict['unique_id'] = itr\n",
+ "        fi_dicts.append(fi_dict)\n",
+ "\n",
+ "fi_table = pd.DataFrame(fi_dicts)\n",
+ "report_table = pd.DataFrame(report_dicts)\n",
+ "param_table = pd.DataFrame(param_dicts)\n",
+ "merged_table = pd.merge(param_table, report_table, on='unique_id')[['model','unique_id'] + \\\n",
+ "    [k for k in param_dict.keys() if k not in ['model','unique_id']] + \\\n",
+ "    [k for k in report_dict.keys() if k != 'unique_id']\n",
+ "]\n",
+ "merged_table = pd.merge(merged_table, fi_table, on='unique_id')\n",
+ "merged_table.drop('unique_id',axis=1,inplace=True)\n",
+ "merged_table.sort_values(by=['model'] + [k for k in d.keys() if k != 'random_state'],inplace=True)\n",
+ "merged_table.reset_index(inplace=True,drop=True)\n",
+ "merged_table.head(7)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "id": "64953ef5-5206-4d47-9519-311c858eb8a1",
+ "metadata": {
+ "execution": {
+ "iopub.execute_input": "2025-08-21T23:09:34.553963Z",
+ "iopub.status.busy": "2025-08-21T23:09:34.553337Z",
+ "iopub.status.idle": "2025-08-21T23:09:34.566190Z",
+ "shell.execute_reply": "2025-08-21T23:09:34.565837Z"
+ },
+ "papermill": {
+ "duration": 0.0184,
+ "end_time": "2025-08-21T23:09:34.566962",
+ "exception": false,
+ "start_time": "2025-08-21T23:09:34.548562",
+ "status": "completed"
+ },
+ "tags": []
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>n_estimators</th>\n",
+ " <th>max_depth</th>\n",
+ " <th>test_accuracy</th>\n",
+ " <th>train_accuracy</th>\n",
+ " <th>test_f1</th>\n",
+ " <th>train_f1</th>\n",
+ " <th>test_precision</th>\n",
+ " <th>train_precision</th>\n",
+ " <th>test_recall</th>\n",
+ " <th>train_recall</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>0</th>\n",
+ " <td>10</td>\n",
+ " <td>5</td>\n",
+ " <td>0.919485</td>\n",
+ " <td>0.919226</td>\n",
+ " <td>0.954340</td>\n",
+ " <td>0.954169</td>\n",
+ " <td>0.979252</td>\n",
+ " <td>0.979097</td>\n",
+ " <td>0.930664</td>\n",
+ " <td>0.930478</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>1</th>\n",
+ " <td>10</td>\n",
+ " <td>5</td>\n",
+ " <td>0.916542</td>\n",
+ " <td>0.916715</td>\n",
+ " <td>0.952738</td>\n",
+ " <td>0.952818</td>\n",
+ " <td>0.979042</td>\n",
+ " <td>0.979237</td>\n",
+ " <td>0.927811</td>\n",
+ " <td>0.927787</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>2</th>\n",
+ " <td>10</td>\n",
+ " <td>5</td>\n",
+ " <td>0.920031</td>\n",
+ " <td>0.919793</td>\n",
+ " <td>0.954555</td>\n",
+ " <td>0.954422</td>\n",
+ " <td>0.977984</td>\n",
+ " <td>0.977704</td>\n",
+ " <td>0.932223</td>\n",
+ " <td>0.932222</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>3</th>\n",
+ " <td>10</td>\n",
+ " <td>5</td>\n",
+ " <td>0.920008</td>\n",
+ " <td>0.919787</td>\n",
+ " <td>0.954554</td>\n",
+ " <td>0.954414</td>\n",
+ " <td>0.977727</td>\n",
+ " <td>0.977783</td>\n",
+ " <td>0.932454</td>\n",
+ " <td>0.932137</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>4</th>\n",
+ " <td>10</td>\n",
+ " <td>5</td>\n",
+ " <td>0.917305</td>\n",
+ " <td>0.917457</td>\n",
+ " <td>0.953070</td>\n",
+ " <td>0.953178</td>\n",
+ " <td>0.978153</td>\n",
+ " <td>0.978051</td>\n",
+ " <td>0.929241</td>\n",
+ " <td>0.929539</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>5</th>\n",
+ " <td>10</td>\n",
+ " <td>5</td>\n",
+ " <td>0.920376</td>\n",
+ " <td>0.920645</td>\n",
+ " <td>0.954760</td>\n",
731
+ " <td>0.954898</td>\n",
732
+ " <td>0.978004</td>\n",
733
+ " <td>0.978137</td>\n",
734
+ " <td>0.932595</td>\n",
735
+ " <td>0.932737</td>\n",
736
+ " </tr>\n",
737
+ " <tr>\n",
738
+ " <th>6</th>\n",
739
+ " <td>10</td>\n",
740
+ " <td>5</td>\n",
741
+ " <td>0.918176</td>\n",
742
+ " <td>0.918210</td>\n",
743
+ " <td>0.953563</td>\n",
744
+ " <td>0.953576</td>\n",
745
+ " <td>0.977817</td>\n",
746
+ " <td>0.978125</td>\n",
747
+ " <td>0.930484</td>\n",
748
+ " <td>0.930228</td>\n",
749
+ " </tr>\n",
750
+ " <tr>\n",
751
+ " <th>7</th>\n",
752
+ " <td>10</td>\n",
753
+ " <td>5</td>\n",
754
+ " <td>0.918414</td>\n",
755
+ " <td>0.918595</td>\n",
756
+ " <td>0.953670</td>\n",
757
+ " <td>0.953794</td>\n",
758
+ " <td>0.977988</td>\n",
759
+ " <td>0.978111</td>\n",
760
+ " <td>0.930531</td>\n",
761
+ " <td>0.930657</td>\n",
762
+ " </tr>\n",
763
+ " <tr>\n",
764
+ " <th>8</th>\n",
765
+ " <td>10</td>\n",
766
+ " <td>5</td>\n",
767
+ " <td>0.918874</td>\n",
768
+ " <td>0.918810</td>\n",
769
+ " <td>0.953964</td>\n",
770
+ " <td>0.953909</td>\n",
771
+ " <td>0.978158</td>\n",
772
+ " <td>0.978360</td>\n",
773
+ " <td>0.930937</td>\n",
774
+ " <td>0.930651</td>\n",
775
+ " </tr>\n",
776
+ " <tr>\n",
777
+ " <th>9</th>\n",
778
+ " <td>10</td>\n",
779
+ " <td>5</td>\n",
780
+ " <td>0.919651</td>\n",
781
+ " <td>0.919729</td>\n",
782
+ " <td>0.954408</td>\n",
783
+ " <td>0.954418</td>\n",
784
+ " <td>0.978668</td>\n",
785
+ " <td>0.978603</td>\n",
786
+ " <td>0.931322</td>\n",
787
+ " <td>0.931399</td>\n",
788
+ " </tr>\n",
789
+ " <tr>\n",
790
+ " <th>10</th>\n",
791
+ " <td>10</td>\n",
792
+ " <td>10</td>\n",
793
+ " <td>0.947473</td>\n",
794
+ " <td>0.949585</td>\n",
795
+ " <td>0.969781</td>\n",
796
+ " <td>0.970971</td>\n",
797
+ " <td>0.980907</td>\n",
798
+ " <td>0.981816</td>\n",
799
+ " <td>0.958904</td>\n",
800
+ " <td>0.960363</td>\n",
801
+ " </tr>\n",
802
+ " <tr>\n",
803
+ " <th>11</th>\n",
804
+ " <td>10</td>\n",
805
+ " <td>10</td>\n",
806
+ " <td>0.947651</td>\n",
807
+ " <td>0.949634</td>\n",
808
+ " <td>0.969879</td>\n",
809
+ " <td>0.970994</td>\n",
810
+ " <td>0.980909</td>\n",
811
+ " <td>0.981657</td>\n",
812
+ " <td>0.959094</td>\n",
813
+ " <td>0.960561</td>\n",
814
+ " </tr>\n",
815
+ " <tr>\n",
816
+ " <th>12</th>\n",
817
+ " <td>10</td>\n",
818
+ " <td>10</td>\n",
819
+ " <td>0.947336</td>\n",
820
+ " <td>0.949079</td>\n",
821
+ " <td>0.969703</td>\n",
822
+ " <td>0.970703</td>\n",
823
+ " <td>0.981415</td>\n",
824
+ " <td>0.982128</td>\n",
825
+ " <td>0.958268</td>\n",
826
+ " <td>0.959541</td>\n",
827
+ " </tr>\n",
828
+ " <tr>\n",
829
+ " <th>13</th>\n",
830
+ " <td>10</td>\n",
831
+ " <td>10</td>\n",
832
+ " <td>0.948380</td>\n",
833
+ " <td>0.950519</td>\n",
834
+ " <td>0.970292</td>\n",
835
+ " <td>0.971503</td>\n",
836
+ " <td>0.981094</td>\n",
837
+ " <td>0.982121</td>\n",
838
+ " <td>0.959725</td>\n",
839
+ " <td>0.961112</td>\n",
840
+ " </tr>\n",
841
+ " <tr>\n",
842
+ " <th>14</th>\n",
843
+ " <td>10</td>\n",
844
+ " <td>10</td>\n",
845
+ " <td>0.946944</td>\n",
846
+ " <td>0.949269</td>\n",
847
+ " <td>0.969474</td>\n",
848
+ " <td>0.970812</td>\n",
849
+ " <td>0.981403</td>\n",
850
+ " <td>0.982112</td>\n",
851
+ " <td>0.957831</td>\n",
852
+ " <td>0.959769</td>\n",
853
+ " </tr>\n",
854
+ " <tr>\n",
855
+ " <th>15</th>\n",
856
+ " <td>10</td>\n",
857
+ " <td>10</td>\n",
858
+ " <td>0.947763</td>\n",
859
+ " <td>0.949670</td>\n",
860
+ " <td>0.969946</td>\n",
861
+ " <td>0.971027</td>\n",
862
+ " <td>0.981195</td>\n",
863
+ " <td>0.982040</td>\n",
864
+ " <td>0.958953</td>\n",
865
+ " <td>0.960259</td>\n",
866
+ " </tr>\n",
867
+ " <tr>\n",
868
+ " <th>16</th>\n",
869
+ " <td>10</td>\n",
870
+ " <td>10</td>\n",
871
+ " <td>0.947479</td>\n",
872
+ " <td>0.949631</td>\n",
873
+ " <td>0.969787</td>\n",
874
+ " <td>0.971009</td>\n",
875
+ " <td>0.981054</td>\n",
876
+ " <td>0.982229</td>\n",
877
+ " <td>0.958775</td>\n",
878
+ " <td>0.960043</td>\n",
879
+ " </tr>\n",
880
+ " <tr>\n",
881
+ " <th>17</th>\n",
882
+ " <td>10</td>\n",
883
+ " <td>10</td>\n",
884
+ " <td>0.946731</td>\n",
885
+ " <td>0.948403</td>\n",
886
+ " <td>0.969353</td>\n",
887
+ " <td>0.970316</td>\n",
888
+ " <td>0.981196</td>\n",
889
+ " <td>0.981756</td>\n",
890
+ " <td>0.957793</td>\n",
891
+ " <td>0.959140</td>\n",
892
+ " </tr>\n",
893
+ " <tr>\n",
894
+ " <th>18</th>\n",
895
+ " <td>10</td>\n",
896
+ " <td>10</td>\n",
897
+ " <td>0.947526</td>\n",
898
+ " <td>0.949605</td>\n",
899
+ " <td>0.969822</td>\n",
900
+ " <td>0.971000</td>\n",
901
+ " <td>0.981211</td>\n",
902
+ " <td>0.982460</td>\n",
903
+ " <td>0.958694</td>\n",
904
+ " <td>0.959805</td>\n",
905
+ " </tr>\n",
906
+ " <tr>\n",
907
+ " <th>19</th>\n",
908
+ " <td>10</td>\n",
909
+ " <td>10</td>\n",
910
+ " <td>0.947421</td>\n",
911
+ " <td>0.949485</td>\n",
912
+ " <td>0.969779</td>\n",
913
+ " <td>0.970932</td>\n",
914
+ " <td>0.981709</td>\n",
915
+ " <td>0.982450</td>\n",
916
+ " <td>0.958135</td>\n",
917
+ " <td>0.959682</td>\n",
918
+ " </tr>\n",
919
+ " </tbody>\n",
920
+ "</table>\n",
921
+ "</div>"
922
+ ],
923
+ "text/plain": [
924
+ " n_estimators max_depth test_accuracy train_accuracy test_f1 \\\n",
925
+ "0 10 5 0.919485 0.919226 0.954340 \n",
926
+ "1 10 5 0.916542 0.916715 0.952738 \n",
927
+ "2 10 5 0.920031 0.919793 0.954555 \n",
928
+ "3 10 5 0.920008 0.919787 0.954554 \n",
929
+ "4 10 5 0.917305 0.917457 0.953070 \n",
930
+ "5 10 5 0.920376 0.920645 0.954760 \n",
931
+ "6 10 5 0.918176 0.918210 0.953563 \n",
932
+ "7 10 5 0.918414 0.918595 0.953670 \n",
933
+ "8 10 5 0.918874 0.918810 0.953964 \n",
934
+ "9 10 5 0.919651 0.919729 0.954408 \n",
935
+ "10 10 10 0.947473 0.949585 0.969781 \n",
936
+ "11 10 10 0.947651 0.949634 0.969879 \n",
937
+ "12 10 10 0.947336 0.949079 0.969703 \n",
938
+ "13 10 10 0.948380 0.950519 0.970292 \n",
939
+ "14 10 10 0.946944 0.949269 0.969474 \n",
940
+ "15 10 10 0.947763 0.949670 0.969946 \n",
941
+ "16 10 10 0.947479 0.949631 0.969787 \n",
942
+ "17 10 10 0.946731 0.948403 0.969353 \n",
943
+ "18 10 10 0.947526 0.949605 0.969822 \n",
944
+ "19 10 10 0.947421 0.949485 0.969779 \n",
945
+ "\n",
946
+ " train_f1 test_precision train_precision test_recall train_recall \n",
947
+ "0 0.954169 0.979252 0.979097 0.930664 0.930478 \n",
948
+ "1 0.952818 0.979042 0.979237 0.927811 0.927787 \n",
949
+ "2 0.954422 0.977984 0.977704 0.932223 0.932222 \n",
950
+ "3 0.954414 0.977727 0.977783 0.932454 0.932137 \n",
951
+ "4 0.953178 0.978153 0.978051 0.929241 0.929539 \n",
952
+ "5 0.954898 0.978004 0.978137 0.932595 0.932737 \n",
953
+ "6 0.953576 0.977817 0.978125 0.930484 0.930228 \n",
954
+ "7 0.953794 0.977988 0.978111 0.930531 0.930657 \n",
955
+ "8 0.953909 0.978158 0.978360 0.930937 0.930651 \n",
956
+ "9 0.954418 0.978668 0.978603 0.931322 0.931399 \n",
957
+ "10 0.970971 0.980907 0.981816 0.958904 0.960363 \n",
958
+ "11 0.970994 0.980909 0.981657 0.959094 0.960561 \n",
959
+ "12 0.970703 0.981415 0.982128 0.958268 0.959541 \n",
960
+ "13 0.971503 0.981094 0.982121 0.959725 0.961112 \n",
961
+ "14 0.970812 0.981403 0.982112 0.957831 0.959769 \n",
962
+ "15 0.971027 0.981195 0.982040 0.958953 0.960259 \n",
963
+ "16 0.971009 0.981054 0.982229 0.958775 0.960043 \n",
964
+ "17 0.970316 0.981196 0.981756 0.957793 0.959140 \n",
965
+ "18 0.971000 0.981211 0.982460 0.958694 0.959805 \n",
966
+ "19 0.970932 0.981709 0.982450 0.958135 0.959682 "
967
+ ]
968
+ },
969
+ "execution_count": 7,
970
+ "metadata": {},
971
+ "output_type": "execute_result"
972
+ }
973
+ ],
974
+ "source": [
+ "merged_table.head(20)[['n_estimators', 'max_depth',\n",
+ " 'test_accuracy','train_accuracy',\n",
+ " 'test_f1','train_f1',\n",
+ " 'test_precision','train_precision',\n",
+ " 'test_recall','train_recall',]]"
+ ]
981
+ },
982
+ {
983
+ "cell_type": "code",
984
+ "execution_count": 8,
985
+ "id": "db7f4f13-a696-4314-b0f4-028852863573",
986
+ "metadata": {
987
+ "execution": {
988
+ "iopub.execute_input": "2025-08-21T23:09:34.576689Z",
989
+ "iopub.status.busy": "2025-08-21T23:09:34.576115Z",
990
+ "iopub.status.idle": "2025-08-21T23:09:34.590011Z",
991
+ "shell.execute_reply": "2025-08-21T23:09:34.589688Z"
992
+ },
993
+ "papermill": {
994
+ "duration": 0.019834,
995
+ "end_time": "2025-08-21T23:09:34.590874",
996
+ "exception": false,
997
+ "start_time": "2025-08-21T23:09:34.571040",
998
+ "status": "completed"
999
+ },
1000
+ "tags": []
1001
+ },
1002
+ "outputs": [],
1003
+ "source": [
+ "# merged_table.to_csv('third_model_results.csv',index=False,header=True)\n",
+ "merged_table.to_csv(output_file,index=False,header=True)"
+ ]
1007
+ }
1008
+ ],
1009
+ "metadata": {
1010
+ "kernelspec": {
1011
+ "display_name": "Birdclef",
1012
+ "language": "python",
1013
+ "name": "birdclef"
1014
+ },
1015
+ "language_info": {
1016
+ "codemirror_mode": {
1017
+ "name": "ipython",
1018
+ "version": 3
1019
+ },
1020
+ "file_extension": ".py",
1021
+ "mimetype": "text/x-python",
1022
+ "name": "python",
1023
+ "nbconvert_exporter": "python",
1024
+ "pygments_lexer": "ipython3",
1025
+ "version": "3.12.11"
1026
+ },
1027
+ "papermill": {
1028
+ "default_parameters": {},
1029
+ "duration": 3943.115939,
1030
+ "end_time": "2025-08-21T23:09:35.210970",
1031
+ "environment_variables": {},
1032
+ "exception": null,
1033
+ "input_path": "fit_model.ipynb",
1034
+ "output_path": "ran/fit_model.ipynb",
1035
+ "parameters": {
1036
+ "input_file": "xgb_rnd3_next.csv",
1037
+ "output_file": "third_model_results.csv"
1038
+ },
1039
+ "start_time": "2025-08-21T22:03:52.095031",
1040
+ "version": "2.6.0"
1041
+ }
1042
+ },
1043
+ "nbformat": 4,
1044
+ "nbformat_minor": 5
1045
+ }
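The results table in the notebook above repeats each (n_estimators, max_depth) setting once per random split, so comparing settings is easier after averaging over those repeats. The following is a minimal sketch of that step, not code from the notebook: the `summarize` helper and the `gap` field are hypothetical names, the four rows are copied from the displayed table rather than read from `third_model_results.csv`, and it uses only the standard library where the notebook itself uses pandas.

```python
from collections import defaultdict
from statistics import fmean

# Four rows copied from the table shown in the notebook output
# (two repeats each for max_depth=5 and max_depth=10); not the full file.
rows = [
    {"n_estimators": 10, "max_depth": 5, "test_accuracy": 0.919485, "train_accuracy": 0.919226},
    {"n_estimators": 10, "max_depth": 5, "test_accuracy": 0.916542, "train_accuracy": 0.916715},
    {"n_estimators": 10, "max_depth": 10, "test_accuracy": 0.947473, "train_accuracy": 0.949585},
    {"n_estimators": 10, "max_depth": 10, "test_accuracy": 0.947651, "train_accuracy": 0.949634},
]


def summarize(rows, keys=("n_estimators", "max_depth")):
    """Average accuracy over repeated splits for each hyperparameter setting.

    Hypothetical helper: returns {setting: {"test_accuracy", "train_accuracy", "gap"}},
    where gap = mean train accuracy - mean test accuracy.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in keys)].append(row)

    summary = {}
    for setting in sorted(groups):
        members = groups[setting]
        test = fmean([r["test_accuracy"] for r in members])
        train = fmean([r["train_accuracy"] for r in members])
        summary[setting] = {
            "test_accuracy": test,
            "train_accuracy": train,
            "gap": train - test,
        }
    return summary


if __name__ == "__main__":
    for setting, stats in summarize(rows).items():
        print(setting, stats)
```

A larger train-minus-test gap for a setting is one rough signal of overfitting, which is the trade-off the README weighs when it favors simpler models.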
models/xgboost_third_model.json ADDED
The diff for this file is too large to render. See raw diff
 
models/xgboost_third_model_not_2025.json ADDED
The diff for this file is too large to render. See raw diff