mboukabous committed
Commit 7c045bd · 1 Parent(s): 6329c3b

Add application file
README.md CHANGED
@@ -1,13 +1,211 @@
  ---
- title: Train Classificator
- emoji: 👀
- colorFrom: blue
- colorTo: purple
- sdk: gradio
- sdk_version: 5.9.1
- app_file: app.py
- pinned: false
- license: mit
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # AI-Algorithms-Made-Easy
+
+ **Under Development**
+
+ ![Under Development](under_development.png?raw=true "Under Development")
+
+ Welcome to **AI-Algorithms-Made-Easy**! This project is a comprehensive collection of artificial intelligence algorithms implemented from scratch using **PyTorch**. Our goal is to demystify AI by providing clear, easy-to-understand code and detailed explanations for each algorithm.
+
+ Whether you're a beginner in machine learning or an experienced practitioner, this project offers resources to enhance your understanding and skills in AI.
+
+ ---
+
+ ## Project Description
+
+ **AI-Algorithms-Made-Easy** aims to make AI accessible to everyone by:
+
+ - **Intuitive Implementations**: Breaking down complex algorithms into understandable components with step-by-step code.
+ - **Educational Notebooks**: Providing Jupyter notebooks that combine theory with practical examples.
+ - **Interactive Demos**: Offering user-friendly interfaces built with **Gradio** to experiment with algorithms in real time.
+ - **Comprehensive Documentation**: Supplying in-depth guides and resources to support your AI learning journey.
+
+ Our mission is to simplify the learning process and provide hands-on tools to explore and understand AI concepts effectively.
+
  ---
+
+ ## Table of Contents
+
+ - [Algorithms Implemented](#algorithms-implemented)
+ - [Project Structure](#project-structure)
+ - [Installation](#installation)
+ - [Usage](#usage)
+ - [Contributing](#contributing)
+ - [License](#license)
+ - [Contact](#contact)
+
+ ---
+
+ ## Algorithms Implemented
+
+ *This project is currently under development. Stay tuned for updates!*
+
+ ### Supervised Learning (Scikit-Learn)
+ #### Regression ([Documentation](docs/Regression_Documentation.md), [Interface](https://huggingface.co/spaces/mboukabous/train_regression), [Notebook](notebooks/Train_Supervised_Regression_Models.ipynb) [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mboukabous/AI-Algorithms-Made-Easy/blob/main/notebooks/Train_Supervised_Regression_Models.ipynb))
+ - [Linear Regression](models/supervised/regression/linear_regression.py)
+ - [Ridge Regression](models/supervised/regression/ridge_regression.py)
+ - [Lasso Regression](models/supervised/regression/lasso_regression.py)
+ - [ElasticNet Regression](models/supervised/regression/elasticnet_regression.py)
+ - [Decision Tree](models/supervised/regression/decision_tree_regressor.py)
+ - [Random Forest (Bagging)](models/supervised/regression/random_forest_regressor.py)
+ - [Gradient Boosting (Boosting)](models/supervised/regression/gradient_boosting_regressor.py)
+ - [AdaBoost (Boosting)](models/supervised/regression/adaboost_regressor.py)
+ - [XGBoost (Boosting)](models/supervised/regression/xgboost_regressor.py)
+ - [LightGBM](models/supervised/regression/lightgbm_regressor.py)
+ - [CatBoost](models/supervised/regression/catboost_regressor.py)
+ - [Support Vector Regressor (SVR)](models/supervised/regression/support_vector_regressor.py)
+ - [K-Nearest Neighbors (KNN) Regressor](models/supervised/regression/knn_regressor.py)
+ - [Extra Trees Regressor](models/supervised/regression/extra_trees_regressor.py)
+ - [Multilayer Perceptron (MLP) Regressor](models/supervised/regression/mlp_regressor.py)
+
+ #### Classification ([Documentation](docs/Classification_Documentation.md))
+ - [Logistic Regression](models/supervised/classification/logistic_regression.py)
+ - [Decision Tree Classifier](models/supervised/classification/decision_tree_classifier.py)
+ - [Random Forest Classifier (Bagging)](models/supervised/classification/random_forest_classifier.py)
+ - [Extra Trees Classifier](models/supervised/classification/extra_trees_classifier.py)
+ - [Gradient Boosting Classifier (Boosting)](models/supervised/classification/gradient_boosting_classifier.py)
+ - [AdaBoost Classifier (Boosting)](models/supervised/classification/adaboost_classifier.py)
+ - [XGBoost Classifier (Boosting)](models/supervised/classification/xgboost_classifier.py)
+ - [LightGBM Classifier (Boosting)](models/supervised/classification/lightgbm_classifier.py)
+ - [CatBoost Classifier (Boosting)](models/supervised/classification/catboost_classifier.py)
+ - [Support Vector Classifier (SVC)](models/supervised/classification/svc.py)
+ - [K-Nearest Neighbors (KNN) Classifier](models/supervised/classification/knn_classifier.py)
+ - [Multilayer Perceptron (MLP) Classifier](models/supervised/classification/mlp_classifier.py)
+ - [GaussianNB (Naive Bayes Classifier)](models/supervised/classification/gaussian_nb.py)
+ - [Linear Discriminant Analysis (LDA)](models/supervised/classification/linear_discriminant_analysis.py)
+ - [Quadratic Discriminant Analysis (QDA)](models/supervised/classification/quadratic_discriminant_analysis.py)
+
+ ### Unsupervised Learning
+
+ - K-Means Clustering
+ - Principal Component Analysis (PCA)
+ - Hierarchical Clustering
+ - Autoencoders
+ - Isolation Forest
+ - Gaussian Mixture Models
+
+ ### Deep Learning (DL)
+
+ - Convolutional Neural Networks (CNN)
+ - Recurrent Neural Networks (RNN)
+ - Long Short-Term Memory Networks (LSTM)
+ - Gated Recurrent Unit (GRU)
+ - Generative Adversarial Networks (GAN)
+ - Transformers
+ - Attention Mechanisms
+
+ ### Computer Vision
+
+ - Image Classification / Transfer Learning (TL)
+ - Object Detection
+ - Semantic Segmentation
+ - Style Transfer
+ - Image Captioning
+ - Generative Models
+
+ ### Natural Language Processing (NLP)
+
+ - Sentiment Analysis (SA)
+ - Machine Translation
+ - Named Entity Recognition (NER)
+ - Text Classification
+ - Text Summarization
+ - Question Answering
+ - Language Modeling
+ - Transformer Models
+
+ ### Time Series Analysis
+
+ - Time Series Forecasting with RNNs
+ - Temporal Convolutional Networks (TCNs)
+ - Transformers for Time Series
+
+ ### Reinforcement Learning
+
+ - Q-Learning
+ - Deep Q-Networks (DQN)
+ - Policy Gradients
+ - Actor-Critic Methods
+ - Proximal Policy Optimization
+
+ ### And more...
+
+ ---
+
+ ## Project Structure
+
+ - **models/**: Contains all the AI algorithm implementations, organized by category.
+ - **data/**: Includes datasets and data preprocessing utilities.
+ - **utils/**: Utility scripts and helper functions.
+ - **scripts/**: Executable scripts for training, testing, and other tasks.
+ - **interfaces/**: Interactive applications using Gradio and web interfaces.
+ - **notebooks/**: Jupyter notebooks for tutorials and demonstrations.
+ - **deploy/**: Scripts and instructions for deploying models.
+ - **website/**: Files related to the project website.
+ - **docs/**: Project documentation.
+ - **examples/**: Example scripts demonstrating how to use the models.
+
+ ---
+
+ ## Installation
+
+ *Installation instructions will be provided once the initial release is available.*
+
+ ---
+
+ ## Usage
+
+ *Usage examples and tutorials will be added as the project develops.*
+
+ ---
+
+ ## Contributing
+
+ We welcome contributions from the community! To contribute:
+
+ 1. **Fork the repository** on GitHub.
+ 2. **Clone your fork** to your local machine.
+ 3. **Create a new branch** for your feature or bug fix.
+ 4. **Make your changes** and commit them with descriptive messages.
+ 5. **Push your changes** to your forked repository.
+ 6. **Open a pull request** to the main repository.
+
+ Please read our [Contributing Guidelines](CONTRIBUTING.md) for more details.
+
+ ---
+
+ ## License
+
+ This project is licensed under the **MIT License** - see the [LICENSE](LICENSE) file for details.
+
+ ---
+
+ ## Contact
+
+ For questions, suggestions, or feedback:
+
+ - **GitHub Issues**: Please open an issue on the [GitHub repository](https://github.com/mboukabous/AI-Algorithms-Made-Easy/issues).
+ - **Email**: You can reach us at [m.boukabous95@gmail.com](mailto:m.boukabous95@gmail.com).
+
+ ---
+
+ *Thank you for your interest in **AI-Algorithms-Made-Easy**! We are excited to build this resource and appreciate your support and contributions.*
+
+ ---
+
+ ## Acknowledgments
+
+ - **PyTorch**: For providing an excellent deep learning framework.
+ - **Gradio**: For simplifying the creation of interactive demos.
+ - **OpenAI's ChatGPT**: For assistance in planning and drafting project materials.
+
+ ---
+
+ ## Stay Updated
+
+ - **Watch** this repository for updates.
+ - **Star** the project if you find it helpful.
+ - **Share** with others who might be interested in learning AI algorithms.
+
  ---

+ *Let's make AI accessible and easy to learn for everyone!*
app.py ADDED
@@ -0,0 +1,272 @@
+ """
+ Gradio Interface for Training Classification Models
+
+ This script provides a Gradio-based user interface to train classification models using various datasets
+ and algorithms. It allows users to select models, preprocess data, specify hyperparameters, and visualize
+ results through an intuitive web interface.
+
+ Features:
+ - **Model Selection**: Choose from classification algorithms in `models/supervised/classification`.
+ - **Dataset Input Options**:
+     - Upload a local CSV file.
+     - Specify a path to a dataset.
+     - Download datasets from Kaggle by uploading `kaggle.json` and specifying a competition name.
+ - **Hyperparameter Customization**: Modify parameters such as test size, random state, CV folds, and scoring metric.
+ - **Visualizations**: If enabled, generate classification metrics charts and confusion matrices after training.
+ - **Interactive Training**: Outputs training metrics, best hyperparameters, and paths to saved models.
+
+ Usage:
+ - Place this script in `interfaces/gradio/`.
+ - Ensure proper project structure and availability of `train_classification_model.py` and classification model modules.
+ - Run the script. A Gradio interface will launch for interactive model training.
+
+ Requirements:
+ - Python 3.7 or higher
+ - Required Python libraries as specified in `requirements.txt`
+ - Properly structured project with `train_classification_model.py` and classification modules.
+ """
+
+ import gradio as gr
+ import pandas as pd
+ import os
+ import subprocess
+ import sys
+ import glob
+ import re
+
+ # Add the project root directory to the Python path
+ current_dir = os.path.dirname(os.path.abspath(__file__))
+ project_root = os.path.abspath(os.path.join(current_dir, '../../'))
+ sys.path.append(project_root)
+
+ def get_classification_model_modules():
+     # Get the list of available classification model modules
+     models_dir = os.path.join(project_root, 'models', 'supervised', 'classification')
+     model_files = glob.glob(os.path.join(models_dir, '*.py'))
+
+     print(f"Looking for model files in: {models_dir}")
+     print(f"Found model files: {model_files}")
+
+     models = [os.path.splitext(os.path.basename(f))[0] for f in model_files if not f.endswith('__init__.py')]
+     model_modules = [f"{model}" for model in models]
+     return model_modules
+
+ def download_kaggle_data(json_path, competition_name):
+     # Import the get_kaggle_data function
+     from data.datasets.kaggle_data import get_kaggle_data
+     data_path = get_kaggle_data(json_path=json_path, data_name=competition_name, is_competition=True)
+     return data_path
+
+ def train_model(model_module, data_option, data_file, data_path, data_name_kaggle, kaggle_json_file, competition_name,
+                 target_variable, drop_columns, test_size, random_state, cv_folds,
+                 scoring_metric, model_save_path, results_save_path, visualize):
+
+     # Determine data_path
+     if data_option == 'Upload Data File':
+         if data_file is None:
+             return "Please upload a data file.", None
+         data_path = data_file  # data_file is the path to the uploaded file
+     elif data_option == 'Provide Data Path':
+         if not os.path.exists(data_path):
+             return "Provided data path does not exist.", None
+     elif data_option == 'Download from Kaggle':
+         if kaggle_json_file is None:
+             return "Please upload your kaggle.json file.", None
+         else:
+             # Save the kaggle.json file to ~/.kaggle/kaggle.json
+             import shutil
+             kaggle_config_dir = os.path.expanduser('~/.kaggle')
+             os.makedirs(kaggle_config_dir, exist_ok=True)
+             kaggle_json_path = os.path.join(kaggle_config_dir, 'kaggle.json')
+             # gr.File(type="filepath") yields a path string, not a file object
+             shutil.copy(kaggle_json_file, kaggle_json_path)
+             os.chmod(kaggle_json_path, 0o600)
+             data_dir = download_kaggle_data(json_path=kaggle_json_path, competition_name=competition_name)
+             if data_dir is None:
+                 return "Failed to download data from Kaggle.", None
+             # Use the specified data_name_kaggle
+             data_path = os.path.join(data_dir, data_name_kaggle)
+             if not os.path.exists(data_path):
+                 return f"{data_name_kaggle} not found in the downloaded Kaggle data.", None
+     else:
+         return "Invalid data option selected.", None
+
+     # Prepare command-line arguments for train_classification_model.py
+     cmd = [sys.executable, os.path.join(project_root, 'scripts', 'train_classification_model.py')]
+     cmd.extend(['--model_module', model_module])
+     cmd.extend(['--data_path', data_path])
+     cmd.extend(['--target_variable', target_variable])
+
+     if drop_columns:
+         cmd.extend(['--drop_columns', ','.join(drop_columns)])
+     if test_size != 0.2:
+         cmd.extend(['--test_size', str(test_size)])
+     if random_state != 42:
+         cmd.extend(['--random_state', str(int(random_state))])
+     if cv_folds != 5:
+         cmd.extend(['--cv_folds', str(int(cv_folds))])
+     if scoring_metric:
+         cmd.extend(['--scoring_metric', scoring_metric])
+     if model_save_path:
+         cmd.extend(['--model_path', model_save_path])
+     if results_save_path:
+         cmd.extend(['--results_path', results_save_path])
+     if visualize:
+         cmd.append('--visualize')
+
+     print(f"Executing command: {' '.join(cmd)}")
+
+     try:
+         result = subprocess.run(cmd, capture_output=True, text=True)
+         output = result.stdout
+         errors = result.stderr
+         if result.returncode != 0:
+             return f"Error during training:\n{errors}", None
+         else:
+             # Clean up output
+             output = re.sub(r"Figure\(\d+x\d+\)", "", output).strip()
+
+             # Attempt to find confusion_matrix.png if visualize is True
+             plot_image_path = None
+             if results_save_path:
+                 # Showing the confusion matrix
+                 plot_image_path = os.path.join(results_save_path, 'confusion_matrix.png')
+             elif 'Confusion matrix saved to ' in output:
+                 # Default path if results_save_path is not provided
+                 plot_image_path = output.split('Confusion matrix saved to ')[1].strip()
+             return f"Training completed successfully.\n\n{output}", plot_image_path
+     except Exception as e:
+         return f"An error occurred:\n{str(e)}", None
+
+ def get_columns_from_data(data_option, data_file, data_path, data_name_kaggle, kaggle_json_file, competition_name):
+     # Determine data_path
+     if data_option == 'Upload Data File':
+         if data_file is None:
+             return []
+         data_path = data_file
+     elif data_option == 'Provide Data Path':
+         if not os.path.exists(data_path):
+             return []
+     elif data_option == 'Download from Kaggle':
+         if kaggle_json_file is None:
+             return []
+         else:
+             import shutil
+             kaggle_config_dir = os.path.expanduser('~/.kaggle')
+             os.makedirs(kaggle_config_dir, exist_ok=True)
+             kaggle_json_path = os.path.join(kaggle_config_dir, 'kaggle.json')
+             # gr.File(type="filepath") yields a path string, not a file object
+             shutil.copy(kaggle_json_file, kaggle_json_path)
+             os.chmod(kaggle_json_path, 0o600)
+             data_dir = download_kaggle_data(json_path=kaggle_json_path, competition_name=competition_name)
+             if data_dir is None:
+                 return []
+             data_path = os.path.join(data_dir, data_name_kaggle)
+             if not os.path.exists(data_path):
+                 return []
+     else:
+         return []
+
+     try:
+         data = pd.read_csv(data_path)
+         columns = data.columns.tolist()
+         return columns
+     except Exception as e:
+         print(f"Error reading data file: {e}")
+         return []
+
+ def update_columns(data_option, data_file, data_path, data_name_kaggle, kaggle_json_file, competition_name):
+     columns = get_columns_from_data(data_option, data_file, data_path, data_name_kaggle, kaggle_json_file, competition_name)
+     if not columns:
+         return gr.update(choices=[]), gr.update(choices=[])
+     else:
+         return gr.update(choices=columns), gr.update(choices=columns)
+
+ model_modules = get_classification_model_modules()
+
+ if not model_modules:
+     print("No classification model modules found. Check 'models/supervised/classification' directory.")
+
+ with gr.Blocks() as demo:
+     gr.Markdown("# Train a Classification Model")
+
+     with gr.Row():
+         model_module_input = gr.Dropdown(choices=model_modules, label="Select Classification Model Module")
+         scoring_metric_input = gr.Textbox(value='accuracy', label="Scoring Metric (e.g., accuracy, f1, roc_auc)")
+
+     with gr.Row():
+         test_size_input = gr.Slider(minimum=0.1, maximum=0.5, step=0.05, value=0.2, label="Test Size")
+         random_state_input = gr.Number(value=42, label="Random State")
+         cv_folds_input = gr.Number(value=5, label="CV Folds", precision=0)
+
+     visualize_input = gr.Checkbox(label="Generate Visualizations (metrics & confusion matrix)", value=True)
+
+     with gr.Row():
+         model_save_path_input = gr.Textbox(value='', label="Model Save Path (optional)")
+         results_save_path_input = gr.Textbox(value='', label="Results Save Path (optional)")
+
+     with gr.Tab("Data Input"):
+         data_option_input = gr.Radio(choices=['Upload Data File', 'Provide Data Path', 'Download from Kaggle'], label="Data Input Option", value='Upload Data File')
+
+         upload_data_col = gr.Column(visible=True)
+         with upload_data_col:
+             data_file_input = gr.File(label="Upload CSV Data File", type="filepath")
+
+         data_path_col = gr.Column(visible=False)
+         with data_path_col:
+             data_path_input = gr.Textbox(value='', label="Data File Path")
+
+         kaggle_data_col = gr.Column(visible=False)
+         with kaggle_data_col:
+             kaggle_json_file_input = gr.File(label="Upload kaggle.json File", type="filepath")
+             competition_name_input = gr.Textbox(value='', label="Kaggle Competition Name")
+             data_name_kaggle_input = gr.Textbox(value='train.csv', label="Data File Name (in Kaggle dataset)")
+
+     def toggle_data_input(option):
+         if option == 'Upload Data File':
+             return gr.update(visible=True), gr.update(visible=False), gr.update(visible=False)
+         elif option == 'Provide Data Path':
+             return gr.update(visible=False), gr.update(visible=True), gr.update(visible=False)
+         elif option == 'Download from Kaggle':
+             return gr.update(visible=False), gr.update(visible=False), gr.update(visible=True)
+
+     data_option_input.change(
+         fn=toggle_data_input,
+         inputs=[data_option_input],
+         outputs=[upload_data_col, data_path_col, kaggle_data_col]
+     )
+
+     update_cols_btn = gr.Button("Update Columns")
+
+     target_variable_input = gr.Dropdown(choices=[], label="Select Target Variable")
+     drop_columns_input = gr.CheckboxGroup(choices=[], label="Columns to Drop")
+
+     update_cols_btn.click(
+         fn=update_columns,
+         inputs=[data_option_input, data_file_input, data_path_input, data_name_kaggle_input, kaggle_json_file_input, competition_name_input],
+         outputs=[target_variable_input, drop_columns_input]
+     )
+
+     train_btn = gr.Button("Train Model")
+     output_display = gr.Textbox(label="Output")
+     image_display = gr.Image(label="Visualization", visible=True)
+
+     def run_training(*args):
+         output_text, plot_image_path = train_model(*args)
+         if plot_image_path and os.path.exists(plot_image_path):
+             return output_text, plot_image_path
+         else:
+             return output_text, None
+
+     train_btn.click(
+         fn=run_training,
+         inputs=[
+             model_module_input, data_option_input, data_file_input, data_path_input,
+             data_name_kaggle_input, kaggle_json_file_input, competition_name_input,
+             target_variable_input, drop_columns_input, test_size_input, random_state_input, cv_folds_input,
+             scoring_metric_input, model_save_path_input, results_save_path_input, visualize_input
+         ],
+         outputs=[output_display, image_display]
+     )
+
+ if __name__ == "__main__":
+     demo.launch()
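The `train_model` function above builds the CLI invocation incrementally, appending a flag only when the value differs from the training script's default. A standalone sketch of that pattern (`build_cmd` is a hypothetical helper written for illustration, not part of `app.py`):

```python
import sys

def build_cmd(model_module, data_path, target_variable,
              test_size=0.2, cv_folds=5, visualize=False):
    """Assemble a CLI command list, adding flags only for non-default values."""
    cmd = [sys.executable, 'scripts/train_classification_model.py',
           '--model_module', model_module,
           '--data_path', data_path,
           '--target_variable', target_variable]
    if test_size != 0.2:                       # only pass overrides
        cmd.extend(['--test_size', str(test_size)])
    if cv_folds != 5:
        cmd.extend(['--cv_folds', str(int(cv_folds))])
    if visualize:
        cmd.append('--visualize')              # boolean flag, no value
    return cmd

cmd = build_cmd('logistic_regression', 'data/train.csv', 'Survived',
                test_size=0.3, visualize=True)
print(cmd)
```

Keeping defaults out of the command line means the training script's own defaults stay the single source of truth.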
data/README.md ADDED
@@ -0,0 +1 @@
+ # data
data/datasets/README.md ADDED
@@ -0,0 +1,21 @@
+ # Datasets Utilities
+
+ This folder contains utility scripts for handling datasets, including downloading data from Kaggle.
+
+ ## 📄 Scripts
+
+ ### `kaggle_data.py`
+
+ - **Description**: A Python script to download Kaggle datasets or competition data seamlessly, supporting Google Colab, local Linux/Mac, and Windows environments.
+ - **Path**: [`data/datasets/kaggle_data.py`](kaggle_data.py)
+ - **Key Function**: `get_kaggle_data(json_path, data_name, is_competition=False, output_dir='data/raw')`
+ - **Example**:
+
+ ```python
+ from kaggle_data import get_kaggle_data
+
+ # Download a standard Kaggle dataset
+ dataset_path = get_kaggle_data("kaggle.json", "paultimothymooney/chest-xray-pneumonia")
+
+ # Download competition data
+ competition_path = get_kaggle_data("kaggle.json", "house-prices-advanced-regression-techniques", is_competition=True)
+ ```
data/datasets/kaggle_data.py ADDED
@@ -0,0 +1,115 @@
+ """
+ This module provides a utility function to download Kaggle datasets or competition data.
+
+ The function automatically detects whether it is running in a Google Colab environment, a local Linux/Mac environment, or a Windows environment, and sets up the Kaggle API accordingly.
+
+ Requirements:
+     - Kaggle API installed (`pip install kaggle`)
+     - Kaggle API key (`kaggle.json`) with appropriate permissions.
+
+ Environment Detection:
+     - Google Colab: Uses `/root/.config/kaggle/kaggle.json`.
+     - Local Linux/Mac: Uses `~/.kaggle/kaggle.json`.
+     - Windows: Uses `C:\\Users\\<Username>\\.kaggle\\kaggle.json`.
+
+ Functions:
+     get_kaggle_data(json_path: str, data_name: str, is_competition: bool = False, output_dir: str = "data/raw") -> str
+ """
+
+ import os
+ import zipfile
+ import sys
+ import shutil
+ import platform
+
+ def get_kaggle_data(json_path: str, data_name: str, is_competition: bool = False, output_dir: str = "data/raw") -> str:
+     """
+     Downloads a Kaggle dataset or competition data using the Kaggle API in a Google Colab, local Linux/Mac, or Windows environment.
+
+     Parameters:
+         json_path (str): Path to your 'kaggle.json' file.
+         data_name (str): Kaggle dataset or competition name (e.g., 'paultimothymooney/chest-xray-pneumonia' or 'house-prices-advanced-regression-techniques').
+         is_competition (bool): Set to True if downloading competition data. Default is False (for datasets).
+         output_dir (str): Directory to save and extract the data. Default is 'data/raw'.
+
+     Returns:
+         str: Path to the extracted dataset folder.
+
+     Raises:
+         OSError: If 'kaggle.json' is not found or cannot be copied.
+         Exception: If there is an error during download or extraction.
+
+     Example of Usage:
+         # For downloading a standard dataset
+         dataset_path = get_kaggle_data("kaggle.json", "paultimothymooney/chest-xray-pneumonia")
+         print(f"Dataset is available at: {dataset_path}")
+
+         # For downloading competition data
+         competition_path = get_kaggle_data("kaggle.json", "house-prices-advanced-regression-techniques", is_competition=True)
+         print(f"Competition data is available at: {competition_path}")
+     """
+     # Detect environment (Colab, local Linux/Mac, or Windows)
+     is_colab = "google.colab" in sys.modules
+     is_windows = platform.system() == "Windows"
+
+     # Step 1: Set up Kaggle API credentials
+     try:
+         if is_colab:
+             config_dir = "/root/.config/kaggle"
+             os.makedirs(config_dir, exist_ok=True)
+             print("Setting up Kaggle API credentials for Colab environment.")
+             shutil.copy(json_path, os.path.join(config_dir, "kaggle.json"))
+             os.chmod(os.path.join(config_dir, "kaggle.json"), 0o600)
+         else:
+             # For both local Linux/Mac and Windows, use the home directory
+             config_dir = os.path.join(os.path.expanduser("~"), ".kaggle")
+             os.makedirs(config_dir, exist_ok=True)
+             print("Setting up Kaggle API credentials for local environment.")
+             kaggle_json_dest = os.path.join(config_dir, "kaggle.json")
+             if not os.path.exists(kaggle_json_dest):
+                 shutil.copy(json_path, kaggle_json_dest)
+                 if not is_windows:
+                     os.chmod(kaggle_json_dest, 0o600)
+     except Exception as e:
+         raise OSError(f"Could not set up Kaggle API credentials: {e}")
+
+     # Step 2: Create output directory
+     dataset_dir = os.path.join(output_dir, data_name.split('/')[-1])
+     os.makedirs(dataset_dir, exist_ok=True)
+     original_dir = os.getcwd()
+     os.chdir(dataset_dir)
+
+     # Step 3: Download the dataset or competition data
+     try:
+         if is_competition:
+             print(f"Downloading competition data: {data_name}")
+             cmd = f"kaggle competitions download -c {data_name}"
+         else:
+             print(f"Downloading dataset: {data_name}")
+             cmd = f"kaggle datasets download -d {data_name}"
+         os.system(cmd)
+     except Exception as e:
+         print(f"Error during download: {e}")
+         os.chdir(original_dir)
+         return None
+
+     # Step 4: Unzip all downloaded files
+     zip_files = [f for f in os.listdir() if f.endswith(".zip")]
+     if not zip_files:
+         print("No zip files found. Please check the dataset or competition name.")
+         os.chdir(original_dir)
+         return None
+
+     for zip_file in zip_files:
+         try:
+             with zipfile.ZipFile(zip_file, "r") as zip_ref:
+                 zip_ref.extractall()
+             print(f"Extracted: {zip_file}")
+             os.remove(zip_file)
+         except Exception as e:
+             print(f"Error extracting {zip_file}: {e}")
+
+     # Step 5: Navigate back to the original directory
+     os.chdir(original_dir)
+
+     return dataset_dir
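Step 4's unzip-and-delete loop can be exercised in isolation. The sketch below builds a small archive in a throwaway temporary directory and unpacks it the same way `get_kaggle_data` does (extract each `.zip`, then remove the archive); the file names here are illustrative only:

```python
import os
import tempfile
import zipfile

# Build a tiny archive to stand in for a downloaded Kaggle zip.
workdir = tempfile.mkdtemp()
zip_path = os.path.join(workdir, 'data.zip')
with zipfile.ZipFile(zip_path, 'w') as zf:
    zf.writestr('train.csv', 'a,b\n1,2\n')

# Extract every zip in the directory, then delete the archive (Step 4).
for zip_file in [f for f in os.listdir(workdir) if f.endswith('.zip')]:
    full = os.path.join(workdir, zip_file)
    with zipfile.ZipFile(full, 'r') as zip_ref:
        zip_ref.extractall(workdir)
    os.remove(full)

print(sorted(os.listdir(workdir)))  # only the extracted contents remain
```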
data/raw/README.md ADDED
@@ -0,0 +1 @@
+ # raw
models/README.md ADDED
@@ -0,0 +1 @@
+ # models
models/supervised/classification/README.md ADDED
@@ -0,0 +1,32 @@
+ # Classification Models
+
+ This directory contains Python scripts that define various classification models and their associated hyperparameter grids. Each model file sets up a scikit-learn-compatible estimator and defines a parameter grid for use with the `train_classification_model.py` script.
+
+ These model definition files:
+ - Specify an estimator (e.g., `LogisticRegression()`, `RandomForestClassifier()`, `XGBClassifier()`).
+ - Define a `param_grid` dict for hyperparameter tuning using `GridSearchCV`.
+ - Optionally provide a `default_scoring` metric (e.g., `accuracy`).
+ - Work for both binary and multi-class classification tasks.
+ - Are intended to be flexible and modular, allowing easy swapping of models without changing other parts of the code.
+
+ **Note:** Preprocessing steps, hyperparameter tuning logic, and label encoding for categorical targets are handled externally by the scripts and utilities.
+
+ ## Available Classification Models
+
+ - [Logistic Regression](logistic_regression.py)
+ - [Decision Tree Classifier](decision_tree_classifier.py)
+ - [Random Forest Classifier (Bagging)](random_forest_classifier.py)
+ - [Extra Trees Classifier](extra_trees_classifier.py)
+ - [Gradient Boosting Classifier (Boosting)](gradient_boosting_classifier.py)
+ - [AdaBoost Classifier (Boosting)](adaboost_classifier.py)
+ - [XGBoost Classifier (Boosting)](xgboost_classifier.py)
+ - [LightGBM Classifier (Boosting)](lightgbm_classifier.py)
+ - [CatBoost Classifier (Boosting)](catboost_classifier.py)
+ - [Support Vector Classifier (SVC)](svc.py)
+ - [K-Nearest Neighbors (KNN) Classifier](knn_classifier.py)
+ - [Multilayer Perceptron (MLP) Classifier](mlp_classifier.py)
+ - [GaussianNB (Naive Bayes Classifier)](gaussian_nb.py)
+ - [Linear Discriminant Analysis (LDA)](linear_discriminant_analysis.py)
+ - [Quadratic Discriminant Analysis (QDA)](quadratic_discriminant_analysis.py)
+
+ To train any of these models, specify the `--model_module` argument with the appropriate model name (e.g., `logistic_regression`) when running `train_classification_model.py`.
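To illustrate the contract these modules follow, here is a minimal, self-contained sketch of how an `estimator` and `param_grid` are typically consumed by a scikit-learn pipeline. The actual wiring lives in `train_classification_model.py` (not shown here), so treat this as an approximation; the `model__` prefix in `param_grid` keys targets the `'model'` step of the pipeline:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# What a module such as logistic_regression.py would expose:
estimator = LogisticRegression(max_iter=1000)
param_grid = {'model__C': [0.1, 1.0]}   # 'model__' routes to the pipeline's 'model' step
default_scoring = 'accuracy'

# Toy data standing in for a user-supplied CSV.
X, y = make_classification(n_samples=200, n_features=5, random_state=42)

# The training script builds a pipeline around the estimator and grid-searches it.
pipe = Pipeline([('scaler', StandardScaler()), ('model', estimator)])
search = GridSearchCV(pipe, param_grid, scoring=default_scoring, cv=3)
search.fit(X, y)
print(search.best_params_)
```

Because the tuning logic only depends on the `estimator`/`param_grid`/`default_scoring` names, swapping in a different model file requires no changes elsewhere.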
models/supervised/classification/adaboost_classifier.py ADDED
@@ -0,0 +1,25 @@
+
+ """
+ AdaBoost Classifier setup.
+
+ Features:
+ - Uses `AdaBoostClassifier` wrapping a weak learner (by default a decision stump, i.e. `DecisionTreeClassifier(max_depth=1)`).
+ - Handles binary and multi-class tasks natively via the SAMME algorithm.
+ - Default scoring: 'accuracy'.
+ """
+
+ from sklearn.ensemble import AdaBoostClassifier
+
+ estimator = AdaBoostClassifier(random_state=42)
+
+ param_grid = {
+     'model__n_estimators': [100],
+     'model__learning_rate': [0.5, 1.0],
+     'model__algorithm': ['SAMME'],
+     # Preprocessing params
+     #'preprocessor__num__imputer__strategy': ['mean', 'median'],
+     #'preprocessor__num__scaler__with_mean': [True, False],
+     #'preprocessor__num__scaler__with_std': [True, False],
+ }
+
+ default_scoring = 'accuracy'
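The `model__` prefix used throughout these grids follows scikit-learn's `Pipeline` parameter-routing convention: a key of the form `step__param` targets the parameter `param` of the pipeline step named `model`. A minimal, self-contained sketch of that mechanism (synthetic data, not part of this repo; the project's real pipeline also contains a `preprocessor` step):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Tiny synthetic dataset stands in for the real training data.
X, y = make_classification(n_samples=60, n_features=5, random_state=42)

# A pipeline whose final step is named 'model', matching the grids above.
pipe = Pipeline([('model', AdaBoostClassifier(random_state=42))])

# 'model__n_estimators' is routed by the pipeline to the AdaBoost step.
grid = GridSearchCV(pipe, {'model__n_estimators': [10, 20]},
                    cv=3, scoring='accuracy')
grid.fit(X, y)
print(grid.best_params_)
```

The same naming scheme is what allows the commented-out `preprocessor__num__...` entries in these files to be tuned alongside the model once uncommented.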
models/supervised/classification/catboost_classifier.py ADDED
@@ -0,0 +1,28 @@
+
+ """
+ CatBoost Classifier setup.
+
+ Features:
+ - Uses `CatBoostClassifier`.
+ - Handles categorical features natively, but this project still relies on the pipeline's encoding.
+ - Good for both binary and multi-class tasks.
+ - Default scoring: 'accuracy'.
+
+ Requires `catboost` to be installed.
+ """
+
+ from catboost import CatBoostClassifier
+
+ estimator = CatBoostClassifier(verbose=0, random_state=42)
+
+ param_grid = {
+     'model__iterations': [100],
+     'model__depth': [3, 5],
+     'model__learning_rate': [0.01, 0.1],
+     # Preprocessing params
+     #'preprocessor__num__imputer__strategy': ['mean', 'median'],
+     #'preprocessor__num__scaler__with_mean': [True, False],
+     #'preprocessor__num__scaler__with_std': [True, False],
+ }
+
+ default_scoring = 'accuracy'
models/supervised/classification/decision_tree_classifier.py ADDED
@@ -0,0 +1,31 @@
+
+ """
+ This module sets up a Decision Tree Classifier for hyperparameter tuning.
+
+ Features:
+ - Uses `DecisionTreeClassifier` from scikit-learn.
+ - Defines a parameter grid suitable for both binary and multi-class classification.
+ - Default scoring: 'accuracy'.
+
+ Considerations:
+ - `criterion`, `max_depth`, `min_samples_split`, and `min_samples_leaf` are common parameters to tune.
+ - The pipeline code may choose ordinal encoding for tree-based models; that decision is made there, not in this module.
+ """
+
+ from sklearn.tree import DecisionTreeClassifier
+
+ estimator = DecisionTreeClassifier(random_state=42)
+
+ param_grid = {
+     'model__criterion': ['gini', 'entropy'],
+     'model__max_depth': [None, 5, 10],
+     'model__min_samples_split': [2, 5],
+     'model__min_samples_leaf': [1, 2],
+     # Preprocessing params
+     #'preprocessor__num__imputer__strategy': ['mean', 'median'],
+     #'preprocessor__num__scaler__with_mean': [True, False],
+     #'preprocessor__num__scaler__with_std': [True, False],
+ }
+
+ default_scoring = 'accuracy'
models/supervised/classification/extra_trees_classifier.py ADDED
@@ -0,0 +1,26 @@
+
+ """
+ Extra Trees Classifier setup.
+
+ Features:
+ - Uses `ExtraTreesClassifier`.
+ - Similar to Random Forest, but with more randomness in the splits.
+ - Works well for both binary and multi-class tasks.
+ """
+
+ from sklearn.ensemble import ExtraTreesClassifier
+
+ estimator = ExtraTreesClassifier(random_state=42)
+
+ param_grid = {
+     'model__n_estimators': [100],
+     'model__max_depth': [None, 10],
+     'model__min_samples_split': [2, 5],
+     'model__min_samples_leaf': [1],
+     # Preprocessing params
+     #'preprocessor__num__imputer__strategy': ['mean', 'median'],
+     #'preprocessor__num__scaler__with_mean': [True, False],
+     #'preprocessor__num__scaler__with_std': [True, False],
+ }
+
+ default_scoring = 'accuracy'
models/supervised/classification/gaussian_nb.py ADDED
@@ -0,0 +1,26 @@
+
+ """
+ Gaussian Naive Bayes Classifier setup.
+
+ Features:
+ - Uses `GaussianNB`.
+ - Suitable for binary and multi-class tasks.
+ - Default scoring: 'accuracy'.
+
+ Considerations:
+ - `var_smoothing` is often the only parameter worth tuning.
+ """
+
+ from sklearn.naive_bayes import GaussianNB
+
+ estimator = GaussianNB()
+
+ param_grid = {
+     'model__var_smoothing': [1e-1, 1e-3, 1e-5, 1e-7, 1e-9],
+     # Preprocessing params
+     #'preprocessor__num__imputer__strategy': ['mean', 'median'],
+     #'preprocessor__num__scaler__with_mean': [True, False],
+     #'preprocessor__num__scaler__with_std': [True, False],
+ }
+
+ default_scoring = 'accuracy'
models/supervised/classification/gradient_boosting_classifier.py ADDED
@@ -0,0 +1,25 @@
+
+ """
+ Gradient Boosting Classifier setup.
+
+ Features:
+ - Uses `GradientBoostingClassifier`.
+ - Well suited to binary and multi-class tasks.
+ - Default scoring: 'accuracy'.
+ """
+
+ from sklearn.ensemble import GradientBoostingClassifier
+
+ estimator = GradientBoostingClassifier(random_state=42)
+
+ param_grid = {
+     'model__n_estimators': [100],
+     'model__learning_rate': [0.01, 0.1],
+     'model__max_depth': [3],
+     # Preprocessing params
+     #'preprocessor__num__imputer__strategy': ['mean', 'median'],
+     #'preprocessor__num__scaler__with_mean': [True, False],
+     #'preprocessor__num__scaler__with_std': [True, False],
+ }
+
+ default_scoring = 'accuracy'
models/supervised/classification/knn_classifier.py ADDED
@@ -0,0 +1,28 @@
+
+ """
+ K-Nearest Neighbors Classifier setup.
+
+ Features:
+ - Uses `KNeighborsClassifier`.
+ - Works for binary and multi-class tasks.
+ - Default scoring: 'accuracy'.
+
+ Considerations:
+ - `n_neighbors`, `weights`, and `p` (Minkowski distance) are common parameters to tune.
+ """
+
+ from sklearn.neighbors import KNeighborsClassifier
+
+ estimator = KNeighborsClassifier()
+
+ param_grid = {
+     'model__n_neighbors': [3, 5],   # Reduced to two neighbor options
+     'model__weights': ['uniform'],  # Focused on one weighting strategy
+     'model__p': [2],                # Fixed to Euclidean distance
+     # Preprocessing params
+     #'preprocessor__num__imputer__strategy': ['mean'],
+     #'preprocessor__num__scaler__with_mean': [True],
+     #'preprocessor__num__scaler__with_std': [True],
+ }
+
+ default_scoring = 'accuracy'
models/supervised/classification/lightgbm_classifier.py ADDED
@@ -0,0 +1,27 @@
+
+ """
+ LightGBM Classifier setup.
+
+ Features:
+ - Uses `LGBMClassifier`.
+ - Fast and efficient for binary and multi-class tasks.
+ - Default scoring: 'accuracy'.
+
+ Requires `lightgbm` to be installed.
+ """
+
+ from lightgbm import LGBMClassifier
+
+ estimator = LGBMClassifier(verbose=-1, random_state=42)
+
+ param_grid = {
+     'model__n_estimators': [100],
+     'model__num_leaves': [31, 63],
+     'model__learning_rate': [0.01, 0.1],
+     # Preprocessing params
+     #'preprocessor__num__imputer__strategy': ['mean', 'median'],
+     #'preprocessor__num__scaler__with_mean': [True, False],
+     #'preprocessor__num__scaler__with_std': [True, False],
+ }
+
+ default_scoring = 'accuracy'
models/supervised/classification/linear_discriminant_analysis.py ADDED
@@ -0,0 +1,28 @@
+
+ """
+ Linear Discriminant Analysis (LDA) Classifier setup.
+
+ Features:
+ - Uses `LinearDiscriminantAnalysis`.
+ - Works for binary and multi-class tasks.
+ - Default scoring: 'accuracy'.
+
+ Considerations:
+ - `solver` can be tuned.
+ - The 'lsqr' and 'eigen' solvers also accept a `shrinkage` parameter.
+ """
+
+ from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
+
+ estimator = LinearDiscriminantAnalysis()
+
+ param_grid = {
+     'model__solver': ['svd', 'lsqr'],
+     # If solver='lsqr', the shrinkage parameter can also be tuned if needed
+     # Preprocessing params
+     #'preprocessor__num__imputer__strategy': ['mean', 'median'],
+     #'preprocessor__num__scaler__with_mean': [True, False],
+     #'preprocessor__num__scaler__with_std': [True, False],
+ }
+
+ default_scoring = 'accuracy'
models/supervised/classification/logistic_regression.py ADDED
@@ -0,0 +1,36 @@
+
+ """
+ This module sets up a Logistic Regression classifier for hyperparameter tuning.
+
+ Features:
+ - Uses `LogisticRegression` from scikit-learn.
+ - Defines a hyperparameter grid for both preprocessing and model parameters.
+ - Suitable for binary and multi-class classification (with the 'lbfgs' solver, recent scikit-learn fits a single multinomial model for multi-class targets).
+ - Default scoring: 'accuracy', which works for both binary and multi-class tasks.
+
+ Considerations:
+ - `C` is the inverse of the regularization strength.
+ - `penalty='l2'` is the common choice.
+ - More solvers or penalties can be added as needed.
+ """
+
+ from sklearn.linear_model import LogisticRegression
+
+ # Define the estimator
+ estimator = LogisticRegression()
+
+ # Define the hyperparameter grid
+ param_grid = {
+     # Model parameters
+     'model__C': [0.01, 0.1, 1.0, 10.0],  # Inverse regularization strength
+     'model__penalty': ['l2'],            # The 'lbfgs' solver supports only L2 regularization (or none)
+     'model__solver': ['lbfgs'],          # Efficient solver for large datasets
+     'model__max_iter': [1000],           # Control convergence
+     # Preprocessing parameters for numerical features
+     #'preprocessor__num__imputer__strategy': ['mean', 'median'],
+     #'preprocessor__num__scaler__with_mean': [True, False],
+     #'preprocessor__num__scaler__with_std': [True, False],
+ }
+
+ # Optional: Default scoring metric for classification
+ default_scoring = 'accuracy'
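As a quick illustration of the multi-class behavior (an aside, not repo code): fitting `LogisticRegression` with the lbfgs solver on a three-class dataset yields one coefficient row per class, because recent scikit-learn fits a single multinomial model rather than one-vs-rest.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Three-class toy problem: 150 samples, 4 features, 3 classes.
X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000).fit(X, y)

# One coefficient row per class: shape (n_classes, n_features).
print(clf.coef_.shape)
```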
models/supervised/classification/mlp_classifier.py ADDED
@@ -0,0 +1,31 @@
+
+ """
+ MLP Classifier setup.
+
+ Features:
+ - Uses `MLPClassifier`.
+ - Suitable for binary and multi-class classification.
+ - Default scoring: 'accuracy'.
+
+ Considerations:
+ - `hidden_layer_sizes`, `alpha` (L2 regularization), and `learning_rate_init` are common parameters.
+ - Increase `max_iter` if convergence warnings appear.
+ """
+
+ from sklearn.neural_network import MLPClassifier
+
+ # Define the estimator
+ estimator = MLPClassifier(max_iter=200, random_state=42)
+
+ # Define the hyperparameter grid
+ param_grid = {
+     'model__hidden_layer_sizes': [(50,)],  # Small hidden layer for faster training
+     'model__alpha': [0.001],               # Commonly effective value
+     'model__learning_rate_init': [0.001],  # Single typical value for faster tuning
+     # Uncomment and customize preprocessing params if needed
+     #'preprocessor__num__imputer__strategy': ['mean'],
+     #'preprocessor__num__scaler__with_mean': [True],
+     #'preprocessor__num__scaler__with_std': [True],
+ }
+
+ default_scoring = 'accuracy'
models/supervised/classification/quadratic_discriminant_analysis.py ADDED
@@ -0,0 +1,26 @@
+
+ """
+ Quadratic Discriminant Analysis (QDA) Classifier setup.
+
+ Features:
+ - Uses `QuadraticDiscriminantAnalysis`.
+ - Works for binary and multi-class tasks.
+ - Default scoring: 'accuracy'.
+
+ Considerations:
+ - `reg_param` can be tuned to control regularization.
+ """
+
+ from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
+
+ estimator = QuadraticDiscriminantAnalysis()
+
+ param_grid = {
+     'model__reg_param': [0.0, 0.1, 0.5],
+     # Preprocessing params
+     #'preprocessor__num__imputer__strategy': ['mean', 'median'],
+     #'preprocessor__num__scaler__with_mean': [True, False],
+     #'preprocessor__num__scaler__with_std': [True, False],
+ }
+
+ default_scoring = 'accuracy'
models/supervised/classification/random_forest_classifier.py ADDED
@@ -0,0 +1,26 @@
+
+ """
+ Random Forest Classifier setup.
+
+ Features:
+ - Uses `RandomForestClassifier` from scikit-learn.
+ - Good general-purpose model for binary and multi-class tasks.
+ - Default scoring: 'accuracy'.
+ """
+
+ from sklearn.ensemble import RandomForestClassifier
+
+ estimator = RandomForestClassifier(random_state=42)
+
+ param_grid = {
+     'model__n_estimators': [100],
+     'model__max_depth': [None, 10],
+     'model__min_samples_split': [2, 5],
+     'model__min_samples_leaf': [1],
+     # Preprocessing params
+     #'preprocessor__num__imputer__strategy': ['mean', 'median'],
+     #'preprocessor__num__scaler__with_mean': [True, False],
+     #'preprocessor__num__scaler__with_std': [True, False],
+ }
+
+ default_scoring = 'accuracy'
models/supervised/classification/svc.py ADDED
@@ -0,0 +1,29 @@
+
+ """
+ Support Vector Classifier setup.
+
+ Features:
+ - Uses `SVC` from scikit-learn.
+ - Handles binary classification naturally; multi-class is handled internally with a one-vs-one scheme.
+ - Default scoring: 'accuracy'.
+
+ Considerations:
+ - `C` and `kernel` are the key parameters.
+ - If `kernel='rbf'`, also tune `gamma`.
+ """
+
+ from sklearn.svm import SVC
+
+ estimator = SVC(random_state=42)
+
+ param_grid = {
+     'model__C': [0.1, 1.0],       # Reduced range
+     'model__kernel': ['linear'],  # Focused on the linear kernel
+     'model__gamma': ['scale'],    # Fixed gamma to one option
+     # Preprocessing params
+     #'preprocessor__num__imputer__strategy': ['mean'],
+     #'preprocessor__num__scaler__with_mean': [True],
+     #'preprocessor__num__scaler__with_std': [True],
+ }
+
+ default_scoring = 'accuracy'
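The one-vs-one scheme `SVC` uses internally can be observed directly (illustrative aside, not repo code): for `k` classes it trains `k * (k - 1) / 2` pairwise classifiers, and `decision_function` with `decision_function_shape='ovo'` returns one score per class pair.

```python
from sklearn.datasets import load_iris
from sklearn.svm import SVC

# Three classes -> 3 * (3 - 1) / 2 = 3 pairwise classifiers.
X, y = load_iris(return_X_y=True)
clf = SVC(kernel='linear', decision_function_shape='ovo',
          random_state=42).fit(X, y)

# One decision score per class pair for a single sample.
print(clf.decision_function(X[:1]).shape)
```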
models/supervised/classification/xgboost_classifier.py ADDED
@@ -0,0 +1,27 @@
+
+ """
+ XGBoost Classifier setup.
+
+ Features:
+ - Uses `XGBClassifier` from the xgboost library.
+ - Excellent performance for binary and multi-class tasks.
+ - Default scoring: 'accuracy'.
+
+ Note: Ensure `xgboost` is installed.
+ """
+
+ from xgboost import XGBClassifier
+
+ estimator = XGBClassifier(eval_metric='logloss', random_state=42)
+
+ param_grid = {
+     'model__n_estimators': [100],
+     'model__max_depth': [3, 5],
+     'model__learning_rate': [0.01, 0.1],
+     # Preprocessing params
+     #'preprocessor__num__imputer__strategy': ['mean', 'median'],
+     #'preprocessor__num__scaler__with_mean': [True, False],
+     #'preprocessor__num__scaler__with_std': [True, False],
+ }
+
+ default_scoring = 'accuracy'
requirements.txt ADDED
@@ -0,0 +1,12 @@
+ pandas==2.2.2
+ numpy==1.26.4
+ matplotlib==3.8.0
+ seaborn==0.13.2
+ kaggle==1.6.17
+ scikit-learn==1.5.2
+ catboost==1.2.7
+ dask[dataframe]==2024.10.0
+ xgboost==2.1.2
+ lightgbm==4.5.0
+ joblib==1.4.2
+ gradio==5.7.1
scripts/README.md ADDED
@@ -0,0 +1,99 @@
+ # Scripts
+
+ This directory contains executable scripts for training, testing, and other tasks related to model development and evaluation.
+
+ ## Contents
+
+ - [`train_regression_model.py`](#train_regression_modelpy)
+ - [`train_classification_model.py`](#train_classification_modelpy)
+
+ ### `train_regression_model.py`
+
+ A script for training supervised learning **regression** models using scikit-learn. It handles data loading, preprocessing, optional log transformation, hyperparameter tuning, model evaluation, and saving of models, metrics, and visualizations.
+
+ #### Features
+
+ - Supports various regression models defined in `models/supervised/regression`.
+ - Performs hyperparameter tuning using grid search cross-validation.
+ - Saves trained models and evaluation metrics.
+ - Generates visualizations if specified.
+
+ #### Usage
+
+ ```bash
+ python train_regression_model.py --model_module MODEL_MODULE \
+     --data_path DATA_PATH/DATA_NAME.csv \
+     --target_variable TARGET_VARIABLE [OPTIONS]
+ ```
+
+ - **Required Arguments:**
+   - `model_module`: Name of the regression model module to import (e.g., `linear_regression`).
+   - `data_path`: Path to the dataset directory, including the data file name.
+   - `target_variable`: Name of the target variable.
+
+ - **Optional Arguments:**
+   - `test_size`: Proportion of the dataset to include in the test split (default: `0.2`).
+   - `random_state`: Random seed for reproducibility (default: `42`).
+   - `log_transform`: Apply log transformation to the target variable (regression only).
+   - `cv_folds`: Number of cross-validation folds (default: `5`).
+   - `scoring_metric`: Scoring metric for model evaluation.
+   - `model_path`: Path to save the trained model.
+   - `results_path`: Path to save results and metrics.
+   - `visualize`: Generate and save visualizations.
+   - `drop_columns`: Comma-separated column names to drop from the dataset.
+
+ #### Usage Example
+
+ ```bash
+ python train_regression_model.py --model_module linear_regression \
+     --data_path data/house_prices/train.csv \
+     --target_variable SalePrice --drop_columns Id \
+     --log_transform --visualize
+ ```
+
+ ---
+
+ ### `train_classification_model.py`
+
+ A script for training supervised learning **classification** models using scikit-learn. It handles data loading, preprocessing, hyperparameter tuning (via grid search CV), model evaluation using classification metrics, and saving of models, metrics, and visualizations.
+
+ #### Features
+
+ - Supports various classification models defined in `models/supervised/classification`.
+ - Performs hyperparameter tuning using grid search cross-validation (via `classification_hyperparameter_tuning`).
+ - Saves trained models and evaluation metrics (accuracy, precision, recall, F1).
+ - If `visualize` is enabled, generates a metrics bar chart and a confusion matrix plot.
+
+ #### Usage
+
+ ```bash
+ python train_classification_model.py --model_module MODEL_MODULE \
+     --data_path DATA_PATH/DATA_NAME.csv \
+     --target_variable TARGET_VARIABLE [OPTIONS]
+ ```
+
+ - **Required Arguments:**
+   - `model_module`: Name of the classification model module to import (e.g., `logistic_regression`).
+   - `data_path`: Path to the dataset directory, including the data file name.
+   - `target_variable`: Name of the target variable (categorical).
+
+ - **Optional Arguments:**
+   - `test_size`: Proportion of the dataset to include in the test split (default: `0.2`).
+   - `random_state`: Random seed for reproducibility (default: `42`).
+   - `cv_folds`: Number of cross-validation folds (default: `5`).
+   - `scoring_metric`: Scoring metric for model evaluation (e.g., `accuracy`, `f1`, `roc_auc`).
+   - `model_path`: Path to save the trained model.
+   - `results_path`: Path to save results and metrics.
+   - `visualize`: Generate and save visualizations.
+   - `drop_columns`: Comma-separated column names to drop from the dataset.
+
+ #### Usage Example
+
+ ```bash
+ python train_classification_model.py --model_module logistic_regression \
+     --data_path data/adult_income/train.csv \
+     --target_variable income_bracket \
+     --scoring_metric accuracy --visualize
+ ```
scripts/train_classification_model.py ADDED
@@ -0,0 +1,203 @@
+
+ """
+ This script trains classification models using scikit-learn.
+ It handles data loading, preprocessing, hyperparameter tuning,
+ model evaluation with classification metrics, and saving of models,
+ metrics, and visualizations.
+
+ Usage:
+     python train_classification_model.py --model_module MODEL_MODULE --data_path DATA_PATH/DATA_NAME.csv
+         --target_variable TARGET_VARIABLE
+
+ Optional arguments:
+     --test_size TEST_SIZE
+     --random_state RANDOM_STATE
+     --cv_folds CV_FOLDS
+     --scoring_metric SCORING_METRIC
+     --model_path MODEL_PATH
+     --results_path RESULTS_PATH
+     --visualize
+     --drop_columns COLUMN_NAMES
+
+ Example:
+     python train_classification_model.py --model_module logistic_regression
+         --data_path data/adult_income/train.csv
+         --target_variable income_bracket --drop_columns Id
+         --scoring_metric accuracy --visualize
+ """
+
+ import os
+ import sys
+ import argparse
+ import importlib
+ import pandas as pd
+ import numpy as np
+ import matplotlib.pyplot as plt
+ import seaborn as sns
+ from sklearn.model_selection import train_test_split
+ from sklearn.metrics import (accuracy_score, precision_score, recall_score, f1_score,
+                              confusion_matrix, ConfusionMatrixDisplay)
+ import joblib
+ from timeit import default_timer as timer
+
+ def main(args):
+     # Change to the root directory of the project
+     project_root = os.path.abspath(os.path.join(os.path.dirname(__file__), ".."))
+     os.chdir(project_root)
+     sys.path.insert(0, project_root)
+
+     # Import the hyperparameter tuning and the model modules
+     from utils.supervised_hyperparameter_tuning import classification_hyperparameter_tuning
+     model_module_path = f"models.supervised.classification.{args.model_module}"
+     model_module = importlib.import_module(model_module_path)
+
+     # Get the model estimator, parameter grid, and scoring metric
+     estimator = model_module.estimator
+     param_grid = model_module.param_grid
+     scoring_metric = args.scoring_metric or getattr(model_module, 'default_scoring', 'accuracy')
+     model_name = estimator.__class__.__name__
+
+     # Set default paths if not provided
+     args.model_path = args.model_path or os.path.join('saved_models', model_name)
+     args.results_path = args.results_path or os.path.join('results', model_name)
+     os.makedirs(args.results_path, exist_ok=True)
+
+     # Load the dataset
+     df = pd.read_csv(os.path.join(args.data_path))
+
+     # Drop specified columns
+     if args.drop_columns:
+         columns_to_drop = args.drop_columns.split(',')
+         df = df.drop(columns=columns_to_drop)
+
+     # Define target variable and features
+     target_variable = args.target_variable
+     X = df.drop(columns=[target_variable])
+     y = df[target_variable]
+
+     # Sanity check: a numeric target with many unique values suggests a
+     # regression problem rather than classification. Numeric class labels
+     # are fine; the estimator will handle them.
+     if np.issubdtype(y.dtype, np.number) and len(np.unique(y)) > 20:
+         print(f"Warning: The target variable '{target_variable}' has many unique numeric values. Ensure this is truly a classification problem.")
+
+     # Encode target variable if not numeric
+     if y.dtype == 'object' or not np.issubdtype(y.dtype, np.number):
+         from sklearn.preprocessing import LabelEncoder
+         le = LabelEncoder()
+         y = le.fit_transform(y)
+
+         # Save the label encoder so predictions can be decoded later
+         os.makedirs(args.model_path, exist_ok=True)
+         joblib.dump(le, os.path.join(args.model_path, 'label_encoder.pkl'))
+         print("LabelEncoder applied to target variable. Classes:", le.classes_)
+
+     # Split the data
+     X_train, X_test, y_train, y_test = train_test_split(
+         X, y, test_size=args.test_size, random_state=args.random_state)
+
+     # Start the timer
+     start_time = timer()
+
+     # Perform hyperparameter tuning (classification)
+     best_model, best_params = classification_hyperparameter_tuning(
+         X_train, y_train, estimator, param_grid,
+         cv=args.cv_folds, scoring=scoring_metric)
+
+     # End the timer and compute the elapsed training time
+     end_time = timer()
+     train_time = end_time - start_time
+
+     # Evaluate the best model on the test set
+     y_pred = best_model.predict(X_test)
+
+     # Calculate classification metrics
+     accuracy = accuracy_score(y_test, y_pred)
+     precision = precision_score(y_test, y_pred, average='weighted', zero_division=0)
+     recall = recall_score(y_test, y_pred, average='weighted', zero_division=0)
+     f1 = f1_score(y_test, y_pred, average='weighted', zero_division=0)
+
+     print(f"\n{model_name} Classification Metrics on Test Set:")
+     print(f"- Accuracy: {accuracy:.4f}")
+     print(f"- Precision: {precision:.4f}")
+     print(f"- Recall: {recall:.4f}")
+     print(f"- F1 Score: {f1:.4f}")
+     print(f"- Training Time: {train_time:.4f} seconds")
+
+     # Save the trained model
+     model_output_path = os.path.join(args.model_path, 'best_model.pkl')
+     os.makedirs(args.model_path, exist_ok=True)
+     joblib.dump(best_model, model_output_path)
+     print(f"Trained model saved to {model_output_path}")
+
+     # Save metrics to CSV
+     metrics = {
+         'Accuracy': [accuracy],
+         'Precision': [precision],
+         'Recall': [recall],
+         'F1 Score': [f1],
+         'train_time': [train_time]
+     }
+     results_df = pd.DataFrame(metrics)
+     results_df.to_csv(os.path.join(args.results_path, 'metrics.csv'), index=False)
+     print(f"\nMetrics saved to {os.path.join(args.results_path, 'metrics.csv')}")
+
+     if args.visualize:
+         # Plot classification metrics (training time is excluded from the chart)
+         plt.figure(figsize=(8, 6))
+         metric_names = ['Accuracy', 'Precision', 'Recall', 'F1 Score']
+         metric_values = [accuracy, precision, recall, f1]
+         plt.bar(metric_names, metric_values, color='skyblue', alpha=0.8)
+         plt.ylim(0, 1)
+         plt.xlabel('Metrics')
+         plt.ylabel('Scores')
+         plt.title('Classification Metrics')
+         plt.savefig(os.path.join(args.results_path, 'classification_metrics.png'))
+         plt.show()
+         print(f"Visualization saved to {os.path.join(args.results_path, 'classification_metrics.png')}")
+
+         # Display and save the confusion matrix
+         conf_matrix = confusion_matrix(y_test, y_pred)
+         disp = ConfusionMatrixDisplay(confusion_matrix=conf_matrix)
+         disp.plot(cmap=plt.cm.Blues, values_format='d')
+         plt.title(f'{model_name} Confusion Matrix')
+         conf_matrix_path = os.path.join(args.results_path, 'confusion_matrix.png')
+         plt.savefig(conf_matrix_path)
+         plt.show()
+         print(f"Confusion matrix saved to {conf_matrix_path}")
+
+ if __name__ == "__main__":
+     parser = argparse.ArgumentParser(description="Train a classification model.")
+     # Model module argument
+     parser.add_argument('--model_module', type=str, required=True,
+                         help='Name of the classification model module to import.')
+     # Data arguments
+     parser.add_argument('--data_path', type=str, required=True,
+                         help='Path to the dataset file including data name.')
+     parser.add_argument('--target_variable', type=str, required=True,
+                         help='Name of the target variable (categorical).')
+     parser.add_argument('--drop_columns', type=str, default='',
+                         help='Columns to drop from the dataset.')
+     # Model arguments
+     parser.add_argument('--test_size', type=float, default=0.2,
+                         help='Proportion for test split.')
+     parser.add_argument('--random_state', type=int, default=42,
+                         help='Random seed.')
+     parser.add_argument('--cv_folds', type=int, default=5,
+                         help='Number of cross-validation folds.')
+     parser.add_argument('--scoring_metric', type=str, default=None,
+                         help='Scoring metric for model evaluation (e.g., accuracy, f1, roc_auc).')
+     # Output arguments
+     parser.add_argument('--model_path', type=str, default=None,
+                         help='Path to save the trained model.')
+     parser.add_argument('--results_path', type=str, default=None,
+                         help='Path to save results and metrics.')
+     parser.add_argument('--visualize', action='store_true',
+                         help='Generate and save visualizations (classification metrics chart and confusion matrix).')
+
+     args = parser.parse_args()
+     main(args)
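Because the script persists `label_encoder.pkl` alongside the model, downstream inference code can decode integer predictions back to the original string labels. A hypothetical sketch of that round trip (the class names and the temporary path here are illustrative only, not taken from the repo):

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Stand-in for the encoder the script fits on the target column;
# the labels are made up for illustration.
le = LabelEncoder().fit(['<=50K', '>50K'])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'label_encoder.pkl')
    joblib.dump(le, path)             # what the script does at save time
    le_loaded = joblib.load(path)     # what inference code would do
    # Integer model outputs map back to the original labels.
    decoded = le_loaded.inverse_transform(np.array([0, 1, 0]))

print(list(decoded))
```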
utils/README.md ADDED
@@ -0,0 +1,77 @@
+ # Utils
+
+ This directory contains utility scripts and helper functions used throughout the project. They provide common functionality such as data preprocessing, hyperparameter tuning, and other support functions for model training and evaluation, for both regression and classification tasks.
+
+ ## Contents
+
+ - [`supervised_hyperparameter_tuning.py`](#supervised_hyperparameter_tuningpy)
+
+ ### `supervised_hyperparameter_tuning.py`
+
+ This script contains functions for performing hyperparameter tuning on supervised learning models (both regression and classification) using scikit-learn's `Pipeline` and `GridSearchCV`.
+
+ #### Functions
+
+ - **`regression_hyperparameter_tuning(X, y, estimator, param_grid, cv=5, scoring=None)`**
+
+   Performs hyperparameter tuning for regression models.
+
+   **Parameters:**
+   - `X`: Feature matrix (pd.DataFrame).
+   - `y`: Numeric target variable (pd.Series).
+   - `estimator`: A scikit-learn regressor (e.g., `LinearRegression()`).
+   - `param_grid`: Dict with parameter names and lists of values.
+   - `cv`: Number of cross-validation folds (default 5).
+   - `scoring`: Scoring metric (e.g., 'neg_root_mean_squared_error').
+
+   **Returns:**
+   - `best_model`: Pipeline with the best found hyperparameters.
+   - `best_params`: Dictionary of the best hyperparameters.
+
+ - **`classification_hyperparameter_tuning(X, y, estimator, param_grid, cv=5, scoring=None)`**
+
+   Performs hyperparameter tuning for classification models.
+
+   **Parameters:**
+   - `X`: Feature matrix (pd.DataFrame).
+   - `y`: Target variable for classification (pd.Series), binary or multi-class.
+   - `estimator`: A scikit-learn classifier (e.g., `LogisticRegression()`, `RandomForestClassifier()`).
+   - `param_grid`: Dict with parameter names and lists of values.
+   - `cv`: Number of cross-validation folds (default 5).
+   - `scoring`: Scoring metric (e.g., 'accuracy', 'f1', 'f1_macro', 'roc_auc').
+
+   **Returns:**
+   - `best_model`: Pipeline with the best found hyperparameters.
+   - `best_params`: Dictionary of the best hyperparameters.
+
+ #### Usage Examples
+
+ **Regression Example:**
+ ```python
+ from utils.supervised_hyperparameter_tuning import regression_hyperparameter_tuning
+ from sklearn.linear_model import LinearRegression
+
+ X = ...  # Your regression features
+ y = ...  # Your numeric target variable
+ param_grid = {
+     'model__fit_intercept': [True, False]
+     # Add other parameters if needed
+ }
+
+ best_model, best_params = regression_hyperparameter_tuning(X, y, LinearRegression(), param_grid, scoring='neg_root_mean_squared_error')
+ ```
+
+ **Classification Example (Binary or Multi-Class):**
+ ```python
+ from utils.supervised_hyperparameter_tuning import classification_hyperparameter_tuning
+ from sklearn.ensemble import RandomForestClassifier
+
+ X = ...  # Your classification features
+ y = ...  # Your categorical target variable (binary or multi-class)
+ param_grid = {
+     'model__n_estimators': [100, 200],
+     'model__max_depth': [None, 10]
+ }
+
+ best_model, best_params = classification_hyperparameter_tuning(X, y, RandomForestClassifier(), param_grid, scoring='accuracy')
+ ```
utils/supervised_hyperparameter_tuning.py ADDED
@@ -0,0 +1,213 @@
+ """
+ This module provides functions for hyperparameter tuning with preprocessing using scikit-learn's GridSearchCV
+ for both regression and classification tasks.
+
+ Features:
+ - Handles numerical and categorical preprocessing using pipelines.
+ - Automates hyperparameter tuning for any scikit-learn estimator.
+ - Uses GridSearchCV for cross-validation and hyperparameter search.
+ - Applies algorithm-specific preprocessing when necessary (e.g., ordinal encoding for tree-based models).
+
+ Functions:
+ - regression_hyperparameter_tuning: For regression models.
+ - classification_hyperparameter_tuning: For classification models.
+
+ Example Usage (Regression):
+     from sklearn.ensemble import RandomForestRegressor
+     from supervised_hyperparameter_tuning import regression_hyperparameter_tuning
+
+     X = ...  # Your feature DataFrame
+     y = ...  # Your numeric target variable
+     param_grid = {
+         'model__n_estimators': [100, 200],
+         'model__max_depth': [None, 10]
+     }
+     best_model, best_params = regression_hyperparameter_tuning(X, y, RandomForestRegressor(), param_grid, cv=5, scoring='neg_mean_squared_error')
+
+ Example Usage (Classification):
+     from sklearn.ensemble import RandomForestClassifier
+     from supervised_hyperparameter_tuning import classification_hyperparameter_tuning
+
+     X = ...  # Your feature DataFrame
+     y = ...  # Your target variable (categorical)
+     param_grid = {
+         'model__n_estimators': [100, 200],
+         'model__max_depth': [None, 10]
+     }
+     best_model, best_params = classification_hyperparameter_tuning(X, y, RandomForestClassifier(), param_grid, cv=5, scoring='accuracy')
+ """
+
+ from sklearn.compose import ColumnTransformer
+ from sklearn.impute import SimpleImputer
+ from sklearn.pipeline import Pipeline
+ from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
+ from sklearn.model_selection import GridSearchCV, KFold, StratifiedKFold
+
+ def regression_hyperparameter_tuning(X, y, estimator, param_grid, cv=5, scoring=None):
+     """
+     Performs hyperparameter tuning for a given regression model using GridSearchCV with preprocessing.
+
+     Args:
+         X (pd.DataFrame): Features.
+         y (pd.Series): Target variable.
+         estimator: The scikit-learn regressor to use (e.g., LinearRegression(), RandomForestRegressor()).
+         param_grid (dict): Hyperparameter grid for GridSearchCV.
+         cv (int or cross-validation generator): Number of cross-validation folds or a cross-validation generator.
+         scoring (str or None): Scoring metric to use.
+
+     Returns:
+         best_model (Pipeline): Best model within a pipeline from GridSearch.
+         best_params (dict): Best hyperparameters.
+     """
+     # Identify numerical and categorical columns ('number' covers all numeric dtypes, not only int64/float64)
+     numerical_cols = X.select_dtypes(include=['number']).columns.tolist()
+     categorical_cols = X.select_dtypes(include=['object', 'category']).columns.tolist()
+
+     # Define preprocessing for numerical data
+     numerical_transformer = Pipeline(steps=[
+         ('imputer', SimpleImputer(strategy='median')),
+         ('scaler', StandardScaler())
+     ])
+
+     # Conditional preprocessing for categorical data
+     estimator_name = estimator.__class__.__name__
+
+     if estimator_name in [
+         'DecisionTreeRegressor', 'RandomForestRegressor', 'ExtraTreesRegressor',
+         'GradientBoostingRegressor', 'XGBRegressor', 'LGBMRegressor', 'CatBoostRegressor'
+     ]:
+         # Use Ordinal Encoding for tree-based models
+         categorical_transformer = Pipeline(steps=[
+             ('imputer', SimpleImputer(strategy='constant', fill_value='Missing')),
+             ('ordinal_encoder', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1))
+         ])
+     else:
+         # Use OneHotEncoder for other models
+         categorical_transformer = Pipeline(steps=[
+             ('imputer', SimpleImputer(strategy='constant', fill_value='Missing')),
+             ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
+         ])
+
+     # Create preprocessing pipeline
+     preprocessor = ColumnTransformer(
+         transformers=[
+             ('num', numerical_transformer, numerical_cols),
+             ('cat', categorical_transformer, categorical_cols)
+         ]
+     )
+
+     # Create a pipeline that combines preprocessing and the estimator
+     pipeline = Pipeline(steps=[
+         ('preprocessor', preprocessor),
+         ('model', estimator)
+     ])
+
+     # Define cross-validation strategy
+     if isinstance(cv, int):
+         cv = KFold(n_splits=cv, shuffle=True, random_state=42)
+
+     # Initialize GridSearchCV
+     grid_search = GridSearchCV(
+         estimator=pipeline,
+         param_grid=param_grid,
+         cv=cv,
+         scoring=scoring,
+         n_jobs=-1
+     )
+
+     # Perform Grid Search
+     grid_search.fit(X, y)
+
+     # Get the best model and parameters
+     best_model = grid_search.best_estimator_
+     best_params = grid_search.best_params_
+
+     print(f"Best Hyperparameters for {estimator_name}:")
+     for param_name in sorted(best_params.keys()):
+         print(f"{param_name}: {best_params[param_name]}")
+
+     return best_model, best_params
+
+ def classification_hyperparameter_tuning(X, y, estimator, param_grid, cv=5, scoring=None):
+     """
+     Performs hyperparameter tuning for a given classification model using GridSearchCV with preprocessing.
+
+     This function is similar to the regression one but adapted for classification tasks. It can handle both
+     binary and multi-class classification. The choice of scoring metric (e.g., 'accuracy', 'f1', 'f1_macro', 'roc_auc')
+     determines how the model is evaluated, but the pipeline structure remains the same.
+
+     Args:
+         X (pd.DataFrame): Features.
+         y (pd.Series): Target variable (categorical) for classification (can be binary or multi-class).
+         estimator: The scikit-learn classifier to use (e.g., LogisticRegression(), RandomForestClassifier()).
+         param_grid (dict): Hyperparameter grid for GridSearchCV.
+         cv (int or cross-validation generator): Number of folds (expanded to a StratifiedKFold) or a CV generator.
+         scoring (str or None): Scoring metric (e.g., 'accuracy' for binary or multi-class, 'f1_macro' for multi-class).
+
+     Returns:
+         best_model (Pipeline): Best model within a pipeline from GridSearch.
+         best_params (dict): Best hyperparameters.
+     """
+     # Identify numerical and categorical columns ('number' covers all numeric dtypes, not only int64/float64)
+     numerical_cols = X.select_dtypes(include=['number']).columns.tolist()
+     categorical_cols = X.select_dtypes(include=['object', 'category']).columns.tolist()
+
+     # Define preprocessing for numerical data
+     numerical_transformer = Pipeline(steps=[
+         ('imputer', SimpleImputer(strategy='median')),
+         ('scaler', StandardScaler())
+     ])
+
+     # Determine encoding strategy based on model type (tree-based vs. others)
+     estimator_name = estimator.__class__.__name__
+     tree_based_classifiers = [
+         'DecisionTreeClassifier', 'RandomForestClassifier', 'ExtraTreesClassifier',
+         'GradientBoostingClassifier', 'XGBClassifier', 'LGBMClassifier', 'CatBoostClassifier'
+     ]
+
+     if estimator_name in tree_based_classifiers:
+         categorical_transformer = Pipeline(steps=[
+             ('imputer', SimpleImputer(strategy='constant', fill_value='Missing')),
+             ('ordinal_encoder', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1))
+         ])
+     else:
+         categorical_transformer = Pipeline(steps=[
+             ('imputer', SimpleImputer(strategy='constant', fill_value='Missing')),
+             ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
+         ])
+
+     # Create preprocessing pipeline
+     preprocessor = ColumnTransformer(transformers=[
+         ('num', numerical_transformer, numerical_cols),
+         ('cat', categorical_transformer, categorical_cols)
+     ])
+
+     # Combine preprocessing and estimator in a pipeline
+     pipeline = Pipeline(steps=[
+         ('preprocessor', preprocessor),
+         ('model', estimator)
+     ])
+
+     # Define cross-validation strategy (stratified, so class proportions are preserved in each fold)
+     if isinstance(cv, int):
+         cv = StratifiedKFold(n_splits=cv, shuffle=True, random_state=42)
+
+     # GridSearchCV for classification
+     grid_search = GridSearchCV(
+         estimator=pipeline,
+         param_grid=param_grid,
+         cv=cv,
+         scoring=scoring,
+         n_jobs=-1
+     )
+
+     grid_search.fit(X, y)
+     best_model = grid_search.best_estimator_
+     best_params = grid_search.best_params_
+
+     print(f"Best Hyperparameters for {estimator_name}:")
+     for param_name in sorted(best_params.keys()):
+         print(f"{param_name}: {best_params[param_name]}")
+
+     return best_model, best_params