# Scripts

This directory contains executable scripts for training, testing, and other tasks related to model development and evaluation.

## Contents

Supervised Learning:
- [train_regression_model.py](#train_regression_modelpy)
- [train_classification_model.py](#train_classification_modelpy)

Unsupervised Learning:
- [train_clustering_model.py](#train_clustering_modelpy)
- [train_dimred_model.py](#train_dimred_modelpy)
- [train_anomaly_detection.py](#train_anomaly_detectionpy)

---

## `train_regression_model.py`

A script for training supervised learning **regression** models using scikit-learn. It handles data loading, preprocessing, optional log transformation, hyperparameter tuning, model evaluation, and saving of models, metrics, and visualizations.

### Features

- Supports various regression models defined in `models/supervised/regression`.
- Performs hyperparameter tuning using grid search cross-validation.
- Saves trained models and evaluation metrics.
- Generates visualizations if specified.
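
The model lookup by name can be sketched as follows. This is a minimal illustration, assuming the script resolves `--model_module` with `importlib` under the package path named above; the helper names `model_import_path` and `load_model_module` are hypothetical and not taken from the script.

```python
import importlib

# Assumed package prefix, derived from the directory named in this README.
MODEL_PACKAGE = "models.supervised.regression"


def model_import_path(model_module: str, package: str = MODEL_PACKAGE) -> str:
    """Build the dotted import path for a model module name."""
    # Reject anything that is not a plain identifier (e.g. "a.b" or "../x").
    if not model_module.isidentifier():
        raise ValueError(f"invalid module name: {model_module!r}")
    return f"{package}.{model_module}"


def load_model_module(model_module: str, package: str = MODEL_PACKAGE):
    """Import the module; raises ModuleNotFoundError if it does not exist."""
    return importlib.import_module(model_import_path(model_module, package))


print(model_import_path("linear_regression"))
```

Validating the name before importing keeps a mistyped or malicious `--model_module` value from reaching arbitrary packages.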

### Usage

```bash
python train_regression_model.py --model_module MODEL_MODULE \
    --data_path DATA_PATH/DATA_NAME.csv \
    --target_variable TARGET_VARIABLE [OPTIONS]
```

**Required Arguments**:
- `model_module`: Name of the regression model module to import (e.g., `linear_regression`).
- `data_path`: Path to the dataset file, including the directory and file name (e.g., `data/house_prices/train.csv`).
- `target_variable`: Name of the target variable.

**Optional Arguments**:
- `test_size`: Proportion of the dataset to include in the test split (default: `0.2`).
- `random_state`: Random seed for reproducibility (default: `42`).
- `log_transform`: Apply a log transformation to the target variable.
- `cv_folds`: Number of cross-validation folds (default: `5`).
- `scoring_metric`: Scoring metric for model evaluation.
- `model_path`: Path to save the trained model.
- `results_path`: Path to save results and metrics.
- `visualize`: Generate and save visualizations (e.g., scatter plots such as actual vs. predicted).
- `drop_columns`: Comma-separated column names to drop from the dataset.
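
For orientation, the arguments listed above could be declared with `argparse` roughly as below. This is a sketch rather than the script's actual parser: the defaults for `test_size`, `random_state`, and `cv_folds` come from this README, while treating `log_transform` and `visualize` as boolean flags and defaulting the remaining options to `None` or an empty string are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented regression arguments."""
    parser = argparse.ArgumentParser(description="Train a regression model (sketch).")
    # Required arguments
    parser.add_argument("--model_module", required=True)
    parser.add_argument("--data_path", required=True)
    parser.add_argument("--target_variable", required=True)
    # Optional arguments with the defaults documented in this README
    parser.add_argument("--test_size", type=float, default=0.2)
    parser.add_argument("--random_state", type=int, default=42)
    parser.add_argument("--cv_folds", type=int, default=5)
    # Assumed to be on/off flags, since the usage example passes them bare
    parser.add_argument("--log_transform", action="store_true")
    parser.add_argument("--visualize", action="store_true")
    # Defaults below are assumptions; the README does not state them
    parser.add_argument("--scoring_metric", default=None)
    parser.add_argument("--model_path", default=None)
    parser.add_argument("--results_path", default=None)
    parser.add_argument("--drop_columns", default="")
    return parser


args = build_parser().parse_args([
    "--model_module", "linear_regression",
    "--data_path", "data/house_prices/train.csv",
    "--target_variable", "SalePrice",
    "--drop_columns", "Id",
    "--log_transform",
])
print(args.test_size, args.cv_folds, args.log_transform, args.visualize)
```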

### Usage Example

```bash
python train_regression_model.py --model_module linear_regression \
    --data_path data/house_prices/train.csv \
    --target_variable SalePrice --drop_columns Id \
    --log_transform --visualize
```
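
The effect of `--log_transform` on a target such as `SalePrice` can be illustrated with a round trip. The sketch assumes a `log1p`/`expm1` pair, which is a common choice for non-negative targets; this README does not say which log variant the script applies, and the sample values are made up for illustration.

```python
import math


def log_transform(values):
    """Compress a skewed, non-negative target with log(1 + y)."""
    return [math.log1p(v) for v in values]


def inverse_log_transform(values):
    """Map model outputs back to the original scale with exp(y) - 1."""
    return [math.expm1(v) for v in values]


# Illustrative target values, not taken from any dataset.
prices = [0.0, 129500.0, 208500.0]
transformed = log_transform(prices)
restored = inverse_log_transform(transformed)
print(transformed)
print(restored)
```

Predictions made on the log scale have to pass through the inverse before metrics are reported in the target's original units.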

---

## `train_classification_model.py`

A script for training supervised learning **classification** models using scikit-learn. It handles data loading, preprocessing, hyperparameter tuning (via grid search CV), model evaluation using classification metrics, and saving of models, metrics, and visualizations.

### Features

- Supports various classification models defined in `models/supervised/classification`.
- Performs hyperparameter tuning using grid search cross-validation (via `classification_hyperparameter_tuning`).
- Saves trained models and evaluation metrics (accuracy, precision, recall, F1).
- If `visualize` is enabled, it generates a metrics bar chart and a confusion matrix plot.
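
The four saved metrics are related through the confusion-matrix counts. The following standard-library sketch shows the binary case; the script itself presumably relies on scikit-learn's metric functions, and `binary_metrics` is a hypothetical helper with made-up labels.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 for a binary problem."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    # Fall back to 0.0 when a denominator is zero (no predicted/actual positives).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Illustrative labels only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```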

### Usage

```bash
python train_classification_model.py --model_module MODEL_MODULE \
    --data_path DATA_PATH/DATA_NAME.csv \
    --target_variable TARGET_VARIABLE [OPTIONS]
```

**Required Arguments**:
- `model_module`: Name of the classification model module to import (e.g., `logistic_regression`).
- `data_path`: Path to the dataset file, including the directory and file name (e.g., `data/adult_income/train.csv`).
- `target_variable`: Name of the target variable (categorical).

**Optional Arguments**:
- `test_size`: Proportion of the dataset to include in the test split (default: `0.2`).
- `random_state`: Random seed for reproducibility (default: `42`).
- `cv_folds`: Number of cross-validation folds (default: `5`).
- `scoring_metric`: Scoring metric for model evaluation (e.g., `accuracy`, `f1`, `roc_auc`).
- `model_path`: Path to save the trained model.
- `results_path`: Path to save results and metrics.
- `visualize`: Generate and save visualizations (metrics bar chart, confusion matrix).
- `drop_columns`: Comma-separated column names to drop from the dataset.

### Usage Example

```bash
python train_classification_model.py --model_module logistic_regression \
    --data_path data/adult_income/train.csv \
    --target_variable income_bracket \
    --scoring_metric accuracy --visualize
```

---

## `train_clustering_model.py`

A script for training unsupervised **clustering** models (K-Means, DBSCAN, Gaussian Mixture, etc.). It supports data loading, optionally dropping or selecting columns, label encoding of non-numeric features, optional silhouette-based hyperparameter tuning, saving the final model, and generating a 2D cluster plot on request.

### Features

- Supports various clustering models defined in `models/unsupervised/clustering`.
- Optional hyperparameter tuning (silhouette score) via `clustering_hyperparameter_tuning`.
- Saves the trained clustering model and optional silhouette metrics.
- Generates a 2D scatter plot if `visualize` is enabled (using PCA if needed).
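
Silhouette-based tuning compares candidate settings by how well separated the resulting clusters are. The sketch below computes the mean silhouette coefficient with the standard library only, using Euclidean distance; it illustrates the metric and is not the code in `clustering_hyperparameter_tuning`, which presumably uses scikit-learn. The points and labels are made up.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient using Euclidean distance."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(point, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(point, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_score(points, [0, 0, 1, 1])  # tight, well-separated clusters
bad = silhouette_score(points, [0, 1, 0, 1])   # clusters that mix both groups
print(good, bad)
```

A tuning loop would fit the model once per candidate hyperparameter set and keep the one with the highest score.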

### Usage

```bash
python train_clustering_model.py --model_module MODEL_MODULE \
    --data_path DATA_PATH/DATA_NAME.csv [OPTIONS]
```

**Key Arguments**:
- `model_module`: Name of the clustering model module (e.g., `kmeans`, `dbscan`, `gaussian_mixture`).
- `data_path`: Path to the CSV dataset.

**Optional Arguments**:
- `drop_columns`: Comma-separated column names to drop.
- `select_columns`: Comma-separated column names to keep.
- `tune`: If set, performs silhouette-based hyperparameter tuning.
- `cv_folds`: Number of folds, or number of repeated runs, used during silhouette-based tuning (a basic approach).
- `scoring_metric`: Typically `'silhouette'`.
- `visualize`: If set, attempts a 2D scatter plot, projecting with PCA if more than two features remain.
- `model_path`: Path to save the trained model.
- `results_path`: Path to save results (metrics, plots).
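
How `drop_columns` and `select_columns` combine can be sketched with plain lists. The order used here (select first, then drop) is an assumption, as are the helper names and the illustrative header; the script may resolve the two options differently.

```python
def parse_columns(arg):
    """Split a comma-separated option value into trimmed column names."""
    if not arg:
        return []
    return [name.strip() for name in arg.split(",") if name.strip()]


def choose_columns(header, drop_columns="", select_columns=""):
    """Return the columns that remain after selecting and then dropping."""
    select = parse_columns(select_columns)
    drop = set(parse_columns(drop_columns))
    unknown = [c for c in select + sorted(drop) if c not in header]
    if unknown:
        raise KeyError(f"columns not found in dataset: {unknown}")
    # An empty selection means "keep everything".
    kept = [c for c in header if not select or c in select]
    return [c for c in kept if c not in drop]


# Illustrative header; check the real file for its exact column names.
header = ["CustomerID", "Gender", "Age", "Annual Income (k$)", "Spending Score (1-100)"]
features = choose_columns(
    header,
    drop_columns="Gender",
    select_columns="Annual Income (k$),Spending Score (1-100)",
)
print(features)
```

Splitting on commas means a column whose own name contains a comma cannot be expressed this way.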

### Usage Example

```bash
python train_clustering_model.py \
  --model_module kmeans \
  --data_path data/mall_customer/Mall_Customers.csv \
  --drop_columns "Gender" \
  --select_columns "Annual Income (k$),Spending Score (1-100)" \
  --visualize
```

---

## `train_dimred_model.py`

A script for **dimensionality reduction** tasks (e.g., PCA, t-SNE, UMAP). It loads data, optionally drops or selects columns, label-encodes categorical features, fits the chosen dimensionality reduction model, saves the transformed data, and can visualize 2D/3D outputs.

### Features

- Supports various dimensionality reduction models in `models/unsupervised/dimred`.
- Saves the fitted model and the transformed data (in CSV).
- Optionally creates a 2D or 3D scatter plot if the output dimension is 2 or 3.
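
Label encoding, which this script applies to categorical features before fitting, maps each distinct value to an integer. A standard-library sketch follows; it assigns codes in sorted order, which matches how scikit-learn's `LabelEncoder` orders its classes, but whether the script uses that encoder is an assumption, and `label_encode` is a hypothetical helper.

```python
def label_encode(values):
    """Encode categorical values as integers, with codes in sorted order."""
    classes = sorted(set(values))
    mapping = {value: code for code, value in enumerate(classes)}
    return [mapping[v] for v in values], classes


# Illustrative categorical column.
codes, classes = label_encode(["M", "B", "B", "M"])
print(codes, classes)
```

The integer codes imply an ordering that the categories do not have, which distance-based methods such as PCA or t-SNE will treat as real.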

### Usage

```bash
python train_dimred_model.py --model_module MODEL_MODULE \
    --data_path DATA_PATH/DATA_NAME.csv [OPTIONS]
```

**Key Arguments**:
- `model_module`: Name of the dimensionality reduction module (e.g., `pca`, `tsne`, `umap`).
- `data_path`: Path to the CSV dataset.

**Optional Arguments**:
- `drop_columns`: Comma-separated column names to drop.
- `select_columns`: Comma-separated column names to keep.
- `visualize`: If set, plots the 2D or 3D embedding.
- `model_path`: Path to save the trained model.
- `results_path`: Path to save the transformed data and any plots.

### Usage Example

```bash
python train_dimred_model.py \
  --model_module pca \
  --data_path data/breast_cancer/data.csv \
  --drop_columns "id,diagnosis" \
  --visualize
```

---

## `train_anomaly_detection.py`

A script for training **anomaly/outlier detection** models (Isolation Forest, One-Class SVM, etc.). It supports dropping/selecting columns, label-encoding, saving anomaly predictions (0 = normal, 1 = outlier), and optionally visualizing points in 2D with outliers colored differently.

### Features

- Supports various anomaly models in `models/unsupervised/anomaly`.
- Saves the model and an outlier predictions CSV.
- If `visualize` is enabled, projects the data to 2D with PCA and plots normal points vs. outliers.
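
scikit-learn outlier detectors return `+1` for inliers and `-1` for outliers from `predict`, so producing the 0 = normal, 1 = outlier convention described above requires a small mapping step. The sketch below shows one way to do it; `to_outlier_flags` is a hypothetical helper and the input values are illustrative.

```python
def to_outlier_flags(raw_predictions):
    """Map scikit-learn style predictions (+1 inlier, -1 outlier) to 0/1 flags."""
    flags = []
    for value in raw_predictions:
        if value not in (-1, 1):
            raise ValueError(f"unexpected prediction value: {value!r}")
        flags.append(1 if value == -1 else 0)
    return flags


# Illustrative predictions, e.g. as returned by a fitted detector's predict().
flags = to_outlier_flags([1, 1, -1, 1, -1])
print(flags)
```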

### Usage

```bash
python train_anomaly_detection.py --model_module MODEL_MODULE \
    --data_path DATA_PATH/DATA_NAME.csv [OPTIONS]
```

**Key Arguments**:
- `model_module`: Name of the anomaly detection module (e.g., `isolation_forest`, `one_class_svm`, `local_outlier_factor`).
- `data_path`: Path to the CSV dataset.

**Optional Arguments**:
- `drop_columns`: Comma-separated column names to drop.
- `select_columns`: Comma-separated column names to keep.
- `visualize`: If set, attempts a 2D scatter (via PCA) and colors outliers in red.
- `model_path`: Path to save the anomaly model.
- `results_path`: Path to save outlier predictions and plots.

### Usage Example

```bash
python train_anomaly_detection.py \
  --model_module isolation_forest \
  --data_path data/breast_cancer/data.csv \
  --drop_columns "id,diagnosis" \
  --visualize
```