File size: 4,025 Bytes
7c045bd
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
# Scripts

This directory contains executable scripts for training, testing, and other tasks related to model development and evaluation.

## Contents

- [`train_regression_model.py`](#train_regression_modelpy)
- [`train_classification_model.py`](#train_classification_modelpy)

### `train_regression_model.py`

A script for training supervised learning **regression** models using scikit-learn. It handles data loading, preprocessing, optional log transformation, hyperparameter tuning, model evaluation, and saving of models, metrics, and visualizations.

#### Features

- Supports various regression models defined in `models/supervised/regression`.
- Performs hyperparameter tuning using grid search cross-validation.
- Saves trained models and evaluation metrics.
- Generates visualizations if specified.

#### Usage

```bash
python train_regression_model.py --model_module MODEL_MODULE \
    --data_path DATA_PATH/DATA_NAME.csv \
    --target_variable TARGET_VARIABLE [OPTIONS]

```

- **Required Arguments:**
- `model_module`: Name of the regression model module to import (e.g., `linear_regression`).
- `data_path`: Path to the dataset directory, including the data file name.
- `target_variable`: Name of the target variable.

- **Optional Arguments:**
- `test_size`: Proportion of the dataset to include in the test split (default: `0.2`).
- `random_state`: Random seed for reproducibility (default: `42`).
- `log_transform`: Apply log transformation to the target variable (regression only).
- `cv_folds`: Number of cross-validation folds (default: `5`).
- `scoring_metric`: Scoring metric for model evaluation.
- `model_path`: Path to save the trained model.
- `results_path`: Path to save results and metrics.
- `visualize`: Generate and save visualizations.
- `drop_columns`: Comma-separated column names to drop from the dataset.

#### Usage Example

```bash
python train_regression_model.py --model_module linear_regression \
    --data_path data/house_prices/train.csv \
    --target_variable SalePrice --drop_columns Id \
    --log_transform --visualize
```

---

### `train_classification_model.py`

A script for training supervised learning **classification** models using scikit-learn. It handles data loading, preprocessing, hyperparameter tuning (via grid search CV), model evaluation using classification metrics, and saving of models, metrics, and visualizations.

#### Features

- Supports various classification models defined in `models/supervised/classification`.
- Performs hyperparameter tuning using grid search cross-validation (via `classification_hyperparameter_tuning`).
- Saves trained models and evaluation metrics (accuracy, precision, recall, F1).
- If `visualize` is enabled, it generates a metrics bar chart and a confusion matrix plot.

#### Usage

```bash
python train_classification_model.py --model_module MODEL_MODULE \
    --data_path DATA_PATH/DATA_NAME.csv \
    --target_variable TARGET_VARIABLE [OPTIONS]

```

- **Required Arguments:**
- `model_module`: Name of the classification model module to import (e.g., `logistic_regression`).
- `data_path`: Path to the dataset directory, including the data file name.
- `target_variable`: Name of the target variable (categorical).

- **Optional Arguments:**
- `test_size`: Proportion of the dataset to include in the test split (default: `0.2`).
- `random_state`: Random seed for reproducibility (default: `42`).
- `cv_folds`: Number of cross-validation folds (default: `5`).
- `scoring_metric`: Scoring metric for model evaluation (e.g., `accuracy`, `f1`, `roc_auc`).
- `model_path`: Path to save the trained model.
- `results_path`: Path to save results and metrics.
- `visualize`: Generate and save visualizations.
- `drop_columns`: Comma-separated column names to drop from the dataset.

#### Usage Example

```bash
python train_classification_model.py --model_module logistic_regression \
    --data_path data/adult_income/train.csv \
    --target_variable income_bracket \
    --scoring_metric accuracy --visualize
```