# Scripts
This directory contains executable scripts for training, testing, and other tasks related to model development and evaluation.
## Contents
Supervised Learning:
- [train_regression_model.py](#train_regression_modelpy)
- [train_classification_model.py](#train_classification_modelpy)
Unsupervised Learning:
- [train_clustering_model.py](#train_clustering_modelpy)
- [train_dimred_model.py](#train_dimred_modelpy)
- [train_anomaly_detection.py](#train_anomaly_detectionpy)
---
## `train_regression_model.py`
A script for training supervised learning **regression** models using scikit-learn. It handles data loading, preprocessing, optional log transformation, hyperparameter tuning, model evaluation, and saving of models, metrics, and visualizations.
### Features
- Supports various regression models defined in `models/supervised/regression`.
- Performs hyperparameter tuning using grid search cross-validation.
- Saves trained models and evaluation metrics.
- Generates visualizations if specified.
### Usage
```bash
python train_regression_model.py --model_module MODEL_MODULE \
--data_path DATA_PATH/DATA_NAME.csv \
--target_variable TARGET_VARIABLE [OPTIONS]
```
**Required Arguments**:
- `model_module`: Name of the regression model module to import (e.g., `linear_regression`).
- `data_path`: Path to the dataset CSV file (directory plus file name).
- `target_variable`: Name of the target variable.
**Optional Arguments**:
- `test_size`: Proportion of the dataset to include in the test split (default: `0.2`).
- `random_state`: Random seed for reproducibility (default: `42`).
- `log_transform`: Apply log transformation to the target variable (regression only).
- `cv_folds`: Number of cross-validation folds (default: `5`).
- `scoring_metric`: Scoring metric for model evaluation.
- `model_path`: Path to save the trained model.
- `results_path`: Path to save results and metrics.
- `visualize`: Generate and save visualizations (e.g., an actual-vs-predicted scatter plot).
- `drop_columns`: Comma-separated column names to drop from the dataset.
### Usage Example
```bash
python train_regression_model.py --model_module linear_regression \
--data_path data/house_prices/train.csv \
--target_variable SalePrice --drop_columns Id \
--log_transform --visualize
```
---
## `train_classification_model.py`
A script for training supervised learning **classification** models using scikit-learn. It handles data loading, preprocessing, hyperparameter tuning (via grid search CV), model evaluation using classification metrics, and saving of models, metrics, and visualizations.
### Features
- Supports various classification models defined in `models/supervised/classification`.
- Performs hyperparameter tuning using grid search cross-validation (via `classification_hyperparameter_tuning`).
- Saves trained models and evaluation metrics (accuracy, precision, recall, F1).
- If `visualize` is enabled, it generates a metrics bar chart and a confusion matrix plot.
### Usage
```bash
python train_classification_model.py --model_module MODEL_MODULE \
--data_path DATA_PATH/DATA_NAME.csv \
--target_variable TARGET_VARIABLE [OPTIONS]
```
**Required Arguments**:
- `model_module`: Name of the classification model module to import (e.g., `logistic_regression`).
- `data_path`: Path to the dataset CSV file (directory plus file name).
- `target_variable`: Name of the target variable (categorical).
**Optional Arguments**:
- `test_size`: Proportion of the dataset to include in the test split (default: `0.2`).
- `random_state`: Random seed for reproducibility (default: `42`).
- `cv_folds`: Number of cross-validation folds (default: `5`).
- `scoring_metric`: Scoring metric for model evaluation (e.g., `accuracy`, `f1`, `roc_auc`).
- `model_path`: Path to save the trained model.
- `results_path`: Path to save results and metrics.
- `visualize`: Generate and save visualizations (metrics bar chart, confusion matrix).
- `drop_columns`: Comma-separated column names to drop from the dataset.
### Usage Example
```bash
python train_classification_model.py --model_module logistic_regression \
--data_path data/adult_income/train.csv \
--target_variable income_bracket \
--scoring_metric accuracy --visualize
```
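The metrics the script saves can be reproduced with scikit-learn directly. As a hedged sketch (synthetic data and `LogisticRegression` stand in for the selected model module):

```python
# Sketch of the classification workflow: grid-search a classifier, then
# compute the saved metrics (accuracy, precision, recall, F1) and the
# confusion matrix that backs the optional plot.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42)

search = GridSearchCV(LogisticRegression(max_iter=1000),
                      {"C": [0.1, 1.0, 10.0]}, cv=5, scoring="accuracy")
search.fit(X_tr, y_tr)
pred = search.predict(X_te)

metrics = {
    "accuracy": accuracy_score(y_te, pred),
    "precision": precision_score(y_te, pred),
    "recall": recall_score(y_te, pred),
    "f1": f1_score(y_te, pred),
}
cm = confusion_matrix(y_te, pred)  # basis for the confusion-matrix plot
print(metrics)
print(cm)
```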
---
## `train_clustering_model.py`
A script for training **clustering** models (K-Means, DBSCAN, Gaussian Mixture, etc.) in an unsupervised manner. It handles data loading, optional dropping or selecting of columns, label encoding of non-numeric features, optional silhouette-based hyperparameter tuning, saving of the final model, and, if requested, generation of a 2D cluster plot.
### Features
- Supports various clustering models defined in `models/unsupervised/clustering`.
- Optional hyperparameter tuning (silhouette score) via `clustering_hyperparameter_tuning`.
- Saves the trained clustering model and optional silhouette metrics.
- Generates a 2D scatter plot if `visualize` is enabled (using PCA if needed).
### Usage
```bash
python train_clustering_model.py --model_module MODEL_MODULE \
--data_path DATA_PATH/DATA_NAME.csv [OPTIONS]
```
**Key Arguments**:
- `model_module`: Name of the clustering model module (e.g., `kmeans`, `dbscan`, `gaussian_mixture`).
- `data_path`: Path to the CSV dataset.
**Optional Arguments**:
- `drop_columns`: Comma-separated column names to drop.
- `select_columns`: Comma-separated column names to keep.
- `tune`: If set, performs silhouette-based hyperparameter tuning.
- `cv_folds`: Number of repeated runs used for silhouette-based tuning (a basic approach).
- `scoring_metric`: Typically `'silhouette'`.
- `visualize`: If set, attempts a 2D scatter, using PCA if more than 2 features remain.
- `model_path`: Path to save the trained model.
- `results_path`: Path to save results (metrics, plots).
### Usage Example
```bash
python train_clustering_model.py \
--model_module kmeans \
--data_path data/mall_customer/Mall_Customers.csv \
--drop_columns "Gender" \
--select_columns "Annual Income (k$),Spending Score (1-100)" \
--visualize
```
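The silhouette-based tuning the script performs amounts to fitting the model over a small parameter grid and keeping the configuration with the best silhouette score. A minimal sketch, assuming K-Means and synthetic blob data:

```python
# Sketch of silhouette-based tuning: fit KMeans for several values of k
# and keep the k with the highest silhouette score.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

best_k, best_score = None, -1.0
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    score = silhouette_score(X, labels)  # higher = tighter, better-separated
    if score > best_score:
        best_k, best_score = k, score

print("best k:", best_k, "silhouette:", round(best_score, 3))
```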
---
## `train_dimred_model.py`
A script for **dimensionality reduction** tasks (e.g., PCA, t-SNE, UMAP). It loads data, optionally drops or selects columns, label-encodes categorical features, fits the chosen dimensionality reduction model, saves the transformed data, and can visualize 2D/3D outputs.
### Features
- Supports various dimension reduction models in `models/unsupervised/dimred`.
- Saves the fitted model and the transformed data (in CSV).
- Optionally creates a 2D or 3D scatter plot if the output dimension is 2 or 3.
### Usage
```bash
python train_dimred_model.py --model_module MODEL_MODULE \
--data_path DATA_PATH/DATA_NAME.csv [OPTIONS]
```
**Key Arguments**:
- `model_module`: Name of the dimension reduction module (e.g., `pca`, `tsne`, `umap`).
- `data_path`: Path to the CSV dataset.
**Optional Arguments**:
- `drop_columns`: Comma-separated column names to drop.
- `select_columns`: Comma-separated column names to keep.
- `visualize`: If set, plots the 2D or 3D embedding.
- `model_path`: Path to save the trained model.
- `results_path`: Path to save the transformed data and any plots.
### Usage Example
```bash
python train_dimred_model.py \
--model_module pca \
--data_path data/breast_cancer/data.csv \
--drop_columns "id,diagnosis" \
--visualize
```
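The transform-and-save step can be sketched as follows; the Iris dataset stands in for the CSV, and the output file name is illustrative:

```python
# Sketch of the dimensionality-reduction workflow: label-encode any
# categorical columns, fit PCA, and write the 2D embedding to CSV.
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import LabelEncoder

df = load_iris(as_frame=True).frame.drop(columns=["target"])  # stand-in CSV
for col in df.select_dtypes(exclude="number"):
    df[col] = LabelEncoder().fit_transform(df[col])

embedding = PCA(n_components=2, random_state=42).fit_transform(df)
out = pd.DataFrame(embedding, columns=["component_1", "component_2"])
out.to_csv("transformed.csv", index=False)  # what --results_path would hold
print(out.shape)
```

With two (or three) output components, the embedding can be scattered directly, which is what the `--visualize` flag does.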
---
## `train_anomaly_detection.py`
A script for training **anomaly/outlier detection** models (Isolation Forest, One-Class SVM, etc.). It supports dropping/selecting columns, label-encoding, saving anomaly predictions (0 = normal, 1 = outlier), and optionally visualizing points in 2D with outliers colored differently.
### Features
- Supports various anomaly models in `models/unsupervised/anomaly`.
- Saves the model and an outlier predictions CSV.
- If `visualize` is enabled, projects the data to 2D via PCA and plots normal points against outliers.
### Usage
```bash
python train_anomaly_detection.py --model_module MODEL_MODULE \
--data_path DATA_PATH/DATA_NAME.csv [OPTIONS]
```
**Key Arguments**:
- `model_module`: Name of the anomaly detection module (e.g., `isolation_forest`, `one_class_svm`, `local_outlier_factor`).
- `data_path`: Path to the CSV dataset.
**Optional Arguments**:
- `drop_columns`: Comma-separated column names to drop.
- `select_columns`: Comma-separated column names to keep.
- `visualize`: If set, attempts a 2D scatter (via PCA) and colors outliers in red.
- `model_path`: Path to save the anomaly model.
- `results_path`: Path to save outlier predictions and plots.
### Usage Example
```bash
python train_anomaly_detection.py \
--model_module isolation_forest \
--data_path data/breast_cancer/data.csv \
--drop_columns "id,diagnosis" \
--visualize
```
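The 0/1 prediction convention the script saves differs from scikit-learn's native output, which is worth illustrating. A sketch with Isolation Forest on synthetic data (the contamination rate and planted outliers are illustrative):

```python
# Sketch of the anomaly workflow: fit IsolationForest and map its native
# predictions (+1 inlier / -1 outlier) to the saved 0/1 convention
# (0 = normal, 1 = outlier).
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))
X[:5] += 8.0  # plant a few obvious outliers

model = IsolationForest(contamination=0.05, random_state=42).fit(X)
raw = model.predict(X)                  # +1 = inlier, -1 = outlier
outlier_flag = (raw == -1).astype(int)  # 0 = normal, 1 = outlier
print("flagged outliers:", outlier_flag.sum())
```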