Spaces:

mboukabous
/

train_unsupervised

Sleeping

App Files Files Community

train_unsupervised / scripts /README.md

mboukabous

first commit

4c91838 about 1 year ago

preview code

raw

history blame contribute delete

8.79 kB

	# Scripts

	This directory contains executable scripts for training, testing, and other tasks related to model development and evaluation.

	## Contents

	Supervised Learning:
	- [train_regression_model.py](#train_regression_modelpy)
	- [train_classification_model.py](#train_classification_modelpy)

	Unsupervised Learning:
	- [train_clustering_model.py](#train_clustering_modelpy)
	- [train_dimred_model.py](#train_dimred_modelpy)
	- [train_anomaly_detection.py](#train_anomaly_detectionpy)

	---

	## `train_regression_model.py`

	A script for training supervised learning regression models using scikit-learn. It handles data loading, preprocessing, optional log transformation, hyperparameter tuning, model evaluation, and saving of models, metrics, and visualizations.

	### Features

	- Supports various regression models defined in `models/supervised/regression`.
	- Performs hyperparameter tuning using grid search cross-validation.
	- Saves trained models and evaluation metrics.
	- Generates visualizations if specified.

	### Usage

	```bash
	python train_regression_model.py --model_module MODEL_MODULE \
	--data_path DATA_PATH/DATA_NAME.csv \
	--target_variable TARGET_VARIABLE [OPTIONS]
	```

	Required Arguments:
	- `model_module`: Name of the regression model module to import (e.g., `linear_regression`).
	- `data_path`: Path to the dataset directory, including the data file name.
	- `target_variable`: Name of the target variable.

	Optional Arguments:
	- `test_size`: Proportion of the dataset to include in the test split (default: `0.2`).
	- `random_state`: Random seed for reproducibility (default: `42`).
	- `log_transform`: Apply log transformation to the target variable (regression only).
	- `cv_folds`: Number of cross-validation folds (default: `5`).
	- `scoring_metric`: Scoring metric for model evaluation.
	- `model_path`: Path to save the trained model.
	- `results_path`: Path to save results and metrics.
	- `visualize`: Generate and save visualizations (e.g., scatter or actual vs. predicted).
	- `drop_columns`: Comma-separated column names to drop from the dataset.

	### Usage Example

	```bash
	python train_regression_model.py --model_module linear_regression \
	--data_path data/house_prices/train.csv \
	--target_variable SalePrice --drop_columns Id \
	--log_transform --visualize
	```

	---

	## `train_classification_model.py`

	A script for training supervised learning classification models using scikit-learn. It handles data loading, preprocessing, hyperparameter tuning (via grid search CV), model evaluation using classification metrics, and saving of models, metrics, and visualizations.

	### Features

	- Supports various classification models defined in `models/supervised/classification`.
	- Performs hyperparameter tuning using grid search cross-validation (via `classification_hyperparameter_tuning`).
	- Saves trained models and evaluation metrics (accuracy, precision, recall, F1).
	- If `visualize` is enabled, it generates a metrics bar chart and a confusion matrix plot.

	### Usage

	```bash
	python train_classification_model.py --model_module MODEL_MODULE \
	--data_path DATA_PATH/DATA_NAME.csv \
	--target_variable TARGET_VARIABLE [OPTIONS]
	```

	Required Arguments:
	- `model_module`: Name of the classification model module to import (e.g., `logistic_regression`).
	- `data_path`: Path to the dataset directory, including the data file name.
	- `target_variable`: Name of the target variable (categorical).

	Optional Arguments:
	- `test_size`: Proportion of the dataset to include in the test split (default: `0.2`).
	- `random_state`: Random seed for reproducibility (default: `42`).
	- `cv_folds`: Number of cross-validation folds (default: `5`).
	- `scoring_metric`: Scoring metric for model evaluation (e.g., `accuracy`, `f1`, `roc_auc`).
	- `model_path`: Path to save the trained model.
	- `results_path`: Path to save results and metrics.
	- `visualize`: Generate and save visualizations (metrics bar chart, confusion matrix).
	- `drop_columns`: Comma-separated column names to drop from the dataset.

	### Usage Example

	```bash
	python train_classification_model.py --model_module logistic_regression \
	--data_path data/adult_income/train.csv \
	--target_variable income_bracket \
	--scoring_metric accuracy --visualize
	```

	---

	## `train_clustering_model.py`

	A script for training clustering models (K-Means, DBSCAN, Gaussian Mixture, etc.) in an unsupervised manner. It supports data loading, optional drop/select of columns, label encoding for non-numeric features, optional hyperparameter tuning (silhouette-based), saving the final model, and generating a 2D cluster plot if needed.

	### Features

	- Supports various clustering models defined in `models/unsupervised/clustering`.
	- Optional hyperparameter tuning (silhouette score) via `clustering_hyperparameter_tuning`.
	- Saves the trained clustering model and optional silhouette metrics.
	- Generates a 2D scatter plot if `visualize` is enabled (using PCA if needed).

	### Usage

	```bash
	python train_clustering_model.py --model_module MODEL_MODULE \
	--data_path DATA_PATH/DATA_NAME.csv [OPTIONS]
	```

	Key Arguments:
	- `model_module`: Name of the clustering model module (e.g., `kmeans`, `dbscan`, `gaussian_mixture`).
	- `data_path`: Path to the CSV dataset.

	Optional Arguments:
	- `drop_columns`: Comma-separated column names to drop.
	- `select_columns`: Comma-separated column names to keep.
	- `tune`: If set, performs silhouette-based hyperparameter tuning.
	- `cv_folds`: Number of folds or times for silhouette-based repeated runs (basic approach).
	- `scoring_metric`: Typically `'silhouette'`.
	- `visualize`: If set, attempts a 2D scatter, using PCA if more than 2 features remain.
	- `model_path`: Path to save the trained model.
	- `results_path`: Path to save results (metrics, plots).

	### Usage Example

	```bash
	python train_clustering_model.py \
	--model_module kmeans \
	--data_path data/mall_customer/Mall_Customers.csv \
	--drop_columns "Gender" \
	--select_columns "Annual Income (k$),Spending Score (1-100)" \
	--visualize
	```

	---

	## `train_dimred_model.py`

	A script for dimensionality reduction tasks (e.g., PCA, t-SNE, UMAP). It loads data, optionally drops or selects columns, label-encodes categorical features, fits the chosen dimensionality reduction model, saves the transformed data, and can visualize 2D/3D outputs.

	### Features

	- Supports various dimension reduction models in `models/unsupervised/dimred`.
	- Saves the fitted model and the transformed data (in CSV).
	- Optionally creates a 2D or 3D scatter plot if the output dimension is 2 or 3.

	### Usage

	```bash
	python train_dimred_model.py --model_module MODEL_MODULE \
	--data_path DATA_PATH/DATA_NAME.csv [OPTIONS]
	```

	Key Arguments:
	- `model_module`: Name of the dimension reduction module (e.g., `pca`, `tsne`, `umap`).
	- `data_path`: Path to the CSV dataset.

	Optional Arguments:
	- `drop_columns`: Comma-separated column names to drop.
	- `select_columns`: Comma-separated column names to keep.
	- `visualize`: If set, plots the 2D or 3D embedding.
	- `model_path`: Path to save the trained model.
	- `results_path`: Path to save the transformed data and any plots.

	### Usage Example

	```bash
	python train_dimred_model.py \
	--model_module pca \
	--data_path data/breast_cancer/data.csv \
	--drop_columns "id,diagnosis" \
	--visualize
	```

	---

	## `train_anomaly_detection.py`

	A script for training anomaly/outlier detection models (Isolation Forest, One-Class SVM, etc.). It supports dropping/selecting columns, label-encoding, saving anomaly predictions (0 = normal, 1 = outlier), and optionally visualizing points in 2D with outliers colored differently.

	### Features

	- Supports various anomaly models in `models/unsupervised/anomaly`.
	- Saves the model and an outlier predictions CSV.
	- If `visualize` is enabled, performs PCA → 2D for plotting normal vs. outliers.

	### Usage

	```bash
	python train_anomaly_detection.py --model_module MODEL_MODULE \
	--data_path DATA_PATH/DATA_NAME.csv [OPTIONS]
	```

	Key Arguments:
	- `model_module`: Name of the anomaly detection module (e.g., `isolation_forest`, `one_class_svm`, `local_outlier_factor`).
	- `data_path`: Path to the CSV dataset.

	Optional Arguments:
	- `drop_columns`: Comma-separated column names to drop.
	- `select_columns`: Comma-separated column names to keep.
	- `visualize`: If set, attempts a 2D scatter (via PCA) and colors outliers in red.
	- `model_path`: Path to save the anomaly model.
	- `results_path`: Path to save outlier predictions and plots.

	### Usage Example

	```bash
	python train_anomaly_detection.py \
	--model_module isolation_forest \
	--data_path data/breast_cancer/data.csv \
	--drop_columns "id,diagnosis" \
	--visualize
	```