Scripts

This directory contains executable scripts for training, testing, and other tasks related to model development and evaluation.

Contents

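Supervised Learning:

  • train_regression_model.py
  • train_classification_model.py

Unsupervised Learning:

  • train_clustering_model.py
  • train_dimred_model.py
  • train_anomaly_detection.py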

train_regression_model.py

A script for training supervised learning regression models using scikit-learn. It handles data loading, preprocessing, optional log transformation, hyperparameter tuning, model evaluation, and saving of models, metrics, and visualizations.

Features

  • Supports various regression models defined in models/supervised/regression.
  • Performs hyperparameter tuning using grid search cross-validation.
  • Saves trained models and evaluation metrics.
  • Generates visualizations if specified.
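
The grid search step listed above can be sketched with scikit-learn's GridSearchCV. This is a minimal illustration, not the script's actual code: the Ridge estimator, the alpha grid, the scoring string, and the synthetic data are assumptions made for the example, while the 0.2 test split, seed 42, and 5 folds mirror the defaults documented for this script.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data standing in for a real CSV dataset (assumption)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Hypothetical hyperparameter grid; a real model module would supply its own
param_grid = {"alpha": [0.1, 1.0, 10.0]}

search = GridSearchCV(
    Ridge(),
    param_grid,
    cv=5,
    scoring="neg_root_mean_squared_error",
)
search.fit(X_train, y_train)

# The best estimator is refit on the full training split by default
test_r2 = search.best_estimator_.score(X_test, y_test)
print(search.best_params_, round(test_r2, 3))
```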

Usage

python train_regression_model.py --model_module MODEL_MODULE \
    --data_path DATA_PATH/DATA_NAME.csv \
    --target_variable TARGET_VARIABLE [OPTIONS]

Required Arguments:

  • model_module: Name of the regression model module to import (e.g., linear_regression).
  • data_path: Path to the dataset file, including its directory (e.g., data/house_prices/train.csv).
  • target_variable: Name of the target variable.
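
Since model_module names a module to import, one plausible way to resolve it is importlib. The sketch below is an assumption about the mechanism, not a copy of the script: the load_model_module helper is hypothetical, and the dotted package path is inferred from the models/supervised/regression directory mentioned above. A standard-library package stands in so the snippet runs anywhere.

```python
import importlib
from types import ModuleType


def load_model_module(package: str, model_module: str) -> ModuleType:
    """Import `<package>.<model_module>` and return the module object."""
    return importlib.import_module(f"{package}.{model_module}")


# Hypothetical call for this project (package path is an assumption):
#   module = load_model_module("models.supervised.regression", "linear_regression")

# Runnable stand-in using a standard-library package
module = load_model_module("json", "decoder")
print(module.__name__)
```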

Optional Arguments:

  • test_size: Proportion of the dataset to include in the test split (default: 0.2).
  • random_state: Random seed for reproducibility (default: 42).
  • log_transform: Apply a log transformation to the target variable.
  • cv_folds: Number of cross-validation folds (default: 5).
  • scoring_metric: Scoring metric for model evaluation.
  • model_path: Path to save the trained model.
  • results_path: Path to save results and metrics.
  • visualize: Generate and save visualizations (e.g., scatter or actual vs. predicted).
  • drop_columns: Comma-separated column names to drop from the dataset.
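
For orientation, the arguments above could be declared with argparse roughly as follows. This is a hypothetical reconstruction from the documented names and defaults, not the script's source. The None defaults for scoring_metric, model_path, and results_path are placeholders because their real defaults are not documented here, and parse_columns is an illustrative helper.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the arguments documented above."""
    parser = argparse.ArgumentParser(description="Train a regression model.")
    # Required arguments
    parser.add_argument("--model_module", required=True)
    parser.add_argument("--data_path", required=True)
    parser.add_argument("--target_variable", required=True)
    # Optional arguments with the documented defaults
    parser.add_argument("--test_size", type=float, default=0.2)
    parser.add_argument("--random_state", type=int, default=42)
    parser.add_argument("--cv_folds", type=int, default=5)
    # Defaults not documented; None is a placeholder assumption
    parser.add_argument("--scoring_metric", default=None)
    parser.add_argument("--model_path", default=None)
    parser.add_argument("--results_path", default=None)
    # Boolean switches, passed without a value
    parser.add_argument("--log_transform", action="store_true")
    parser.add_argument("--visualize", action="store_true")
    # Comma-separated list, split into column names after parsing
    parser.add_argument("--drop_columns", default="")
    return parser


def parse_columns(raw: str) -> list[str]:
    """Split a comma-separated string into stripped, non-empty column names."""
    return [name.strip() for name in raw.split(",") if name.strip()]


args = build_parser().parse_args([
    "--model_module", "linear_regression",
    "--data_path", "data/house_prices/train.csv",
    "--target_variable", "SalePrice",
    "--drop_columns", "Id",
    "--log_transform",
])
print(args.test_size, args.log_transform, parse_columns(args.drop_columns))
```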

Usage Example

python train_regression_model.py --model_module linear_regression \
    --data_path data/house_prices/train.csv \
    --target_variable SalePrice --drop_columns Id \
    --log_transform --visualize
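
The --log_transform flag used in this example trains on a log-scaled target. How the script applies the transform is not shown here, so the following is only one common way to do it, assuming log1p for the forward transform and expm1 for the inverse. scikit-learn's TransformedTargetRegressor handles both directions so predictions come back on the original scale. The data is synthetic.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X = rng.uniform(0.0, 1.0, size=(100, 2))
# Synthetic, strictly positive, right-skewed target (stand-in for a price column)
y = np.exp(1.0 + 2.0 * X[:, 0] + 0.5 * X[:, 1])

model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log1p,          # applied to y before fitting
    inverse_func=np.expm1,  # applied to predictions
)
model.fit(X, y)
preds = model.predict(X)  # already back on the original scale
print(preds[:3])
```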

train_classification_model.py

A script for training supervised learning classification models using scikit-learn. It handles data loading, preprocessing, hyperparameter tuning (via grid search CV), model evaluation using classification metrics, and saving of models, metrics, and visualizations.

Features

  • Supports various classification models defined in models/supervised/classification.
  • Performs hyperparameter tuning using grid search cross-validation (via classification_hyperparameter_tuning).
  • Saves trained models and evaluation metrics (accuracy, precision, recall, F1).
  • Generates a metrics bar chart and a confusion matrix plot if visualize is enabled.
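
The metrics named above can be computed with scikit-learn as in this small sketch. The labels are toy values, and the "weighted" averaging is an assumption, since the averaging mode used by the script is not stated here.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_recall_fscore_support,
)

# Toy binary labels (illustrative only)
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

accuracy = accuracy_score(y_true, y_pred)
# With an averaging mode set, the fourth return value (support) is None
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted"
)
# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)

print(accuracy, precision, recall, f1)
print(cm)
```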

Usage

python train_classification_model.py --model_module MODEL_MODULE \
    --data_path DATA_PATH/DATA_NAME.csv \
    --target_variable TARGET_VARIABLE [OPTIONS]

Required Arguments:

  • model_module: Name of the classification model module to import (e.g., logistic_regression).
  • data_path: Path to the dataset file, including its directory (e.g., data/adult_income/train.csv).
  • target_variable: Name of the target variable (categorical).

Optional Arguments:

  • test_size: Proportion of the dataset to include in the test split (default: 0.2).
  • random_state: Random seed for reproducibility (default: 42).
  • cv_folds: Number of cross-validation folds (default: 5).
  • scoring_metric: Scoring metric for model evaluation (e.g., accuracy, f1, roc_auc).
  • model_path: Path to save the trained model.
  • results_path: Path to save results and metrics.
  • visualize: Generate and save visualizations (metrics bar chart, confusion matrix).
  • drop_columns: Comma-separated column names to drop from the dataset.

Usage Example

python train_classification_model.py --model_module logistic_regression \
    --data_path data/adult_income/train.csv \
    --target_variable income_bracket \
    --scoring_metric accuracy --visualize

train_clustering_model.py

A script for training clustering models (K-Means, DBSCAN, Gaussian Mixture, etc.). It supports data loading, optional dropping or selection of columns, label encoding of non-numeric features, optional silhouette-based hyperparameter tuning, saving of the final model, and generation of a 2D cluster plot on request.
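
Label encoding of non-numeric features might look like the sketch below, which assumes pandas and scikit-learn's LabelEncoder. The toy DataFrame and its column names are illustrative, and the script's own encoding logic may differ.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame with one categorical and one numeric column (illustrative only)
df = pd.DataFrame({"Gender": ["Male", "Female", "Female"], "Age": [19, 21, 20]})

# Replace every non-numeric column with integer codes
for column in df.select_dtypes(exclude="number").columns:
    values = df[column].astype(str).to_numpy()
    df[column] = LabelEncoder().fit_transform(values)

print(df)
```

LabelEncoder assigns codes in sorted order of the class labels, so the codes carry no real ordering; distance-based clustering will still treat them as numeric.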

Features

  • Supports various clustering models defined in models/unsupervised/clustering.
  • Optional hyperparameter tuning (silhouette score) via clustering_hyperparameter_tuning.
  • Saves the trained clustering model and optional silhouette metrics.
  • Generates a 2D scatter plot if visualize is enabled (using PCA if needed).
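
Silhouette-based tuning, as mentioned above, can be sketched as a loop over candidate settings that keeps the one with the highest silhouette score. This is a minimal illustration on synthetic blobs, not the clustering_hyperparameter_tuning implementation; the candidate range for n_clusters is an assumption.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data standing in for a real dataset (assumption)
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=42)

best_k, best_score = None, -1.0
for k in range(2, 7):  # hypothetical candidate values for n_clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    score = silhouette_score(X, labels)  # ranges from -1 to 1, higher is better
    if score > best_score:
        best_k, best_score = k, score

print(best_k, round(best_score, 3))
```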

Usage

python train_clustering_model.py --model_module MODEL_MODULE \
    --data_path DATA_PATH/DATA_NAME.csv [OPTIONS]

Key Arguments:

  • model_module: Name of the clustering model module (e.g., kmeans, dbscan, gaussian_mixture).
  • data_path: Path to the CSV dataset.

Optional Arguments:

  • drop_columns: Comma-separated column names to drop.
  • select_columns: Comma-separated column names to keep.
  • tune: If set, performs silhouette-based hyperparameter tuning.
  • cv_folds: Number of folds, or repeated runs, used for silhouette-based tuning (basic approach).
  • scoring_metric: Typically 'silhouette'.
  • visualize: If set, attempts a 2D scatter, using PCA if more than 2 features remain.
  • model_path: Path to save the trained model.
  • results_path: Path to save results (metrics, plots).

Usage Example

python train_clustering_model.py \
  --model_module kmeans \
  --data_path data/mall_customer/Mall_Customers.csv \
  --drop_columns "Gender" \
  --select_columns "Annual Income (k$),Spending Score (1-100)" \
  --visualize

train_dimred_model.py

A script for dimensionality reduction tasks (e.g., PCA, t-SNE, UMAP). It loads data, optionally drops or selects columns, label-encodes categorical features, fits the chosen dimensionality reduction model, saves the transformed data, and can visualize 2D/3D outputs.

Features

  • Supports various dimensionality reduction models defined in models/unsupervised/dimred.
  • Saves the fitted model and the transformed data (as CSV).
  • Optionally creates a 2D or 3D scatter plot if the output dimension is 2 or 3.
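
A minimal sketch of the fit, transform, and save-to-CSV flow described above, using PCA. It assumes scikit-learn's bundled breast cancer dataset as a stand-in for a CSV file, adds a standardization step that the script may or may not perform, and uses hypothetical output column names and a temporary output path.

```python
import tempfile
from pathlib import Path

import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_breast_cancer().data  # 569 samples, 30 numeric features

# Standardizing before PCA is an assumption, not a documented step
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2, random_state=42)
embedding = pca.fit_transform(X_scaled)

# Hypothetical column names and output location
transformed = pd.DataFrame(embedding, columns=["component_1", "component_2"])
out_path = Path(tempfile.mkdtemp()) / "transformed.csv"
transformed.to_csv(out_path, index=False)

print(embedding.shape, pca.explained_variance_ratio_)
```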

Usage

python train_dimred_model.py --model_module MODEL_MODULE \
    --data_path DATA_PATH/DATA_NAME.csv [OPTIONS]

Key Arguments:

  • model_module: Name of the dimensionality reduction module (e.g., pca, tsne, umap).
  • data_path: Path to the CSV dataset.

Optional Arguments:

  • drop_columns: Comma-separated column names to drop.
  • select_columns: Comma-separated column names to keep.
  • visualize: If set, plots the 2D or 3D embedding.
  • model_path: Path to save the trained model.
  • results_path: Path to save the transformed data and any plots.

Usage Example

python train_dimred_model.py \
  --model_module pca \
  --data_path data/breast_cancer/data.csv \
  --drop_columns "id,diagnosis" \
  --visualize

train_anomaly_detection.py

A script for training anomaly/outlier detection models (Isolation Forest, One-Class SVM, etc.). It supports dropping/selecting columns, label-encoding, saving anomaly predictions (0 = normal, 1 = outlier), and optionally visualizing points in 2D with outliers colored differently.

Features

  • Supports various anomaly detection models defined in models/unsupervised/anomaly.
  • Saves the model and an outlier predictions CSV.
  • If visualize is enabled, projects the data to 2D with PCA to plot normal points vs. outliers.
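
scikit-learn's outlier detectors return 1 for inliers and -1 for outliers, so producing the 0 = normal, 1 = outlier convention described for this script requires a remapping step. The sketch below shows one way to do that with Isolation Forest. The synthetic data and the contamination value are assumptions, and the script's own mapping code may differ.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
inliers = rng.normal(0.0, 1.0, size=(200, 2))
outliers = rng.uniform(6.0, 8.0, size=(10, 2))  # far from the main cloud
X = np.vstack([inliers, outliers])

# contamination=0.05 is an illustrative choice, not a documented default
model = IsolationForest(contamination=0.05, random_state=42)
raw = model.fit_predict(X)  # scikit-learn convention: 1 = inlier, -1 = outlier

# Remap to the convention used in the predictions CSV: 0 = normal, 1 = outlier
is_outlier = (raw == -1).astype(int)

print(int(is_outlier.sum()), "points flagged as outliers")
```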

Usage

python train_anomaly_detection.py --model_module MODEL_MODULE \
    --data_path DATA_PATH/DATA_NAME.csv [OPTIONS]

Key Arguments:

  • model_module: Name of the anomaly detection module (e.g., isolation_forest, one_class_svm, local_outlier_factor).
  • data_path: Path to the CSV dataset.

Optional Arguments:

  • drop_columns: Comma-separated column names to drop.
  • select_columns: Comma-separated column names to keep.
  • visualize: If set, attempts a 2D scatter (via PCA) and colors outliers in red.
  • model_path: Path to save the anomaly model.
  • results_path: Path to save outlier predictions and plots.

Usage Example

python train_anomaly_detection.py \
  --model_module isolation_forest \
  --data_path data/breast_cancer/data.csv \
  --drop_columns "id,diagnosis" \
  --visualize