Utils
This directory contains utility scripts and helper functions used throughout the project. These scripts provide common functionality such as data preprocessing, hyperparameter tuning, and other support functions that assist in model training and evaluation for supervised (regression and classification) as well as unsupervised (clustering) tasks.
Contents
supervised_hyperparameter_tuning.py
This script contains functions for performing hyperparameter tuning on supervised learning models (both regression and classification) using scikit-learn's Pipeline and GridSearchCV.
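For context, a helper like this is typically built by wrapping the estimator in a `Pipeline` under the step name `model` (which is why the parameter grids below use the `model__` prefix) and handing it to `GridSearchCV`. The sketch below is an assumed implementation, not the project's actual code; the `StandardScaler` preprocessing step is also an assumption:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def supervised_tuning_sketch(X, y, estimator, param_grid, cv=5, scoring=None):
    """Minimal sketch: wrap the estimator in a Pipeline and grid-search it."""
    pipeline = Pipeline([
        ('scaler', StandardScaler()),  # assumed preprocessing step
        ('model', estimator),          # matches the 'model__' prefix in param_grid
    ])
    search = GridSearchCV(pipeline, param_grid, cv=cv, scoring=scoring)
    search.fit(X, y)
    return search.best_estimator_, search.best_params_
```

The same structure serves both the regression and classification variants; only the estimator and scoring metric change.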
Functions
regression_hyperparameter_tuning(X, y, estimator, param_grid, cv=5, scoring=None)
Performs hyperparameter tuning for regression models.
- Parameters:
  - `X` (pd.DataFrame): Feature matrix.
  - `y` (pd.Series): Numeric target variable.
  - `estimator`: A scikit-learn regressor (e.g., `LinearRegression()`).
  - `param_grid` (dict): Parameter names and lists of values (e.g., `{'model__fit_intercept': [True, False]}`).
  - `cv` (int): Number of cross-validation folds (default 5).
  - `scoring` (str): Scoring metric (e.g., `'neg_root_mean_squared_error'`).
- Returns:
  - `best_model`: The pipeline with the best hyperparameters.
  - `best_params` (dict): The dictionary of best hyperparameters.
Example:
from utils.supervised_hyperparameter_tuning import regression_hyperparameter_tuning
from sklearn.linear_model import LinearRegression
X = ... # Your regression features
y = ... # Your numeric target variable
param_grid = {
'model__fit_intercept': [True, False]
}
best_model, best_params = regression_hyperparameter_tuning(
X, y, LinearRegression(), param_grid, scoring='neg_root_mean_squared_error'
)
classification_hyperparameter_tuning(X, y, estimator, param_grid, cv=5, scoring=None)
Performs hyperparameter tuning for classification models.
- Parameters:
  - `X` (pd.DataFrame): Feature matrix.
  - `y` (pd.Series): Target variable (binary or multi-class).
  - `estimator`: A scikit-learn classifier (e.g., `LogisticRegression()`, `RandomForestClassifier()`).
  - `param_grid` (dict): Parameter names and lists of values (e.g., `{'model__n_estimators': [100, 200]}`).
  - `cv` (int): Number of cross-validation folds (default 5).
  - `scoring` (str): Scoring metric (e.g., `'accuracy'`, `'f1'`, `'roc_auc'`).
- Returns:
  - `best_model`: The pipeline with the best hyperparameters.
  - `best_params` (dict): The dictionary of best hyperparameters.
Example:
from utils.supervised_hyperparameter_tuning import classification_hyperparameter_tuning
from sklearn.ensemble import RandomForestClassifier
X = ... # Your classification features
y = ... # Binary or multi-class labels
param_grid = {
'model__n_estimators': [100, 200],
'model__max_depth': [None, 10]
}
best_model, best_params = classification_hyperparameter_tuning(
X, y, RandomForestClassifier(), param_grid, scoring='accuracy'
)
unsupervised_hyperparameter_tuning.py
This script provides a function for hyperparameter tuning of clustering models using silhouette score as the objective metric. Unlike supervised approaches, clustering does not have labeled data, so the silhouette score is used to measure how well-separated the clusters are.
Functions
clustering_hyperparameter_tuning(X, estimator, param_grid, scoring='silhouette', cv=5)
A simple manual hyperparameter search for clustering models.
- Parameters:
  - `X` (array-like): Feature matrix for clustering.
  - `estimator`: A scikit-learn clustering estimator supporting `.fit(X)` and either `.labels_` or `.predict(X)` (e.g., `KMeans`, `DBSCAN`, `GaussianMixture`).
  - `param_grid` (dict): Dictionary of hyperparameters (e.g., `{'model__n_clusters': [2, 3, 4]}`).
  - `scoring` (str): Only `'silhouette'` is supported.
  - `cv` (int): Optionally, you could do repeated subsampling or more advanced logic for stabler estimates, but the default implementation does a single fit.
- Returns:
  - `best_estimator`: The fitted estimator with the best silhouette score.
  - `best_params` (dict): The dictionary of best hyperparameters found.
Key Steps:
- Parameter Loop: For each combination of parameters in `ParameterGrid(param_grid)`, clone and fit the estimator.
- Retrieve Labels: If the estimator has `.labels_`, use it; otherwise use `.predict(X)`.
- Compute Silhouette: If more than one cluster is found, calculate `silhouette_score(X, labels)`.
- Track the Best: Keep track of the parameter set yielding the highest silhouette score.
- Fallback: If no parameter combination produces more than one cluster, fall back to the original estimator.
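The steps above can be sketched as follows (an assumed implementation, not the project's actual code; here the estimator is fit directly rather than inside a pipeline, so parameter names omit the `model__` prefix used by the script):

```python
import numpy as np
from sklearn.base import clone
from sklearn.metrics import silhouette_score
from sklearn.model_selection import ParameterGrid

def clustering_search_sketch(X, estimator, param_grid):
    """Manual silhouette-based search over a parameter grid."""
    best_score, best_estimator, best_params = -1.0, None, None
    for params in ParameterGrid(param_grid):
        model = clone(estimator).set_params(**params)
        model.fit(X)
        # Prefer .labels_ (KMeans, DBSCAN); fall back to .predict(X) (GaussianMixture)
        labels = getattr(model, 'labels_', None)
        if labels is None:
            labels = model.predict(X)
        # Silhouette is undefined for a single cluster
        if len(np.unique(labels)) > 1:
            score = silhouette_score(X, labels)
            if score > best_score:
                best_score, best_estimator, best_params = score, model, params
    if best_estimator is None:
        # Fallback: no combination produced more than one cluster
        best_estimator, best_params = clone(estimator).fit(X), {}
    return best_estimator, best_params
```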
Example:
from utils.unsupervised_hyperparameter_tuning import clustering_hyperparameter_tuning
from sklearn.cluster import KMeans
X = ... # Your numeric data for clustering
param_grid = {
'model__n_clusters': [2, 3, 4],
'model__init': ['k-means++', 'random']
}
best_model, best_params = clustering_hyperparameter_tuning(
X, KMeans(random_state=42), param_grid, scoring='silhouette'
)
print("Best parameters found:", best_params)
Note: This approach is simpler than using GridSearchCV for clustering because unsupervised tasks do not have a “true” label. The silhouette score is a common measure, but you could adapt the function for other internal cluster metrics if desired.
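As one example of such an adaptation, the search could use the Davies-Bouldin index instead. Since lower Davies-Bouldin values indicate better clustering, the comparison direction flips relative to silhouette. This is a hypothetical variant, not part of the project's code:

```python
import numpy as np
from sklearn.base import clone
from sklearn.metrics import davies_bouldin_score
from sklearn.model_selection import ParameterGrid

def davies_bouldin_search(X, estimator, param_grid):
    """Hypothetical variant: pick the grid point minimizing Davies-Bouldin."""
    best_score, best_params = np.inf, None
    for params in ParameterGrid(param_grid):
        model = clone(estimator).set_params(**params).fit(X)
        labels = getattr(model, 'labels_', None)
        if labels is None:
            labels = model.predict(X)
        # Like silhouette, the index needs at least two clusters
        if len(np.unique(labels)) > 1:
            score = davies_bouldin_score(X, labels)
            if score < best_score:  # lower Davies-Bouldin = tighter, better-separated clusters
                best_score, best_params = score, params
    return best_params, best_score
```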