Spaces:
Sleeping
Sleeping
Commit
·
7c045bd
1
Parent(s):
6329c3b
Add application file
Browse files- README.md +208 -10
- app.py +272 -0
- data/README.md +1 -0
- data/datasets/README.md +21 -0
- data/datasets/kaggle_data.py +115 -0
- data/raw/README.md +1 -0
- models/README.md +1 -0
- models/supervised/classification/README.md +32 -0
- models/supervised/classification/adaboost_classifier.py +25 -0
- models/supervised/classification/catboost_classifier.py +28 -0
- models/supervised/classification/decision_tree_classifier.py +31 -0
- models/supervised/classification/extra_trees_classifier.py +26 -0
- models/supervised/classification/gaussian_nb.py +26 -0
- models/supervised/classification/gradient_boosting_classifier.py +25 -0
- models/supervised/classification/knn_classifier.py +28 -0
- models/supervised/classification/lightgbm_classifier.py +27 -0
- models/supervised/classification/linear_discriminant_analysis.py +28 -0
- models/supervised/classification/logistic_regression.py +36 -0
- models/supervised/classification/mlp_classifier.py +31 -0
- models/supervised/classification/quadratic_discriminant_analysis.py +26 -0
- models/supervised/classification/random_forest_classifier.py +26 -0
- models/supervised/classification/svc.py +29 -0
- models/supervised/classification/xgboost_classifier.py +27 -0
- requirements.txt +12 -0
- scripts/README.md +99 -0
- scripts/train_classification_model.py +203 -0
- utils/README.md +77 -0
- utils/supervised_hyperparameter_tuning.py +213 -0
README.md
CHANGED
|
@@ -1,13 +1,211 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
---
|
| 2 |
-
|
| 3 |
-
|
| 4 |
-
|
| 5 |
-
|
| 6 |
-
|
| 7 |
-
|
| 8 |
-
|
| 9 |
-
|
| 10 |
-
license
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 11 |
---
|
| 12 |
|
| 13 |
-
|
|
|
|
| 1 |
+
# AI-Algorithms-Made-Easy
|
| 2 |
+
|
| 3 |
+
**Under Development**
|
| 4 |
+
|
| 5 |
+

|
| 6 |
+
|
| 7 |
+
Welcome to **AI-Algorithms-Made-Easy**! This project is a comprehensive collection of artificial intelligence algorithms implemented from scratch using **PyTorch**. Our goal is to demystify AI by providing clear, easy-to-understand code and detailed explanations for each algorithm.
|
| 8 |
+
|
| 9 |
+
Whether you're a beginner in machine learning or an experienced practitioner, this project offers resources to enhance your understanding and skills in AI.
|
| 10 |
+
|
| 11 |
+
---
|
| 12 |
+
|
| 13 |
+
## Project Description
|
| 14 |
+
|
| 15 |
+
**AI-Algorithms-Made-Easy** aims to make AI accessible to everyone by:
|
| 16 |
+
|
| 17 |
+
- **Intuitive Implementations**: Breaking down complex algorithms into understandable components with step-by-step code.
|
| 18 |
+
- **Educational Notebooks**: Providing Jupyter notebooks that combine theory with practical examples.
|
| 19 |
+
- **Interactive Demos**: Offering user-friendly interfaces built with **Gradio** to experiment with algorithms in real-time.
|
| 20 |
+
- **Comprehensive Documentation**: Supplying in-depth guides and resources to support your AI learning journey.
|
| 21 |
+
|
| 22 |
+
Our mission is to simplify the learning process and provide hands-on tools to explore and understand AI concepts effectively.
|
| 23 |
+
|
| 24 |
---
|
| 25 |
+
|
| 26 |
+
## Table of Contents
|
| 27 |
+
|
| 28 |
+
- [Algorithms Implemented](#algorithms-implemented)
|
| 29 |
+
- [Project Structure](#project-structure)
|
| 30 |
+
- [Installation](#installation)
|
| 31 |
+
- [Usage](#usage)
|
| 32 |
+
- [Contributing](#contributing)
|
| 33 |
+
- [License](#license)
|
| 34 |
+
- [Contact](#contact)
|
| 35 |
+
|
| 36 |
+
---
|
| 37 |
+
|
| 38 |
+
## Algorithms Implemented
|
| 39 |
+
|
| 40 |
+
*This project is currently under development. Stay tuned for updates!*
|
| 41 |
+
|
| 42 |
+
### Supervised Learning (Scikit-Learn)
|
| 43 |
+
#### Regression ([Documentation](docs/Regression_Documentation.md), [Interface](https://huggingface.co/spaces/mboukabous/train_regression), [Notebook](notebooks/Train_Supervised_Regression_Models.ipynb) [](https://colab.research.google.com/github/mboukabous/AI-Algorithms-Made-Easy/blob/main/notebooks/Train_Supervised_Regression_Models.ipynb))
|
| 44 |
+
- [Linear Regression](models/supervised/regression/linear_regression.py)
|
| 45 |
+
- [Ridge Regression](models/supervised/regression/ridge_regression.py)
|
| 46 |
+
- [Lasso Regression](models/supervised/regression/lasso_regression.py)
|
| 47 |
+
- [ElasticNet Regression](models/supervised/regression/elasticnet_regression.py)
|
| 48 |
+
- [Decision Tree](models/supervised/regression/decision_tree_regressor.py)
|
| 49 |
+
- [Random Forest (Bagging)](models/supervised/regression/random_forest_regressor.py)
|
| 50 |
+
- [Gradient Boosting (Boosting)](models/supervised/regression/gradient_boosting_regressor.py)
|
| 51 |
+
- [AdaBoost (Boosting)](models/supervised/regression/adaboost_regressor.py)
|
| 52 |
+
- [XGBoost (Boosting)](models/supervised/regression/xgboost_regressor.py)
|
| 53 |
+
- [LightGBM](models/supervised/regression/lightgbm_regressor.py)
|
| 54 |
+
- [CatBoost](models/supervised/regression/catboost_regressor.py)
|
| 55 |
+
- [Support Vector Regressor (SVR)](models/supervised/regression/support_vector_regressor.py)
|
| 56 |
+
- [K-Nearest Neighbors (KNN) Regressor](models/supervised/regression/knn_regressor.py)
|
| 57 |
+
- [Extra Trees Regressor](models/supervised/regression/extra_trees_regressor.py)
|
| 58 |
+
- [Multilayer Perceptron (MLP) Regressor](models/supervised/regression/mlp_regressor.py)
|
| 59 |
+
|
| 60 |
+
#### Classification ([Documentation](docs/Classification_Documentation.md))
|
| 61 |
+
- [Logistic Regression](models/supervised/classification/logistic_regression.py)
|
| 62 |
+
- [Decision Tree Classifier](models/supervised/classification/decision_tree_classifier.py)
|
| 63 |
+
- [Random Forest Classifier (Bagging)](models/supervised/classification/random_forest_classifier.py)
|
| 64 |
+
- [Extra Trees Classifier](models/supervised/classification/extra_trees_classifier.py)
|
| 65 |
+
- [Gradient Boosting Classifier (Boosting)](models/supervised/classification/gradient_boosting_classifier.py)
|
| 66 |
+
- [AdaBoost Classifier (Boosting)](models/supervised/classification/adaboost_classifier.py)
|
| 67 |
+
- [XGBoost Classifier (Boosting)](models/supervised/classification/xgboost_classifier.py)
|
| 68 |
+
- [LightGBM Classifier (Boosting)](models/supervised/classification/lightgbm_classifier.py)
|
| 69 |
+
- [CatBoost Classifier (Boosting)](models/supervised/classification/catboost_classifier.py)
|
| 70 |
+
- [Support Vector Classifier (SVC)](models/supervised/classification/svc.py)
|
| 71 |
+
- [K-Nearest Neighbors (KNN) Classifier](models/supervised/classification/knn_classifier.py)
|
| 72 |
+
- [Multilayer Perceptron (MLP) Classifier](models/supervised/classification/mlp_classifier.py)
|
| 73 |
+
- [GaussianNB (Naive Bayes Classifier)](models/supervised/classification/gaussian_nb.py)
|
| 74 |
+
- [Linear Discriminant Analysis (LDA)](models/supervised/classification/linear_discriminant_analysis.py)
|
| 75 |
+
- [Quadratic Discriminant Analysis (QDA)](models/supervised/classification/quadratic_discriminant_analysis.py)
|
| 76 |
+
|
| 77 |
+
### Unsupervised Learning
|
| 78 |
+
|
| 79 |
+
- K-Means Clustering
|
| 80 |
+
- Principal Component Analysis (PCA)
|
| 81 |
+
- Hierarchical Clustering
|
| 82 |
+
- Autoencoders
|
| 83 |
+
- Isolation Forest
|
| 84 |
+
- Gaussian Mixture Models
|
| 85 |
+
|
| 86 |
+
### Deep Learning (DL)
|
| 87 |
+
|
| 88 |
+
- Convolutional Neural Networks (CNN)
|
| 89 |
+
- Recurrent Neural Networks (RNN)
|
| 90 |
+
- Long Short-Term Memory Networks (LSTM)
|
| 91 |
+
- Gated Recurrent Unit (GRU)
|
| 92 |
+
- Generative Adversarial Networks (GAN)
|
| 93 |
+
- Transformers
|
| 94 |
+
- Attention Mechanisms
|
| 95 |
+
|
| 96 |
+
### Computer Vision
|
| 97 |
+
|
| 98 |
+
- Image Classification/Transfer learning (TL)
|
| 99 |
+
- Object Detection
|
| 100 |
+
- Semantic Segmentation
|
| 101 |
+
- Style Transfer
|
| 102 |
+
- Image Captioning
|
| 103 |
+
- Generative Models
|
| 104 |
+
|
| 105 |
+
### Natural Language Processing (NLP)
|
| 106 |
+
|
| 107 |
+
- Sentiment Analysis (SA)
|
| 108 |
+
- Machine Translation
|
| 109 |
+
- Named Entity Recognition (NER)
|
| 110 |
+
- Text Classification
|
| 111 |
+
- Text Summarization
|
| 112 |
+
- Question Answering
|
| 113 |
+
- Language Modeling
|
| 114 |
+
- Transformer Models
|
| 115 |
+
|
| 116 |
+
### Time Series Analysis
|
| 117 |
+
|
| 118 |
+
- Time Series Forecasting with RNNs
|
| 119 |
+
- Temporal Convolutional Networks (TCNs)
|
| 120 |
+
- Transformers for Time Series
|
| 121 |
+
|
| 122 |
+
### Reinforcement Learning
|
| 123 |
+
|
| 124 |
+
- Q-Learning
|
| 125 |
+
- Deep Q-Networks (DQN)
|
| 126 |
+
- Policy Gradients
|
| 127 |
+
- Actor-Critic Methods
|
| 128 |
+
- Proximal Policy Optimization
|
| 129 |
+
|
| 130 |
+
### and more ...
|
| 131 |
+
|
| 132 |
+
---
|
| 133 |
+
|
| 134 |
+
## Project Structure
|
| 135 |
+
|
| 136 |
+
- **models/**: Contains all the AI algorithm implementations, organized by category.
|
| 137 |
+
- **data/**: Includes datasets and data preprocessing utilities.
|
| 138 |
+
- **utils/**: Utility scripts and helper functions.
|
| 139 |
+
- **scripts/**: Executable scripts for training, testing, and other tasks.
|
| 140 |
+
- **interfaces/**: Interactive applications using Gradio and web interfaces.
|
| 141 |
+
- **notebooks/**: Jupyter notebooks for tutorials and demonstrations.
|
| 142 |
+
- **deploy/**: Scripts and instructions for deploying models.
|
| 143 |
+
- **website/**: Files related to the project website.
|
| 144 |
+
- **docs/**: Project documentation.
|
| 145 |
+
- **examples/**: Example scripts demonstrating how to use the models.
|
| 146 |
+
|
| 147 |
+
---
|
| 148 |
+
|
| 149 |
+
## Installation
|
| 150 |
+
|
| 151 |
+
*Installation instructions will be provided once the initial release is available.*
|
| 152 |
+
|
| 153 |
+
---
|
| 154 |
+
|
| 155 |
+
## Usage
|
| 156 |
+
|
| 157 |
+
*Usage examples and tutorials will be added as the project develops.*
|
| 158 |
+
|
| 159 |
+
---
|
| 160 |
+
|
| 161 |
+
## Contributing
|
| 162 |
+
|
| 163 |
+
We welcome contributions from the community! To contribute:
|
| 164 |
+
|
| 165 |
+
1. **Fork the repository** on GitHub.
|
| 166 |
+
2. **Clone your fork** to your local machine.
|
| 167 |
+
3. **Create a new branch** for your feature or bug fix.
|
| 168 |
+
4. **Make your changes** and commit them with descriptive messages.
|
| 169 |
+
5. **Push your changes** to your forked repository.
|
| 170 |
+
6. **Open a pull request** to the main repository.
|
| 171 |
+
|
| 172 |
+
Please read our [Contributing Guidelines](CONTRIBUTING.md) for more details.
|
| 173 |
+
|
| 174 |
+
---
|
| 175 |
+
|
| 176 |
+
## License
|
| 177 |
+
|
| 178 |
+
This project is licensed under the **MIT License** - see the [LICENSE](LICENSE) file for details.
|
| 179 |
+
|
| 180 |
+
---
|
| 181 |
+
|
| 182 |
+
## Contact
|
| 183 |
+
|
| 184 |
+
For questions, suggestions, or feedback:
|
| 185 |
+
|
| 186 |
+
- **GitHub Issues**: Please open an issue on the [GitHub repository](https://github.com/mboukabous/AI-Algorithms-Made-Easy/issues).
|
| 187 |
+
- **Email**: You can reach us at [m.boukabous95@gmail.com](mailto:m.boukabous95@gmail.com).
|
| 188 |
+
|
| 189 |
+
---
|
| 190 |
+
|
| 191 |
+
*Thank you for your interest in **AI-Algorithms-Made-Easy**! We are excited to build this resource and appreciate your support and contributions.*
|
| 192 |
+
|
| 193 |
+
---
|
| 194 |
+
|
| 195 |
+
## Acknowledgments
|
| 196 |
+
|
| 197 |
+
- **PyTorch**: For providing an excellent deep learning framework.
|
| 198 |
+
- **Gradio**: For simplifying the creation of interactive demos.
|
| 199 |
+
- **OpenAI's ChatGPT**: For assistance in planning and drafting project materials.
|
| 200 |
+
|
| 201 |
+
---
|
| 202 |
+
|
| 203 |
+
## Stay Updated
|
| 204 |
+
|
| 205 |
+
- **Watch** this repository for updates.
|
| 206 |
+
- **Star** the project if you find it helpful.
|
| 207 |
+
- **Share** with others who might be interested in learning AI algorithms.
|
| 208 |
+
|
| 209 |
---
|
| 210 |
|
| 211 |
+
*Let's make AI accessible and easy to learn for everyone!*
|
app.py
ADDED
|
@@ -0,0 +1,272 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
|
| 2 |
+
"""
|
| 3 |
+
Gradio Interface for Training Classification Models
|
| 4 |
+
|
| 5 |
+
This script provides a Gradio-based user interface to train classification models using various datasets
|
| 6 |
+
and algorithms. It allows users to select models, preprocess data, specify hyperparameters, and visualize
|
| 7 |
+
results through an intuitive web interface.
|
| 8 |
+
|
| 9 |
+
Features:
|
| 10 |
+
- **Model Selection**: Choose from classification algorithms in `models/supervised/classification`.
|
| 11 |
+
- **Dataset Input Options**:
|
| 12 |
+
- Upload a local CSV file.
|
| 13 |
+
- Specify a path to a dataset.
|
| 14 |
+
- Download datasets from Kaggle by uploading `kaggle.json` and specifying a competition name.
|
| 15 |
+
- **Hyperparameter Customization**: Modify parameters such as test size, random state, CV folds, and scoring metric.
|
| 16 |
+
- **Visualizations**: If enabled, generate classification metrics charts and confusion matrices after training.
|
| 17 |
+
- **Interactive Training**: Outputs training metrics, best hyperparameters, and paths to saved models.
|
| 18 |
+
|
| 19 |
+
Usage:
|
| 20 |
+
- Place this script in `interfaces/gradio/`.
|
| 21 |
+
- Ensure proper project structure and availability of `train_classification_model.py` and classification model modules.
|
| 22 |
+
- Run the script. A Gradio interface will launch for interactive model training.
|
| 23 |
+
|
| 24 |
+
Requirements:
|
| 25 |
+
- Python 3.7 or higher
|
| 26 |
+
- Required Python libraries as specified in `requirements.txt`
|
| 27 |
+
- Properly structured project with `train_classification_model.py` and classification modules.
|
| 28 |
+
"""
|
| 29 |
+
|
| 30 |
+
import gradio as gr
|
| 31 |
+
import pandas as pd
|
| 32 |
+
import os
|
| 33 |
+
import subprocess
|
| 34 |
+
import sys
|
| 35 |
+
import glob
|
| 36 |
+
import re
|
| 37 |
+
|
| 38 |
+
# Add the project root directory to the Python path so that sibling
# packages (data/, models/, scripts/) are importable when this script
# lives two levels below the repository root (interfaces/gradio/).
current_dir = os.path.dirname(os.path.abspath(__file__))
project_root = os.path.abspath(os.path.join(current_dir, '../../'))
sys.path.append(project_root)
| 42 |
+
|
| 43 |
+
def get_classification_model_modules():
    """Discover the available classification model modules.

    Scans ``models/supervised/classification`` under the project root for
    ``*.py`` files and returns their module names (file names without the
    ``.py`` extension), excluding package initializer files.

    Returns:
        list[str]: Module names suitable for the ``--model_module`` argument
        of ``scripts/train_classification_model.py``.
    """
    models_dir = os.path.join(project_root, 'models', 'supervised', 'classification')
    model_files = glob.glob(os.path.join(models_dir, '*.py'))

    print(f"Looking for model files in: {models_dir}")
    print(f"Found model files: {model_files}")

    # Strip directory and extension in one pass; the original built the same
    # list twice (a second no-op f-string comprehension) for no effect.
    return [os.path.splitext(os.path.basename(f))[0]
            for f in model_files
            if not f.endswith('__init__.py')]
|
| 54 |
+
|
| 55 |
+
def download_kaggle_data(json_path, competition_name):
    """Download Kaggle competition data and return the extraction directory.

    Thin wrapper around ``data.datasets.kaggle_data.get_kaggle_data`` that
    always requests competition data (``is_competition=True``).
    """
    # Imported lazily so the Kaggle API is only required when this path is used.
    from data.datasets.kaggle_data import get_kaggle_data

    return get_kaggle_data(
        json_path=json_path,
        data_name=competition_name,
        is_competition=True,
    )
| 60 |
+
|
| 61 |
+
def train_model(model_module, data_option, data_file, data_path, data_name_kaggle, kaggle_json_file, competition_name,
                target_variable, drop_columns, test_size, random_state, cv_folds,
                scoring_metric, model_save_path, results_save_path, visualize):
    """Resolve the dataset location, then run the training script as a subprocess.

    Parameters mirror the Gradio inputs: model module name, one of three data
    input modes ('Upload Data File' / 'Provide Data Path' / 'Download from
    Kaggle') with their associated values, target/drop column selections, and
    hyperparameters forwarded to ``scripts/train_classification_model.py``.

    Returns:
        tuple: ``(message, plot_image_path)`` where ``message`` is the status
        or captured training output and ``plot_image_path`` is the confusion
        matrix image path, or ``None`` when unavailable.
    """
    # --- Step 1: resolve data_path according to the selected input mode -----
    if data_option == 'Upload Data File':
        if data_file is None:
            return "Please upload a data file.", None
        data_path = data_file  # gr.File(type="filepath") yields a path string
    elif data_option == 'Provide Data Path':
        if not os.path.exists(data_path):
            return "Provided data path does not exist.", None
    elif data_option == 'Download from Kaggle':
        if kaggle_json_file is None:
            return "Please upload your kaggle.json file.", None
        else:
            # Install credentials where the Kaggle API looks for them.
            import shutil
            kaggle_config_dir = os.path.expanduser('~/.kaggle')
            os.makedirs(kaggle_config_dir, exist_ok=True)
            kaggle_json_path = os.path.join(kaggle_config_dir, 'kaggle.json')
            shutil.copy(kaggle_json_file.name, kaggle_json_path)
            # Kaggle API refuses keys readable by other users.
            os.chmod(kaggle_json_path, 0o600)
        data_dir = download_kaggle_data(json_path=kaggle_json_path, competition_name=competition_name)
        if data_dir is None:
            return "Failed to download data from Kaggle.", None
        # Use the specified data_name_kaggle inside the downloaded folder.
        data_path = os.path.join(data_dir, data_name_kaggle)
        if not os.path.exists(data_path):
            return f"{data_name_kaggle} not found in the downloaded Kaggle data.", None
    else:
        return "Invalid data option selected.", None

    # --- Step 2: build the command line for train_classification_model.py ---
    cmd = [sys.executable, os.path.join(project_root, 'scripts', 'train_classification_model.py')]
    cmd.extend(['--model_module', model_module])
    cmd.extend(['--data_path', data_path])
    cmd.extend(['--target_variable', target_variable])

    # Only pass non-default values to keep the command minimal.
    if drop_columns:
        cmd.extend(['--drop_columns', ','.join(drop_columns)])
    if test_size != 0.2:
        cmd.extend(['--test_size', str(test_size)])
    if random_state != 42:
        cmd.extend(['--random_state', str(int(random_state))])
    if cv_folds != 5:
        cmd.extend(['--cv_folds', str(int(cv_folds))])
    if scoring_metric:
        cmd.extend(['--scoring_metric', scoring_metric])
    if model_save_path:
        cmd.extend(['--model_path', model_save_path])
    if results_save_path:
        cmd.extend(['--results_path', results_save_path])
    if visualize:
        cmd.append('--visualize')

    print(f"Executing command: {' '.join(cmd)}")

    # --- Step 3: run training and collect output ----------------------------
    try:
        result = subprocess.run(cmd, capture_output=True, text=True)
        output = result.stdout
        errors = result.stderr
        if result.returncode != 0:
            return f"Error during training:\n{errors}", None
        else:
            # Remove matplotlib's "Figure(WxH)" repr noise from the output.
            output = re.sub(r"Figure\(\d+x\d+\)", "", output).strip()

            # Attempt to locate confusion_matrix.png for display.
            plot_image_path = None
            if results_save_path:
                plot_image_path = os.path.join(results_save_path, 'confusion_matrix.png')
            elif 'Confusion matrix saved to ' in output:
                # Default path: parse what the training script printed.
                # BUGFIX: the original indexed split(...)[1] unconditionally,
                # raising IndexError when the marker line was absent (e.g.
                # visualize disabled) and masking it as a generic error.
                plot_image_path = output.split('Confusion matrix saved to ')[1].strip()
            return f"Training completed successfully.\n\n{output}", plot_image_path
    except Exception as e:
        return f"An error occurred:\n{str(e)}", None
|
| 140 |
+
|
| 141 |
+
def get_columns_from_data(data_option, data_file, data_path, data_name_kaggle, kaggle_json_file, competition_name):
    """Return the column names of the selected dataset, or ``[]`` on failure.

    Resolves the CSV location the same way as ``train_model`` (uploaded file,
    explicit path, or Kaggle download) and reads only the header information
    needed to populate the column-selection widgets.
    """
    # Resolve the CSV location for the chosen input mode; any missing piece
    # means we cannot offer columns, so we return an empty list.
    if data_option == 'Upload Data File':
        if data_file is None:
            return []
        data_path = data_file
    elif data_option == 'Provide Data Path':
        if not os.path.exists(data_path):
            return []
    elif data_option == 'Download from Kaggle':
        if kaggle_json_file is None:
            return []
        else:
            # Place the API key where the Kaggle client expects it.
            import shutil
            kaggle_config_dir = os.path.expanduser('~/.kaggle')
            os.makedirs(kaggle_config_dir, exist_ok=True)
            kaggle_json_path = os.path.join(kaggle_config_dir, 'kaggle.json')
            shutil.copy(kaggle_json_file.name, kaggle_json_path)
            os.chmod(kaggle_json_path, 0o600)
        data_dir = download_kaggle_data(json_path=kaggle_json_path, competition_name=competition_name)
        if data_dir is None:
            return []
        data_path = os.path.join(data_dir, data_name_kaggle)
        if not os.path.exists(data_path):
            return []
    else:
        return []

    try:
        frame = pd.read_csv(data_path)
        return frame.columns.tolist()
    except Exception as e:
        print(f"Error reading data file: {e}")
        return []
|
| 176 |
+
|
| 177 |
+
def update_columns(data_option, data_file, data_path, data_name_kaggle, kaggle_json_file, competition_name):
    """Refresh the target-variable and drop-columns widgets from the dataset.

    Returns two ``gr.update`` objects (one per widget) carrying the dataset's
    column names, or empty choice lists when the columns cannot be read.
    """
    cols = get_columns_from_data(
        data_option, data_file, data_path, data_name_kaggle, kaggle_json_file, competition_name
    )
    # Both dropdown-style components receive the same (possibly empty) list.
    choices = cols if cols else []
    return gr.update(choices=choices), gr.update(choices=choices)
|
| 183 |
+
|
| 184 |
+
# Discover selectable model modules once at startup.
model_modules = get_classification_model_modules()

if not model_modules:
    print("No classification model modules found. Check 'models/supervised/classification' directory.")

# ---------------------------------------------------------------------------
# Gradio UI layout and event wiring.
# ---------------------------------------------------------------------------
with gr.Blocks() as demo:
    gr.Markdown("# Train a Classification Model")

    with gr.Row():
        model_module_input = gr.Dropdown(choices=model_modules, label="Select Classification Model Module")
        scoring_metric_input = gr.Textbox(value='accuracy', label="Scoring Metric (e.g., accuracy, f1, roc_auc)")

    with gr.Row():
        test_size_input = gr.Slider(minimum=0.1, maximum=0.5, step=0.05, value=0.2, label="Test Size")
        random_state_input = gr.Number(value=42, label="Random State")
        cv_folds_input = gr.Number(value=5, label="CV Folds", precision=0)

    visualize_input = gr.Checkbox(label="Generate Visualizations (metrics & confusion matrix)", value=True)

    with gr.Row():
        model_save_path_input = gr.Textbox(value='', label="Model Save Path (optional)")
        results_save_path_input = gr.Textbox(value='', label="Results Save Path (optional)")

    with gr.Tab("Data Input"):
        data_option_input = gr.Radio(choices=['Upload Data File', 'Provide Data Path', 'Download from Kaggle'], label="Data Input Option", value='Upload Data File')

        # One column per data-input mode; visibility is toggled by the radio
        # button via toggle_data_input below.
        upload_data_col = gr.Column(visible=True)
        with upload_data_col:
            data_file_input = gr.File(label="Upload CSV Data File", type="filepath")

        data_path_col = gr.Column(visible=False)
        with data_path_col:
            data_path_input = gr.Textbox(value='', label="Data File Path")

        kaggle_data_col = gr.Column(visible=False)
        with kaggle_data_col:
            kaggle_json_file_input = gr.File(label="Upload kaggle.json File", type="filepath")
            competition_name_input = gr.Textbox(value='', label="Kaggle Competition Name")
            data_name_kaggle_input = gr.Textbox(value='train.csv', label="Data File Name (in Kaggle dataset)")

        def toggle_data_input(option):
            # Show exactly one of the three data-input columns.
            if option == 'Upload Data File':
                return gr.update(visible=True), gr.update(visible=False), gr.update(visible=False)
            elif option == 'Provide Data Path':
                return gr.update(visible=False), gr.update(visible=True), gr.update(visible=False)
            elif option == 'Download from Kaggle':
                return gr.update(visible=False), gr.update(visible=False), gr.update(visible=True)

        data_option_input.change(
            fn=toggle_data_input,
            inputs=[data_option_input],
            outputs=[upload_data_col, data_path_col, kaggle_data_col]
        )

        update_cols_btn = gr.Button("Update Columns")

        # Populated by update_columns once a dataset has been selected.
        target_variable_input = gr.Dropdown(choices=[], label="Select Target Variable")
        drop_columns_input = gr.CheckboxGroup(choices=[], label="Columns to Drop")

        update_cols_btn.click(
            fn=update_columns,
            inputs=[data_option_input, data_file_input, data_path_input, data_name_kaggle_input, kaggle_json_file_input, competition_name_input],
            outputs=[target_variable_input, drop_columns_input]
        )

    train_btn = gr.Button("Train Model")
    output_display = gr.Textbox(label="Output")
    image_display = gr.Image(label="Visualization", visible=True)

    def run_training(*args):
        # Forward all UI values to train_model; only hand the image path to
        # gr.Image when the file actually exists on disk.
        output_text, plot_image_path = train_model(*args)
        if plot_image_path and os.path.exists(plot_image_path):
            return output_text, plot_image_path
        else:
            return output_text, None

    train_btn.click(
        fn=run_training,
        inputs=[
            model_module_input, data_option_input, data_file_input, data_path_input,
            data_name_kaggle_input, kaggle_json_file_input, competition_name_input,
            target_variable_input, drop_columns_input, test_size_input, random_state_input, cv_folds_input,
            scoring_metric_input, model_save_path_input, results_save_path_input, visualize_input
        ],
        outputs=[output_display, image_display]
    )

if __name__ == "__main__":
    demo.launch()
|
data/README.md
ADDED
|
@@ -0,0 +1 @@
|
|
|
|
|
|
|
| 1 |
+
# data
|
data/datasets/README.md
ADDED
|
@@ -0,0 +1,21 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Datasets Utilities
|
| 2 |
+
|
| 3 |
+
This folder contains utility scripts for handling datasets, including downloading data from Kaggle.
|
| 4 |
+
|
| 5 |
+
## 📄 Scripts
|
| 6 |
+
|
| 7 |
+
### `kaggle_data.py`
|
| 8 |
+
|
| 9 |
+
- **Description**: A Python script to download Kaggle datasets or competition data seamlessly, supporting Google Colab, local Linux/Mac, and Windows environments.
|
| 10 |
+
- **Path**: [`data/datasets/kaggle_data.py`](kaggle_data.py)
|
| 11 |
+
- **Key Function**: `get_kaggle_data(json_path, data_name, is_competition=False, output_dir='data/raw')`
|
| 12 |
+
- **Example**:
|
| 13 |
+
|
| 14 |
+
```python
|
| 15 |
+
from kaggle_data import get_kaggle_data
|
| 16 |
+
|
| 17 |
+
# Download a standard Kaggle dataset
|
| 18 |
+
dataset_path = get_kaggle_data("kaggle.json", "paultimothymooney/chest-xray-pneumonia")
|
| 19 |
+
|
| 20 |
+
# Download competition data
|
| 21 |
+
competition_path = get_kaggle_data("kaggle.json", "house-prices-advanced-regression-techniques", is_competition=True)
|
data/datasets/kaggle_data.py
ADDED
|
@@ -0,0 +1,115 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
"""
Utility for downloading Kaggle datasets or competition data.

The helper detects whether it is running inside Google Colab, a local
Linux/Mac environment, or Windows, and installs the Kaggle API credentials
in the location the Kaggle CLI expects for that environment.

Requirements:
- Kaggle API installed (`pip install kaggle`)
- Kaggle API key (`kaggle.json`) with appropriate permissions.

Environment Detection:
- Google Colab: Uses `/root/.config/kaggle/kaggle.json`.
- Local Linux/Mac: Uses `~/.kaggle/kaggle.json`.
- Windows: Uses `C:\\Users\\<Username>\\.kaggle\\kaggle.json`.

Functions:
    get_kaggle_data(json_path: str, data_name: str, is_competition: bool = False, output_dir: str = "data/raw") -> str
"""

import os
import platform
import shutil
import subprocess
import sys
import zipfile


def get_kaggle_data(json_path: str, data_name: str, is_competition: bool = False, output_dir: str = "data/raw") -> str:
    """
    Download a Kaggle dataset or competition archive and extract it.

    Parameters:
        json_path (str): Path to your 'kaggle.json' API-key file.
        data_name (str): Kaggle dataset or competition name
            (e.g. 'paultimothymooney/chest-xray-pneumonia' or
            'house-prices-advanced-regression-techniques').
        is_competition (bool): True to download competition data instead of a
            dataset. Default is False.
        output_dir (str): Directory under which the data is saved and
            extracted. Default is 'data/raw'.

    Returns:
        str: Path to the extracted dataset folder, or None if the download
        or extraction produced nothing usable.

    Raises:
        OSError: If 'kaggle.json' is not found or cannot be copied.

    Example of Usage:
        # For downloading a standard dataset
        dataset_path = get_kaggle_data("kaggle.json", "paultimothymooney/chest-xray-pneumonia")
        print(f"Dataset is available at: {dataset_path}")

        # For downloading competition data
        competition_path = get_kaggle_data("kaggle.json", "house-prices-advanced-regression-techniques", is_competition=True)
        print(f"Competition data is available at: {competition_path}")
    """
    # Detect environment (Colab, local Linux/Mac, or Windows).
    is_colab = "google.colab" in sys.modules
    is_windows = platform.system() == "Windows"

    # Step 1: Install Kaggle API credentials where the CLI looks for them.
    try:
        if is_colab:
            config_dir = "/root/.config/kaggle"
            os.makedirs(config_dir, exist_ok=True)
            print("Setting up Kaggle API credentials for Colab environment.")
            credentials_path = os.path.join(config_dir, "kaggle.json")
            shutil.copy(json_path, credentials_path)
            os.chmod(credentials_path, 0o600)
        else:
            # Both local Linux/Mac and Windows use the home directory.
            config_dir = os.path.join(os.path.expanduser("~"), ".kaggle")
            os.makedirs(config_dir, exist_ok=True)
            print("Setting up Kaggle API credentials for local environment.")
            credentials_path = os.path.join(config_dir, "kaggle.json")
            # Do not overwrite credentials the user already installed.
            if not os.path.exists(credentials_path):
                shutil.copy(json_path, credentials_path)
                if not is_windows:
                    # chmod is a no-op / unsupported on Windows.
                    os.chmod(credentials_path, 0o600)
    except Exception as e:
        raise OSError(f"Could not set up Kaggle API credentials: {e}")

    # Step 2: Create the output directory and move into it so the CLI
    # downloads straight into the dataset folder.
    dataset_dir = os.path.join(output_dir, data_name.split('/')[-1])
    os.makedirs(dataset_dir, exist_ok=True)
    original_dir = os.getcwd()
    os.chdir(dataset_dir)

    # The finally-block guarantees the working directory is restored even if
    # anything below raises unexpectedly.
    try:
        # Step 3: Download via the Kaggle CLI. Passing the command as a list
        # (no shell) avoids shell-injection through `data_name`, and the
        # return code is checked explicitly — the old os.system() call could
        # fail silently because it never raises on a non-zero exit status.
        if is_competition:
            print(f"Downloading competition data: {data_name}")
            cmd = ["kaggle", "competitions", "download", "-c", data_name]
        else:
            print(f"Downloading dataset: {data_name}")
            cmd = ["kaggle", "datasets", "download", "-d", data_name]
        try:
            result = subprocess.run(cmd)
        except Exception as e:
            print(f"Error during download: {e}")
            return None
        if result.returncode != 0:
            print(f"Kaggle CLI exited with status {result.returncode}.")
            return None

        # Step 4: Unzip every downloaded archive, deleting each zip after a
        # successful extraction.
        zip_files = [f for f in os.listdir() if f.endswith(".zip")]
        if not zip_files:
            print("No zip files found. Please check the dataset or competition name.")
            return None

        for zip_file in zip_files:
            try:
                with zipfile.ZipFile(zip_file, "r") as zip_ref:
                    zip_ref.extractall()
                print(f"Extracted: {zip_file}")
                os.remove(zip_file)
            except Exception as e:
                print(f"Error extracting {zip_file}: {e}")
    finally:
        # Step 5: Always navigate back to the original directory.
        os.chdir(original_dir)

    return dataset_dir
data/raw/README.md
ADDED
|
@@ -0,0 +1 @@
|
|
|
|
|
|
|
| 1 |
+
# raw
|
models/README.md
ADDED
|
@@ -0,0 +1 @@
|
|
|
|
|
|
|
| 1 |
+
# models
|
models/supervised/classification/README.md
ADDED
|
@@ -0,0 +1,32 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Classification Models
|
| 2 |
+
|
| 3 |
+
This directory contains Python scripts that define various classification models and their associated hyperparameter grids. Each model file sets up a scikit-learn-compatible estimator and defines a parameter grid for use with the `train_classification_model.py` script.
|
| 4 |
+
|
| 5 |
+
These model definition files:
|
| 6 |
+
- Specify an estimator (e.g., `LogisticRegression()`, `RandomForestClassifier()`, `XGBClassifier()`).
|
| 7 |
+
- Define a `param_grid` dict for hyperparameter tuning using `GridSearchCV`.
|
| 8 |
+
- Optionally provide a `default_scoring` metric (e.g., `accuracy`).
|
| 9 |
+
- Work for both binary and multi-class classification tasks.
|
| 10 |
+
- Are intended to be flexible and modular, allowing easy swapping of models without changing other parts of the code.
|
| 11 |
+
|
| 12 |
+
**Note:** Preprocessing steps, hyperparameter tuning logic, and label encoding for categorical targets are handled externally by the scripts and utilities.
|
| 13 |
+
|
| 14 |
+
## Available Classification Models
|
| 15 |
+
|
| 16 |
+
- [Logistic Regression](logistic_regression.py)
|
| 17 |
+
- [Decision Tree Classifier](decision_tree_classifier.py)
|
| 18 |
+
- [Random Forest Classifier (Bagging)](random_forest_classifier.py)
|
| 19 |
+
- [Extra Trees Classifier](extra_trees_classifier.py)
|
| 20 |
+
- [Gradient Boosting Classifier (Boosting)](gradient_boosting_classifier.py)
|
| 21 |
+
- [AdaBoost Classifier (Boosting)](adaboost_classifier.py)
|
| 22 |
+
- [XGBoost Classifier (Boosting)](xgboost_classifier.py)
|
| 23 |
+
- [LightGBM Classifier (Boosting)](lightgbm_classifier.py)
|
| 24 |
+
- [CatBoost Classifier (Boosting)](catboost_classifier.py)
|
| 25 |
+
- [Support Vector Classifier (SVC)](svc.py)
|
| 26 |
+
- [K-Nearest Neighbors (KNN) Classifier](knn_classifier.py)
|
| 27 |
+
- [Multilayer Perceptron (MLP) Classifier](mlp_classifier.py)
|
| 28 |
+
- [GaussianNB (Naive Bayes Classifier)](gaussian_nb.py)
|
| 29 |
+
- [Linear Discriminant Analysis (LDA)](linear_discriminant_analysis.py)
|
| 30 |
+
- [Quadratic Discriminant Analysis (QDA)](quadratic_discriminant_analysis.py)
|
| 31 |
+
|
| 32 |
+
To train any of these models, specify the `--model_module` argument with the appropriate model name (e.g., `logistic_regression`) when running `train_classification_model.py`.
|
models/supervised/classification/adaboost_classifier.py
ADDED
|
@@ -0,0 +1,25 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
"""AdaBoost Classifier configuration.

Exposes the three objects the training pipeline expects:
- ``estimator``: an ``AdaBoostClassifier`` (default weak learner is a
  decision stump) with a fixed random seed.
- ``param_grid``: GridSearchCV search space; keys carry the ``model__``
  prefix so they address the estimator step inside the pipeline.
- ``default_scoring``: metric used when the caller supplies none.

Applicable to both binary and multi-class classification.
"""

from sklearn.ensemble import AdaBoostClassifier

# Fixed seed keeps boosting runs reproducible.
estimator = AdaBoostClassifier(random_state=42)

# Hyperparameter search space. Preprocessing parameters (e.g.
# 'preprocessor__num__imputer__strategy') can be appended here if the
# pipeline's preprocessor should be tuned as well.
param_grid = {
    'model__n_estimators': [100],
    'model__learning_rate': [0.5, 1.0],
    'model__algorithm': ['SAMME'],
}

default_scoring = 'accuracy'
|
models/supervised/classification/catboost_classifier.py
ADDED
|
@@ -0,0 +1,28 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
"""CatBoost Classifier configuration.

Exposes the objects expected by the training pipeline:
- ``estimator``: a silenced ``CatBoostClassifier`` with a fixed seed.
- ``param_grid``: GridSearchCV search space (``model__``-prefixed keys).
- ``default_scoring``: fallback evaluation metric.

CatBoost can consume categorical features natively, but here the
pipeline's own encoding is relied upon instead. Works for binary and
multi-class problems. Requires the ``catboost`` package.
"""

from catboost import CatBoostClassifier

# verbose=0 suppresses per-iteration logging; fixed seed for reproducibility.
estimator = CatBoostClassifier(verbose=0, random_state=42)

# Search space; preprocessing parameters (e.g.
# 'preprocessor__num__imputer__strategy') may be appended here if needed.
param_grid = {
    'model__iterations': [100],
    'model__depth': [3, 5],
    'model__learning_rate': [0.01, 0.1],
}

default_scoring = 'accuracy'
|
models/supervised/classification/decision_tree_classifier.py
ADDED
|
@@ -0,0 +1,31 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
"""Decision Tree Classifier configuration.

Exposes the objects expected by the training pipeline:
- ``estimator``: a ``DecisionTreeClassifier`` with a fixed random seed.
- ``param_grid``: GridSearchCV search space (``model__``-prefixed keys)
  covering split criterion, depth, and leaf-size constraints — the
  parameters that most directly control over/under-fitting.
- ``default_scoring``: fallback evaluation metric.

Suitable for both binary and multi-class classification. Categorical
encoding is decided by the pipeline code, not here.
"""

from sklearn.tree import DecisionTreeClassifier

# Fixed seed makes tie-breaking in splits reproducible.
estimator = DecisionTreeClassifier(random_state=42)

# Search space; preprocessing parameters (e.g.
# 'preprocessor__num__imputer__strategy') may be appended here if needed.
param_grid = {
    'model__criterion': ['gini', 'entropy'],
    'model__max_depth': [None, 5, 10],
    'model__min_samples_split': [2, 5],
    'model__min_samples_leaf': [1, 2],
}

default_scoring = 'accuracy'
|
models/supervised/classification/extra_trees_classifier.py
ADDED
|
@@ -0,0 +1,26 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
"""Extra Trees Classifier configuration.

Exposes the objects expected by the training pipeline:
- ``estimator``: an ``ExtraTreesClassifier`` with a fixed random seed.
- ``param_grid``: GridSearchCV search space (``model__``-prefixed keys).
- ``default_scoring``: fallback evaluation metric.

Extra Trees resembles Random Forest but randomizes split thresholds as
well as features; handles binary and multi-class targets.
"""

from sklearn.ensemble import ExtraTreesClassifier

# Fixed seed keeps the extra randomness in splits reproducible.
estimator = ExtraTreesClassifier(random_state=42)

# Search space; preprocessing parameters (e.g.
# 'preprocessor__num__imputer__strategy') may be appended here if needed.
param_grid = {
    'model__n_estimators': [100],
    'model__max_depth': [None, 10],
    'model__min_samples_split': [2, 5],
    'model__min_samples_leaf': [1],
}

default_scoring = 'accuracy'
|
models/supervised/classification/gaussian_nb.py
ADDED
|
@@ -0,0 +1,26 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
"""Gaussian Naive Bayes Classifier configuration.

Exposes the objects expected by the training pipeline:
- ``estimator``: a plain ``GaussianNB`` (no seed needed; the model is
  deterministic).
- ``param_grid``: GridSearchCV search space; ``var_smoothing`` is
  typically the only parameter worth tuning for this model.
- ``default_scoring``: fallback evaluation metric.

Works for binary and multi-class classification.
"""

from sklearn.naive_bayes import GaussianNB

estimator = GaussianNB()

# Search over several orders of magnitude of variance smoothing.
# Preprocessing parameters (e.g. 'preprocessor__num__imputer__strategy')
# may be appended here if needed.
param_grid = {
    'model__var_smoothing': [1e-1, 1e-3, 1e-5, 1e-7, 1e-9],
}

default_scoring = 'accuracy'
|
models/supervised/classification/gradient_boosting_classifier.py
ADDED
|
@@ -0,0 +1,25 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
"""Gradient Boosting Classifier configuration.

Exposes the objects expected by the training pipeline:
- ``estimator``: a ``GradientBoostingClassifier`` with a fixed seed.
- ``param_grid``: GridSearchCV search space (``model__``-prefixed keys).
- ``default_scoring``: fallback evaluation metric.

Handles binary and multi-class classification.
"""

from sklearn.ensemble import GradientBoostingClassifier

# Fixed seed keeps subsampling/boosting reproducible.
estimator = GradientBoostingClassifier(random_state=42)

# Search space; preprocessing parameters (e.g.
# 'preprocessor__num__imputer__strategy') may be appended here if needed.
param_grid = {
    'model__n_estimators': [100],
    'model__learning_rate': [0.01, 0.1],
    'model__max_depth': [3],
}

default_scoring = 'accuracy'
|
models/supervised/classification/knn_classifier.py
ADDED
|
@@ -0,0 +1,28 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
"""K-Nearest Neighbors Classifier configuration.

Exposes the objects expected by the training pipeline:
- ``estimator``: a ``KNeighborsClassifier`` with library defaults.
- ``param_grid``: a deliberately small GridSearchCV search space
  (two neighbor counts, uniform weights, Euclidean distance) chosen to
  keep tuning fast.
- ``default_scoring``: fallback evaluation metric.

Works for binary and multi-class targets. Distance-based, so feature
scaling in the surrounding pipeline matters.
"""

from sklearn.neighbors import KNeighborsClassifier

estimator = KNeighborsClassifier()

# Compact search space; preprocessing parameters (e.g.
# 'preprocessor__num__imputer__strategy') may be appended here if needed.
param_grid = {
    'model__n_neighbors': [3, 5],     # two neighbor options only
    'model__weights': ['uniform'],    # single weighting strategy
    'model__p': [2],                  # Euclidean (Minkowski p=2)
}

default_scoring = 'accuracy'
|
models/supervised/classification/lightgbm_classifier.py
ADDED
|
@@ -0,0 +1,27 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
"""LightGBM Classifier configuration.

Exposes the objects expected by the training pipeline:
- ``estimator``: a silenced ``LGBMClassifier`` with a fixed seed.
- ``param_grid``: GridSearchCV search space (``model__``-prefixed keys).
- ``default_scoring``: fallback evaluation metric.

Fast gradient boosting for binary and multi-class tasks. Requires the
``lightgbm`` package.
"""

from lightgbm import LGBMClassifier

# verbose=-1 silences LightGBM's logging; fixed seed for reproducibility.
estimator = LGBMClassifier(verbose=-1, random_state=42)

# Search space; preprocessing parameters (e.g.
# 'preprocessor__num__imputer__strategy') may be appended here if needed.
param_grid = {
    'model__n_estimators': [100],
    'model__num_leaves': [31, 63],
    'model__learning_rate': [0.01, 0.1],
}

default_scoring = 'accuracy'
|
models/supervised/classification/linear_discriminant_analysis.py
ADDED
|
@@ -0,0 +1,28 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
"""Linear Discriminant Analysis (LDA) Classifier configuration.

Exposes the objects expected by the training pipeline:
- ``estimator``: a ``LinearDiscriminantAnalysis`` with library defaults.
- ``param_grid``: GridSearchCV search space over the solver; when
  ``solver='lsqr'`` a ``shrinkage`` parameter could also be tuned.
- ``default_scoring``: fallback evaluation metric.

Works for binary and multi-class classification.
"""

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

estimator = LinearDiscriminantAnalysis()

# Search space; preprocessing parameters (e.g.
# 'preprocessor__num__imputer__strategy') may be appended here if needed.
param_grid = {
    'model__solver': ['svd', 'lsqr'],
}

default_scoring = 'accuracy'
|
models/supervised/classification/logistic_regression.py
ADDED
|
@@ -0,0 +1,36 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
"""Logistic Regression classifier configuration.

Exposes the objects expected by the training pipeline:
- ``estimator``: a ``LogisticRegression`` with library defaults; solver
  and iteration budget are supplied through the grid instead.
- ``param_grid``: GridSearchCV search space (``model__``-prefixed keys)
  over regularization strength ``C``, with the L2 penalty and the
  ``lbfgs`` solver fixed (lbfgs supports only L2/none penalties).
- ``default_scoring``: fallback evaluation metric.

Handles binary and multi-class classification.
"""

from sklearn.linear_model import LogisticRegression

estimator = LogisticRegression()

# Search space; preprocessing parameters (e.g.
# 'preprocessor__num__imputer__strategy') may be appended here if needed.
param_grid = {
    'model__C': [0.01, 0.1, 1.0, 10.0],  # inverse regularization strength
    'model__penalty': ['l2'],            # the penalty lbfgs supports
    'model__solver': ['lbfgs'],          # efficient default solver
    'model__max_iter': [1000],           # generous convergence budget
}

default_scoring = 'accuracy'
|
models/supervised/classification/mlp_classifier.py
ADDED
|
@@ -0,0 +1,31 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
"""Multilayer Perceptron (MLP) Classifier configuration.

Exposes the objects expected by the training pipeline:
- ``estimator``: an ``MLPClassifier`` capped at 200 iterations with a
  fixed seed (raise ``max_iter`` if convergence warnings appear).
- ``param_grid``: a single-point GridSearchCV search space kept small so
  tuning stays cheap — one compact hidden layer, one L2 penalty, one
  initial learning rate.
- ``default_scoring``: fallback evaluation metric.

Works for binary and multi-class classification.
"""

from sklearn.neural_network import MLPClassifier

# Fixed seed makes weight initialization reproducible.
estimator = MLPClassifier(max_iter=200, random_state=42)

# Minimal search space; preprocessing parameters (e.g.
# 'preprocessor__num__imputer__strategy') may be appended here if needed.
param_grid = {
    'model__hidden_layer_sizes': [(50,)],  # one small hidden layer
    'model__alpha': [0.001],               # L2 regularization strength
    'model__learning_rate_init': [0.001],  # typical initial step size
}

default_scoring = 'accuracy'
|
models/supervised/classification/quadratic_discriminant_analysis.py
ADDED
|
@@ -0,0 +1,26 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
"""Quadratic Discriminant Analysis (QDA) Classifier configuration.

Exposes the objects expected by the training pipeline:
- ``estimator``: a ``QuadraticDiscriminantAnalysis`` with defaults.
- ``param_grid``: GridSearchCV search space over ``reg_param``, the
  covariance-regularization knob.
- ``default_scoring``: fallback evaluation metric.

Works for binary and multi-class classification.
"""

from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

estimator = QuadraticDiscriminantAnalysis()

# Search space; preprocessing parameters (e.g.
# 'preprocessor__num__imputer__strategy') may be appended here if needed.
param_grid = {
    'model__reg_param': [0.0, 0.1, 0.5],
}

default_scoring = 'accuracy'
|
models/supervised/classification/random_forest_classifier.py
ADDED
|
@@ -0,0 +1,26 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
"""Random Forest Classifier configuration.

Exposes the objects expected by the training pipeline:
- ``estimator``: a ``RandomForestClassifier`` with a fixed random seed.
- ``param_grid``: GridSearchCV search space (``model__``-prefixed keys).
- ``default_scoring``: fallback evaluation metric.

A solid general-purpose baseline for binary and multi-class tasks.
"""

from sklearn.ensemble import RandomForestClassifier

# Fixed seed keeps bootstrap sampling reproducible.
estimator = RandomForestClassifier(random_state=42)

# Search space; preprocessing parameters (e.g.
# 'preprocessor__num__imputer__strategy') may be appended here if needed.
param_grid = {
    'model__n_estimators': [100],
    'model__max_depth': [None, 10],
    'model__min_samples_split': [2, 5],
    'model__min_samples_leaf': [1],
}

default_scoring = 'accuracy'
|
models/supervised/classification/svc.py
ADDED
|
@@ -0,0 +1,29 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
"""Support Vector Classifier (SVC) configuration.

Exposes the objects expected by the training pipeline:
- ``estimator``: an ``SVC`` with a fixed random seed.
- ``param_grid``: a trimmed GridSearchCV search space — two values of
  ``C``, the linear kernel only, and ``gamma`` pinned to ``'scale'``
  (gamma only matters for non-linear kernels anyway).
- ``default_scoring``: fallback evaluation metric.

Handles binary classification natively; scikit-learn extends it to
multi-class internally.
"""

from sklearn.svm import SVC

# Fixed seed for the probability/shuffling internals.
estimator = SVC(random_state=42)

# Compact search space; preprocessing parameters (e.g.
# 'preprocessor__num__imputer__strategy') may be appended here if needed.
param_grid = {
    'model__C': [0.1, 1.0],        # reduced range
    'model__kernel': ['linear'],   # linear kernel only
    'model__gamma': ['scale'],     # single gamma option
}

default_scoring = 'accuracy'
|
models/supervised/classification/xgboost_classifier.py
ADDED
|
@@ -0,0 +1,27 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
"""XGBoost Classifier configuration.

Exposes the objects expected by the training pipeline:
- ``estimator``: an ``XGBClassifier`` evaluated with log-loss and a
  fixed random seed.
- ``param_grid``: GridSearchCV search space (``model__``-prefixed keys).
- ``default_scoring``: fallback evaluation metric.

Strong performer on binary and multi-class tabular problems. Requires
the ``xgboost`` package.
"""

from xgboost import XGBClassifier

# eval_metric set explicitly to avoid the library's default-metric warning.
estimator = XGBClassifier(eval_metric='logloss', random_state=42)

# Search space; preprocessing parameters (e.g.
# 'preprocessor__num__imputer__strategy') may be appended here if needed.
param_grid = {
    'model__n_estimators': [100],
    'model__max_depth': [3, 5],
    'model__learning_rate': [0.01, 0.1],
}

default_scoring = 'accuracy'
|
requirements.txt
ADDED
|
@@ -0,0 +1,12 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
pandas==2.2.2
|
| 2 |
+
numpy==1.26.4
|
| 3 |
+
matplotlib==3.8.0
|
| 4 |
+
seaborn==0.13.2
|
| 5 |
+
kaggle==1.6.17
|
| 6 |
+
scikit-learn==1.5.2
|
| 7 |
+
catboost==1.2.7
|
| 8 |
+
dask[dataframe]==2024.10.0
|
| 9 |
+
xgboost==2.1.2
|
| 10 |
+
lightgbm==4.5.0
|
| 11 |
+
joblib==1.4.2
|
| 12 |
+
gradio==5.7.1
|
scripts/README.md
ADDED
|
@@ -0,0 +1,99 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Scripts
|
| 2 |
+
|
| 3 |
+
This directory contains executable scripts for training, testing, and other tasks related to model development and evaluation.
|
| 4 |
+
|
| 5 |
+
## Contents
|
| 6 |
+
|
| 7 |
+
- [`train_regression_model.py`](#train_regression_modelpy)
|
| 8 |
+
- [`train_classification_model.py`](#train_classification_modelpy)
|
| 9 |
+
|
| 10 |
+
### `train_regression_model.py`
|
| 11 |
+
|
| 12 |
+
A script for training supervised learning **regression** models using scikit-learn. It handles data loading, preprocessing, optional log transformation, hyperparameter tuning, model evaluation, and saving of models, metrics, and visualizations.
|
| 13 |
+
|
| 14 |
+
#### Features
|
| 15 |
+
|
| 16 |
+
- Supports various regression models defined in `models/supervised/regression`.
|
| 17 |
+
- Performs hyperparameter tuning using grid search cross-validation.
|
| 18 |
+
- Saves trained models and evaluation metrics.
|
| 19 |
+
- Generates visualizations if specified.
|
| 20 |
+
|
| 21 |
+
#### Usage
|
| 22 |
+
|
| 23 |
+
```bash
|
| 24 |
+
python train_regression_model.py --model_module MODEL_MODULE \
|
| 25 |
+
--data_path DATA_PATH/DATA_NAME.csv \
|
| 26 |
+
--target_variable TARGET_VARIABLE [OPTIONS]
|
| 27 |
+
|
| 28 |
+
```
|
| 29 |
+
|
| 30 |
+
- **Required Arguments:**
|
| 31 |
+
- `model_module`: Name of the regression model module to import (e.g., `linear_regression`).
|
| 32 |
+
- `data_path`: Path to the dataset directory, including the data file name.
|
| 33 |
+
- `target_variable`: Name of the target variable.
|
| 34 |
+
|
| 35 |
+
- **Optional Arguments:**
|
| 36 |
+
- `test_size`: Proportion of the dataset to include in the test split (default: `0.2`).
|
| 37 |
+
- `random_state`: Random seed for reproducibility (default: `42`).
|
| 38 |
+
- `log_transform`: Apply log transformation to the target variable (regression only).
|
| 39 |
+
- `cv_folds`: Number of cross-validation folds (default: `5`).
|
| 40 |
+
- `scoring_metric`: Scoring metric for model evaluation.
|
| 41 |
+
- `model_path`: Path to save the trained model.
|
| 42 |
+
- `results_path`: Path to save results and metrics.
|
| 43 |
+
- `visualize`: Generate and save visualizations.
|
| 44 |
+
- `drop_columns`: Comma-separated column names to drop from the dataset.
|
| 45 |
+
|
| 46 |
+
#### Usage Example
|
| 47 |
+
|
| 48 |
+
```bash
|
| 49 |
+
python train_regression_model.py --model_module linear_regression \
|
| 50 |
+
--data_path data/house_prices/train.csv \
|
| 51 |
+
--target_variable SalePrice --drop_columns Id \
|
| 52 |
+
--log_transform --visualize
|
| 53 |
+
```
|
| 54 |
+
|
| 55 |
+
---
|
| 56 |
+
|
| 57 |
+
### `train_classification_model.py`
|
| 58 |
+
|
| 59 |
+
A script for training supervised learning **classification** models using scikit-learn. It handles data loading, preprocessing, hyperparameter tuning (via grid search CV), model evaluation using classification metrics, and saving of models, metrics, and visualizations.
|
| 60 |
+
|
| 61 |
+
#### Features
|
| 62 |
+
|
| 63 |
+
- Supports various classification models defined in `models/supervised/classification`.
|
| 64 |
+
- Performs hyperparameter tuning using grid search cross-validation (via `classification_hyperparameter_tuning`).
|
| 65 |
+
- Saves trained models and evaluation metrics (accuracy, precision, recall, F1).
|
| 66 |
+
- If `visualize` is enabled, it generates a metrics bar chart and a confusion matrix plot.
|
| 67 |
+
|
| 68 |
+
#### Usage
|
| 69 |
+
|
| 70 |
+
```bash
|
| 71 |
+
python train_classification_model.py --model_module MODEL_MODULE \
|
| 72 |
+
--data_path DATA_PATH/DATA_NAME.csv \
|
| 73 |
+
--target_variable TARGET_VARIABLE [OPTIONS]
|
| 74 |
+
|
| 75 |
+
```
|
| 76 |
+
|
| 77 |
+
- **Required Arguments:**
|
| 78 |
+
- `model_module`: Name of the classification model module to import (e.g., `logistic_regression`).
|
| 79 |
+
- `data_path`: Path to the dataset directory, including the data file name.
|
| 80 |
+
- `target_variable`: Name of the target variable (categorical).
|
| 81 |
+
|
| 82 |
+
- **Optional Arguments:**
|
| 83 |
+
- `test_size`: Proportion of the dataset to include in the test split (default: `0.2`).
|
| 84 |
+
- `random_state`: Random seed for reproducibility (default: `42`).
|
| 85 |
+
- `cv_folds`: Number of cross-validation folds (default: `5`).
|
| 86 |
+
- `scoring_metric`: Scoring metric for model evaluation (e.g., `accuracy`, `f1`, `roc_auc`).
|
| 87 |
+
- `model_path`: Path to save the trained model.
|
| 88 |
+
- `results_path`: Path to save results and metrics.
|
| 89 |
+
- `visualize`: Generate and save visualizations.
|
| 90 |
+
- `drop_columns`: Comma-separated column names to drop from the dataset.
|
| 91 |
+
|
| 92 |
+
#### Usage Example
|
| 93 |
+
|
| 94 |
+
```bash
|
| 95 |
+
python train_classification_model.py --model_module logistic_regression \
|
| 96 |
+
--data_path data/adult_income/train.csv \
|
| 97 |
+
--target_variable income_bracket \
|
| 98 |
+
--scoring_metric accuracy --visualize
|
| 99 |
+
```
|
scripts/train_classification_model.py
ADDED
|
@@ -0,0 +1,203 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
|
| 2 |
+
"""
|
| 3 |
+
This script trains classification models using scikit-learn.
|
| 4 |
+
It handles data loading, preprocessing, hyperparameter tuning,
|
| 5 |
+
model evaluation with classification metrics, and saving of models,
|
| 6 |
+
metrics, and visualizations.
|
| 7 |
+
|
| 8 |
+
Usage:
|
| 9 |
+
python train_classification_model.py --model_module MODEL_MODULE --data_path DATA_PATH/DATA_NAME.csv
|
| 10 |
+
--target_variable TARGET_VARIABLE
|
| 11 |
+
|
| 12 |
+
Optional arguments:
|
| 13 |
+
--test_size TEST_SIZE
|
| 14 |
+
--random_state RANDOM_STATE
|
| 15 |
+
--cv_folds CV_FOLDS
|
| 16 |
+
--scoring_metric SCORING_METRIC
|
| 17 |
+
--model_path MODEL_PATH
|
| 18 |
+
--results_path RESULTS_PATH
|
| 19 |
+
--visualize
|
| 20 |
+
--drop_columns COLUMN_NAMES
|
| 21 |
+
|
| 22 |
+
Example:
|
| 23 |
+
python train_classification_model.py --model_module logistic_regression
|
| 24 |
+
--data_path data/adult_income/train.csv
|
| 25 |
+
--target_variable income_bracket --drop_columns Id
|
| 26 |
+
--scoring_metric accuracy --visualize
|
| 27 |
+
"""
|
| 28 |
+
|
| 29 |
+
import os
|
| 30 |
+
import sys
|
| 31 |
+
import argparse
|
| 32 |
+
import importlib
|
| 33 |
+
import pandas as pd
|
| 34 |
+
import numpy as np
|
| 35 |
+
import matplotlib.pyplot as plt
|
| 36 |
+
import seaborn as sns
|
| 37 |
+
from sklearn.model_selection import train_test_split
|
| 38 |
+
from sklearn.metrics import (accuracy_score, precision_score, recall_score, f1_score,
|
| 39 |
+
confusion_matrix, ConfusionMatrixDisplay)
|
| 40 |
+
import joblib
|
| 41 |
+
from timeit import default_timer as timer
|
| 42 |
+
|
| 43 |
+
def main(args):
    """Train and evaluate a classification model end-to-end.

    Loads a CSV dataset, optionally drops columns, label-encodes a
    non-numeric target, runs grid-search hyperparameter tuning via
    ``classification_hyperparameter_tuning``, evaluates on a held-out test
    split (accuracy / precision / recall / F1), and saves the fitted model,
    a metrics CSV and, when requested, visualizations.

    Args:
        args (argparse.Namespace): Parsed command-line arguments; see the
            parser defined under ``if __name__ == "__main__"``.
    """
    # Run from the project root so relative module and data paths resolve.
    project_root = os.path.abspath(os.path.join(os.path.dirname(__file__), ".."))
    os.chdir(project_root)
    sys.path.insert(0, project_root)

    # Import the tuning helper and the requested model module dynamically.
    from utils.supervised_hyperparameter_tuning import classification_hyperparameter_tuning
    model_module = importlib.import_module(
        f"models.supervised.classification.{args.model_module}")

    # Pull the estimator, its parameter grid, and the scoring metric.
    estimator = model_module.estimator
    param_grid = model_module.param_grid
    scoring_metric = args.scoring_metric or getattr(model_module, 'default_scoring', 'accuracy')
    model_name = estimator.__class__.__name__

    # Default output locations derive from the model class name.
    args.model_path = args.model_path or os.path.join('saved_models', model_name)
    args.results_path = args.results_path or os.path.join('results', model_name)
    os.makedirs(args.results_path, exist_ok=True)
    os.makedirs(args.model_path, exist_ok=True)

    # Load the dataset and drop any user-specified columns.
    df = pd.read_csv(args.data_path)
    if args.drop_columns:
        df = df.drop(columns=args.drop_columns.split(','))

    # Split into features and target.
    target_variable = args.target_variable
    X = df.drop(columns=[target_variable])
    y = df[target_variable]

    # A numeric target with many distinct values is probably a regression
    # problem mislabelled as classification -- warn but keep going.
    if np.issubdtype(y.dtype, np.number) and len(np.unique(y)) > 20:
        print(f"Warning: The target variable '{target_variable}' seems to have many unique numeric values. Ensure it's truly a classification problem.")

    # Encode a non-numeric target and persist the encoder so predictions
    # can be mapped back to the original class labels later.
    # (The original `y.dtype == 'object' or not issubdtype(...)` test was
    # redundant: object dtype is already non-numeric.)
    if not np.issubdtype(y.dtype, np.number):
        from sklearn.preprocessing import LabelEncoder
        le = LabelEncoder()
        y = le.fit_transform(y)
        joblib.dump(le, os.path.join(args.model_path, 'label_encoder.pkl'))
        print("LabelEncoder applied to target variable. Classes:", le.classes_)

    # Hold out a test set for the final evaluation.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=args.test_size, random_state=args.random_state)

    # Tune hyperparameters on the training split, timing the search.
    start_time = timer()
    best_model, best_params = classification_hyperparameter_tuning(
        X_train, y_train, estimator, param_grid,
        cv=args.cv_folds, scoring=scoring_metric)
    train_time = timer() - start_time

    # Evaluate on the held-out test set; weighted averages handle
    # multi-class targets, zero_division=0 silences warnings when a class
    # receives no predictions.
    y_pred = best_model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred, average='weighted', zero_division=0)
    recall = recall_score(y_test, y_pred, average='weighted', zero_division=0)
    f1 = f1_score(y_test, y_pred, average='weighted', zero_division=0)

    print(f"\n{model_name} Classification Metrics on Test Set:")
    print(f"- Accuracy: {accuracy:.4f}")
    print(f"- Precision: {precision:.4f}")
    print(f"- Recall: {recall:.4f}")
    print(f"- F1 Score: {f1:.4f}")
    print(f"- Training Time: {train_time:.4f} seconds")

    # Persist the fitted pipeline.
    model_output_path = os.path.join(args.model_path, 'best_model.pkl')
    joblib.dump(best_model, model_output_path)
    print(f"Trained model saved to {model_output_path}")

    # Save metrics to CSV (one row).
    metrics = {
        'Accuracy': [accuracy],
        'Precision': [precision],
        'Recall': [recall],
        'F1 Score': [f1],
        'train_time': [train_time]
    }
    results_df = pd.DataFrame(metrics)
    results_df.to_csv(os.path.join(args.results_path, 'metrics.csv'), index=False)
    print(f"\nMetrics saved to {os.path.join(args.results_path, 'metrics.csv')}")

    if args.visualize:
        # Bar chart of the score metrics. Select the scores explicitly by
        # name instead of filtering/slicing metrics.values(): the previous
        # parallel `metric_names[:-1]` / filtered `metric_values[:-1]`
        # slicing could silently misalign labels and bars.
        score_names = ['Accuracy', 'Precision', 'Recall', 'F1 Score']
        score_values = [metrics[name][0] for name in score_names]
        plt.figure(figsize=(8, 6))
        plt.bar(score_names, score_values, color='skyblue', alpha=0.8)
        plt.ylim(0, 1)
        plt.xlabel('Metrics')
        plt.ylabel('Scores')
        plt.title('Classification Metrics')
        plt.savefig(os.path.join(args.results_path, 'classification_metrics.png'))
        plt.show()
        print(f"Visualization saved to {os.path.join(args.results_path, 'classification_metrics.png')}")

        # Confusion matrix plot; labels are encoded class indices when a
        # LabelEncoder was applied above. (confusion_matrix and
        # ConfusionMatrixDisplay are already imported at module level --
        # the redundant local re-import was removed.)
        conf_matrix = confusion_matrix(y_test, y_pred)
        disp = ConfusionMatrixDisplay(confusion_matrix=conf_matrix)
        disp.plot(cmap=plt.cm.Blues, values_format='d')
        plt.title(f'{model_name} Confusion Matrix')
        conf_matrix_path = os.path.join(args.results_path, 'confusion_matrix.png')
        plt.savefig(conf_matrix_path)
        plt.show()
        print(f"Confusion matrix saved to {conf_matrix_path}")
|
| 172 |
+
|
| 173 |
+
if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Train a classification model.")

    # (flag, add_argument keyword arguments) for every CLI option; declared
    # as one table so the parser setup stays easy to scan and extend.
    cli_options = [
        # Model selection
        ('--model_module', dict(type=str, required=True,
                                help='Name of the classification model module to import.')),
        # Dataset
        ('--data_path', dict(type=str, required=True,
                             help='Path to the dataset file including data name.')),
        ('--target_variable', dict(type=str, required=True,
                                   help='Name of the target variable (categorical).')),
        ('--drop_columns', dict(type=str, default='',
                                help='Columns to drop from the dataset.')),
        # Training / evaluation
        ('--test_size', dict(type=float, default=0.2,
                             help='Proportion for test split.')),
        ('--random_state', dict(type=int, default=42,
                                help='Random seed.')),
        ('--cv_folds', dict(type=int, default=5,
                            help='Number of cross-validation folds.')),
        ('--scoring_metric', dict(type=str, default=None,
                                  help='Scoring metric for model evaluation (e.g., accuracy, f1, roc_auc).')),
        # Outputs
        ('--model_path', dict(type=str, default=None,
                              help='Path to save the trained model.')),
        ('--results_path', dict(type=str, default=None,
                                help='Path to save results and metrics.')),
        ('--visualize', dict(action='store_true',
                             help='Generate and save visualizations (classification metrics chart and confusion matrix).')),
    ]
    for flag, options in cli_options:
        parser.add_argument(flag, **options)

    main(parser.parse_args())
|
utils/README.md
ADDED
|
@@ -0,0 +1,77 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Utils
|
| 2 |
+
|
| 3 |
+
This directory contains utility scripts and helper functions that are used throughout the project. These scripts provide common functionalities such as data preprocessing, hyperparameter tuning, and other support functions that assist in model training and evaluation for both regression and classification tasks.
|
| 4 |
+
|
| 5 |
+
## Contents
|
| 6 |
+
|
| 7 |
+
- [`supervised_hyperparameter_tuning.py`](#supervised_hyperparameter_tuningpy)
|
| 8 |
+
|
| 9 |
+
### `supervised_hyperparameter_tuning.py`
|
| 10 |
+
|
| 11 |
+
This script contains functions for performing hyperparameter tuning on supervised learning models (both regression and classification) using scikit-learn's `Pipeline` and `GridSearchCV`.
|
| 12 |
+
|
| 13 |
+
#### Functions
|
| 14 |
+
|
| 15 |
+
- **`regression_hyperparameter_tuning(X, y, estimator, param_grid, cv=5, scoring=None)`**
|
| 16 |
+
|
| 17 |
+
Performs hyperparameter tuning for regression models.
|
| 18 |
+
|
| 19 |
+
**Parameters:**
|
| 20 |
+
- `X`: Feature matrix (pd.DataFrame).
|
| 21 |
+
- `y`: Numeric target variable (pd.Series).
|
| 22 |
+
- `estimator`: A scikit-learn regressor (e.g., `LinearRegression()`).
|
| 23 |
+
- `param_grid`: Dict with parameter names and lists of values.
|
| 24 |
+
- `cv`: Number of cross-validation folds (default 5).
|
| 25 |
+
- `scoring`: Scoring metric (e.g. 'neg_root_mean_squared_error').
|
| 26 |
+
|
| 27 |
+
**Returns:**
|
| 28 |
+
- `best_model`: Pipeline with best found hyperparameters.
|
| 29 |
+
- `best_params`: Dictionary of best hyperparameters.
|
| 30 |
+
|
| 31 |
+
- **`classification_hyperparameter_tuning(X, y, estimator, param_grid, cv=5, scoring=None)`**
|
| 32 |
+
|
| 33 |
+
Performs hyperparameter tuning for classification models.
|
| 34 |
+
|
| 35 |
+
**Parameters:**
|
| 36 |
+
- `X`: Feature matrix (pd.DataFrame).
|
| 37 |
+
- `y`: Target variable for classification (pd.Series), can be binary or multi-class.
|
| 38 |
+
- `estimator`: A scikit-learn classifier (e.g., `LogisticRegression()`, `RandomForestClassifier()`).
|
| 39 |
+
- `param_grid`: Dict with parameter names and lists of values.
|
| 40 |
+
- `cv`: Number of cross-validation folds (default 5).
|
| 41 |
+
- `scoring`: Scoring metric (e.g. 'accuracy', 'f1', 'f1_macro', 'roc_auc').
|
| 42 |
+
|
| 43 |
+
**Returns:**
|
| 44 |
+
- `best_model`: Pipeline with best found hyperparameters.
|
| 45 |
+
- `best_params`: Dictionary of best hyperparameters.
|
| 46 |
+
|
| 47 |
+
#### Usage Examples
|
| 48 |
+
|
| 49 |
+
**Regression Example:**
|
| 50 |
+
```python
|
| 51 |
+
from utils.supervised_hyperparameter_tuning import regression_hyperparameter_tuning
|
| 52 |
+
from sklearn.linear_model import LinearRegression
|
| 53 |
+
|
| 54 |
+
X = ... # Your regression features
|
| 55 |
+
y = ... # Your numeric target variable
|
| 56 |
+
param_grid = {
|
| 57 |
+
'model__fit_intercept': [True, False]
|
| 58 |
+
# Add other parameters if needed
|
| 59 |
+
}
|
| 60 |
+
|
| 61 |
+
best_model, best_params = regression_hyperparameter_tuning(X, y, LinearRegression(), param_grid, scoring='neg_root_mean_squared_error')
|
| 62 |
+
```
|
| 63 |
+
|
| 64 |
+
**Classification Example (Binary or Multi-Class):**
|
| 65 |
+
```python
|
| 66 |
+
from utils.supervised_hyperparameter_tuning import classification_hyperparameter_tuning
|
| 67 |
+
from sklearn.ensemble import RandomForestClassifier
|
| 68 |
+
|
| 69 |
+
X = ... # Your classification features
|
| 70 |
+
y = ... # Your categorical target variable (binary or multi-class)
|
| 71 |
+
param_grid = {
|
| 72 |
+
'model__n_estimators': [100, 200],
|
| 73 |
+
'model__max_depth': [None, 10]
|
| 74 |
+
}
|
| 75 |
+
|
| 76 |
+
best_model, best_params = classification_hyperparameter_tuning(X, y, RandomForestClassifier(), param_grid, scoring='accuracy')
|
| 77 |
+
```
|
utils/supervised_hyperparameter_tuning.py
ADDED
|
@@ -0,0 +1,213 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
|
| 2 |
+
"""
|
| 3 |
+
This module provides functions for hyperparameter tuning with preprocessing using scikit-learn's GridSearchCV
|
| 4 |
+
for both regression and classification tasks.
|
| 5 |
+
|
| 6 |
+
Features:
|
| 7 |
+
- Handles numerical and categorical preprocessing using pipelines.
|
| 8 |
+
- Automates hyperparameter tuning for any scikit-learn estimator.
|
| 9 |
+
- Uses GridSearchCV for cross-validation and hyperparameter search.
|
| 10 |
+
- Applies algorithm-specific preprocessing when necessary (e.g., ordinal encoding for tree-based models).
|
| 11 |
+
|
| 12 |
+
Functions:
|
| 13 |
+
- regression_hyperparameter_tuning: For regression models.
|
| 14 |
+
- classification_hyperparameter_tuning: For classification models.
|
| 15 |
+
|
| 16 |
+
Example Usage (Regression):
|
| 17 |
+
from sklearn.ensemble import RandomForestRegressor
|
| 18 |
+
from supervised_hyperparameter_tuning import regression_hyperparameter_tuning
|
| 19 |
+
|
| 20 |
+
X = ... # Your feature DataFrame
|
| 21 |
+
y = ... # Your numeric target variable
|
| 22 |
+
param_grid = {
|
| 23 |
+
'model__n_estimators': [100, 200],
|
| 24 |
+
'model__max_depth': [None, 10]
|
| 25 |
+
}
|
| 26 |
+
best_model, best_params = regression_hyperparameter_tuning(X, y, RandomForestRegressor(), param_grid, cv=5, scoring='neg_mean_squared_error')
|
| 27 |
+
|
| 28 |
+
Example Usage (Classification):
|
| 29 |
+
from sklearn.ensemble import RandomForestClassifier
|
| 30 |
+
from supervised_hyperparameter_tuning import classification_hyperparameter_tuning
|
| 31 |
+
|
| 32 |
+
X = ... # Your feature DataFrame
|
| 33 |
+
y = ... # Your target variable (categorical)
|
| 34 |
+
param_grid = {
|
| 35 |
+
'model__n_estimators': [100, 200],
|
| 36 |
+
'model__max_depth': [None, 10]
|
| 37 |
+
}
|
| 38 |
+
best_model, best_params = classification_hyperparameter_tuning(X, y, RandomForestClassifier(), param_grid, cv=5, scoring='accuracy')
|
| 39 |
+
"""
|
| 40 |
+
|
| 41 |
+
from sklearn.compose import ColumnTransformer
|
| 42 |
+
from sklearn.impute import SimpleImputer
|
| 43 |
+
from sklearn.pipeline import Pipeline
|
| 44 |
+
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
|
| 45 |
+
from sklearn.model_selection import GridSearchCV, KFold
|
| 46 |
+
|
| 47 |
+
def regression_hyperparameter_tuning(X, y, estimator, param_grid, cv=5, scoring=None):
    """Tune a regression model with GridSearchCV behind a preprocessing pipeline.

    Numeric features are median-imputed and standardized; categorical
    features are constant-imputed and encoded (ordinal codes for tree-based
    regressors, one-hot otherwise). The preprocessing and the estimator are
    combined in a single Pipeline so the grid search cross-validates the
    whole workflow.

    Args:
        X (pd.DataFrame): Features.
        y (pd.Series): Numeric target variable.
        estimator: The scikit-learn regressor to use (e.g., LinearRegression(), RandomForestRegressor()).
        param_grid (dict): Hyperparameter grid for GridSearchCV (keys use the 'model__' prefix).
        cv (int or cross-validation generator): Number of folds or a CV generator.
        scoring (str or None): Scoring metric to use.

    Returns:
        best_model (Pipeline): Best pipeline found by the grid search.
        best_params (dict): Best hyperparameters.
    """
    estimator_name = estimator.__class__.__name__

    # Partition the columns into numeric and categorical features.
    num_features = list(X.select_dtypes(include=['int64', 'float64']).columns)
    cat_features = list(X.select_dtypes(include=['object', 'category']).columns)

    # Numeric branch: fill gaps with the median, then standardize.
    num_pipe = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler()),
    ])

    # Tree models cope with ordinal codes and avoid the one-hot column
    # blow-up; every other estimator gets one-hot encoding.
    tree_regressors = {
        'DecisionTreeRegressor', 'RandomForestRegressor', 'ExtraTreesRegressor',
        'GradientBoostingRegressor', 'XGBRegressor', 'LGBMRegressor', 'CatBoostRegressor',
    }
    if estimator_name in tree_regressors:
        cat_steps = [
            ('imputer', SimpleImputer(strategy='constant', fill_value='Missing')),
            ('ordinal_encoder', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)),
        ]
    else:
        cat_steps = [
            ('imputer', SimpleImputer(strategy='constant', fill_value='Missing')),
            ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False)),
        ]
    cat_pipe = Pipeline(steps=cat_steps)

    # One pipeline: column-wise preprocessing followed by the estimator.
    full_pipeline = Pipeline(steps=[
        ('preprocessor', ColumnTransformer(transformers=[
            ('num', num_pipe, num_features),
            ('cat', cat_pipe, cat_features),
        ])),
        ('model', estimator),
    ])

    # An integer cv becomes a shuffled, seeded KFold for reproducibility.
    if isinstance(cv, int):
        cv = KFold(n_splits=cv, shuffle=True, random_state=42)

    search = GridSearchCV(
        estimator=full_pipeline,
        param_grid=param_grid,
        cv=cv,
        scoring=scoring,
        n_jobs=-1,
    )
    search.fit(X, y)

    print(f"Best Hyperparameters for {estimator_name}:")
    for param_name in sorted(search.best_params_.keys()):
        print(f"{param_name}: {search.best_params_[param_name]}")

    return search.best_estimator_, search.best_params_
|
| 131 |
+
|
| 132 |
+
def classification_hyperparameter_tuning(X, y, estimator, param_grid, cv=5, scoring=None):
    """Hyperparameter-tune a classifier with GridSearchCV and preprocessing.

    Builds a Pipeline that imputes/scales numerical features and encodes
    categorical features (ordinal encoding for tree-based classifiers,
    one-hot otherwise), then grid-searches ``param_grid`` over stratified
    cross-validation folds. Handles both binary and multi-class targets.

    Args:
        X (pd.DataFrame): Features.
        y (pd.Series): Target variable (categorical) for classification (binary or multi-class).
        estimator: The scikit-learn classifier to use (e.g., LogisticRegression(), RandomForestClassifier()).
        param_grid (dict): Hyperparameter grid for GridSearchCV (keys use the 'model__' prefix).
        cv (int or cross-validation generator): Number of cross-validation folds or a CV generator.
        scoring (str or None): Scoring metric (e.g., 'accuracy' for binary or multi-class, 'f1_macro' for multi-class).

    Returns:
        best_model (Pipeline): Best model within a pipeline from GridSearch (refit on all data).
        best_params (dict): Best hyperparameters.
    """
    # Local import keeps the module's top-level imports unchanged.
    from sklearn.model_selection import StratifiedKFold

    # Identify numerical and categorical columns.
    numerical_cols = X.select_dtypes(include=['int64', 'float64']).columns.tolist()
    categorical_cols = X.select_dtypes(include=['object', 'category']).columns.tolist()

    # Numerical features: median imputation + standardization.
    numerical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler())
    ])

    # Tree-based models handle ordinal codes fine and avoid the column
    # blow-up of one-hot encoding; everything else gets one-hot.
    estimator_name = estimator.__class__.__name__
    tree_based_classifiers = [
        'DecisionTreeClassifier', 'RandomForestClassifier', 'ExtraTreesClassifier',
        'GradientBoostingClassifier', 'XGBClassifier', 'LGBMClassifier', 'CatBoostClassifier'
    ]
    if estimator_name in tree_based_classifiers:
        categorical_transformer = Pipeline(steps=[
            ('imputer', SimpleImputer(strategy='constant', fill_value='Missing')),
            ('ordinal_encoder', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1))
        ])
    else:
        categorical_transformer = Pipeline(steps=[
            ('imputer', SimpleImputer(strategy='constant', fill_value='Missing')),
            ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
        ])

    # Column-wise preprocessing.
    preprocessor = ColumnTransformer(transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])

    # Combine preprocessing and estimator so CV tunes the whole workflow.
    pipeline = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('model', estimator)
    ])

    # Use *stratified* folds for classification so every fold preserves the
    # class distribution. Plain shuffled KFold (used previously) can produce
    # folds missing rare classes entirely on imbalanced data, which skews
    # fold scores; StratifiedKFold is scikit-learn's own default fold
    # strategy for classifiers.
    if isinstance(cv, int):
        cv = StratifiedKFold(n_splits=cv, shuffle=True, random_state=42)

    # GridSearchCV for classification.
    grid_search = GridSearchCV(
        estimator=pipeline,
        param_grid=param_grid,
        cv=cv,
        scoring=scoring,
        n_jobs=-1
    )

    grid_search.fit(X, y)
    best_model = grid_search.best_estimator_
    best_params = grid_search.best_params_

    print(f"Best Hyperparameters for {estimator_name}:")
    for param_name in sorted(best_params.keys()):
        print(f"{param_name}: {best_params[param_name]}")

    return best_model, best_params
|