PrazNeuro committed
Commit e386fee · verified · 1 Parent(s): a590047

Upload 8 files
.gitignore ADDED
@@ -0,0 +1,21 @@
+ # Python
+ __pycache__/
+ *.py[cod]
+ *.pyo
+ .venv/
+ .env
+
+ # Data, models, caches
+ models_*/
+ models*/
+ cache_dir/
+ *.joblib
+ *.pkl
+
+ # Logs
+ *.log
+
+ # IDE
+ .vscode/
+ .idea/
+
CONTRIBUTING.md ADDED
@@ -0,0 +1,7 @@
+ Contribution guidelines
+
+ - Fork the repo and open a pull request.
+ - Add tests for new functionality.
+ - Keep functions small and well-documented.
+ - Use the existing coding style.
+
LICENSE ADDED
@@ -0,0 +1,13 @@
+ MIT License
+
+ Copyright (c) 2025
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ [...standard MIT license continues...]
+
README.md CHANGED
@@ -1,17 +1,69 @@
- ---
- license: apache-2.0
- language:
- - en
- metrics:
- - precision
- - accuracy
- pipeline_tag: image-classification
- tags:
- - biology
- - cancer
- - glioblastoma
- - radiomics
- - transcriptomics
- - immune
- - stratification
- ---
+ Project: PRECISE-GBM - Model training & retraining helpers
+
+ Overview
+
+ This repository contains code to train models (Gaussian Mixture labelling + SVM and ensemble classifiers) and to persist all artifacts required to reproduce or retrain models on new data. It includes:
+
+ - `Scenario_heldout_final_PRECISE.py` — training pipeline producing `.joblib` models and metadata JSONs (selected features, best params, CV results).
+ - `retrain_helper.py` — CLI utility to rebuild pipelines, set best params and retrain using saved selected-features and params JSONs. Supports JSON/YAML config files and auto-detection of model type.
+ - `README_RETRAIN.md` — detailed retrain examples and a notebook cell.
+
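The Gaussian Mixture labelling step mentioned in the overview can be sketched roughly as below. This is an illustrative reconstruction with invented variable names (`scores`, `hi_component`), not the exact code from `Scenario_heldout_final_PRECISE.py`; the script similarly fits a 2-component GMM, inspects `gmm.means_`, and orients the binary labels by component mean.

```python
# Illustrative sketch of two-component GMM labelling (hypothetical names;
# the real logic lives in Scenario_heldout_final_PRECISE.py).
import numpy as np
from sklearn.mixture import GaussianMixture

# synthetic 1-D immune-signature scores: a low and a high group
rng = np.random.default_rng(42)
scores = np.concatenate([rng.normal(0.0, 0.5, 50),
                         rng.normal(10.0, 0.5, 50)]).reshape(-1, 1)

# fit a 2-component GMM and label the high-mean component as class 1
gmm = GaussianMixture(n_components=2, random_state=42).fit(scores)
hi_component = int(gmm.means_.flatten().argmax())
labels = (gmm.predict(scores) == hi_component).astype(int)

print('component means:', gmm.means_.flatten())
print('fraction of high-score samples labelled 1:', labels[50:].mean())
```

Orienting labels by the component means (rather than trusting the arbitrary component order) is what makes the labelling reproducible across fits.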
+ This repo also includes helper files to make it ready for GitHub:
+ - `requirements.txt` — Python dependencies
+ - `.gitignore` — recommended ignores (models, caches, logs)
+ - `LICENSE` — MIT license
+ - GitHub Actions workflow for CI (pytest smoke test)
+
+ Getting started (Windows PowerShell)
+
+ 1) Create and activate a virtual environment
+
+ ```powershell
+ python -m venv .venv
+ .\.venv\Scripts\Activate.ps1
+ ```
+
+ 2) Install dependencies
+
+ ```powershell
+ pip install --upgrade pip
+ pip install -r requirements.txt
+ ```
+
+ 3) Run training (note: the training script reads data from absolute paths configured in the script — adjust them or run from an environment where those files are present)
+
+ ```powershell
+ python Scenario_heldout_final_PRECISE.py
+ ```
+
+ The training script will create model files under `models_LM22/` and `models_GBM/` and write metadata JSONs next to each joblib model (selected features, params, cv results) as well as group-level JSON summaries.
+
+ Retraining
+
+ See `README_RETRAIN.md` for detailed CLI and notebook examples. Short example (PowerShell uses the backtick for line continuation):
+
+ ```powershell
+ python retrain_helper.py `
+   --model-prefix "models_GBM/scenario_1/GBM_scen1_Tcell" `
+   --train-csv "data\new_train.csv" `
+   --label-col "label"
+ ```
+
+ Notes
+
+ - The training script contains hard-coded absolute paths to data files. Before running on another machine, update the `scenarios_*` file paths or place the datasets in the same paths.
+ - The retrain helper auto-detects the model type when `--model-type` is omitted by looking for `{prefix}_svm_params.json` or `{prefix}_ens_params.json`.
+ - YAML config support for retrain requires PyYAML (`pip install pyyaml`).
+
+ CI
+
+ A basic GitHub Actions workflow runs a smoke pytest to ensure the retrain helper imports and basic pipeline construction works. It does not run heavy training.
+
+ Contributing
+
+ See `CONTRIBUTING.md` for guidance on opening issues and PRs.
+
+ License
+
+ This project is released under the MIT License — see `LICENSE`.
+
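As a quick sanity check after training, you can count the artifacts the script is expected to produce. The glob patterns below assume the `models_GBM/scenario_*` layout created by `Scenario_heldout_final_PRECISE.py`; `list_artifacts` is a hypothetical helper written for this check, not part of the repo.

```python
# Post-training sanity check: count saved joblib models and metadata JSONs
# under a models directory (layout assumed from the training script).
import glob
import os

def list_artifacts(root='models_GBM'):
    """Return (model_files, metadata_files) found under a models directory."""
    models = sorted(glob.glob(os.path.join(root, 'scenario_*', '*.joblib')))
    metas = sorted(glob.glob(os.path.join(root, 'scenario_*', '*.json')))
    return models, metas

models, metas = list_artifacts('models_GBM')
print(f'{len(models)} joblib models, {len(metas)} metadata JSONs')
```

Every `.joblib` should have metadata JSONs next to it; a mismatch usually means a scenario errored out mid-run (check the log file).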
README_RETRAIN.md ADDED
@@ -0,0 +1,110 @@
+ Retrain helper — quick start
+
+ This file shows a minimal end-to-end example (CLI and notebook cell) to retrain a saved model using the `retrain_helper.py` utility.
+
+ 1) Quick CLI example
+
+ From the project root (PowerShell; the backtick is the line-continuation character):
+
+ ```powershell
+ # Retrain SVM using explicit CLI args
+ python retrain_helper.py `
+   --model-prefix "models_GBM/scenario_1/GBM_scen1_Tcell" `
+   --model-type svm `
+   --train-csv "data\new_train.csv" `
+   --label-col "label"
+
+ # Or let the helper auto-detect the model type (it looks for *_svm_params.json or *_ens_params.json)
+ python retrain_helper.py `
+   --model-prefix "models_GBM/scenario_1/GBM_scen1_Tcell" `
+   --train-csv "data\new_train.csv" `
+   --label-col "label"
+
+ # Using a JSON config file (CLI args override config values)
+ python retrain_helper.py --config retrain_config.json
+ ```
+
+ 2) Example config files
+
+ JSON (retrain_config.json):
+
+ ```json
+ {
+   "model_prefix": "models_GBM/scenario_1/GBM_scen1_Tcell",
+   "train_csv": "data/new_train.csv",
+   "label_col": "label",
+   "out_dir": "models_GBM/scenario_1/retrained",
+   "overwrite": false
+ }
+ ```
+
+ YAML (retrain_config.yml):
+
+ ```yaml
+ model_prefix: models_GBM/scenario_1/GBM_scen1_Tcell
+ train_csv: data/new_train.csv
+ label_col: label
+ out_dir: models_GBM/scenario_1/retrained
+ overwrite: false
+ ```
+
+ 3) Notebook / Jupyter cell example (end-to-end)
+
+ This cell shows minimal steps to (A) run the CLI retrain helper using Python's subprocess, then (B) load the retrained model and run a quick prediction.
+
+ ```python
+ # Notebook cell (Jupyter / Colab / Kaggle) - Python
+ import glob
+ import json
+ import os
+ import subprocess
+
+ import pandas as pd
+ from joblib import load
+
+ # 1) Run retrain (creates a timestamped retrained model and metadata JSON)
+ cmd = [
+     "python", "retrain_helper.py",
+     "--model-prefix", "models_GBM/scenario_1/GBM_scen1_Tcell",
+     "--train-csv", "data/new_train.csv",
+     "--label-col", "label"
+ ]
+ print('Running:', ' '.join(cmd))
+ subprocess.check_call(cmd)
+
+ # 2) Locate the retrain metadata files next to the model prefix
+ #    (suffix _retrain_meta_YYYYMMDD_HHMMSS.json) and load the latest one
+ #    to find the retrained model path.
+ meta_files = sorted(glob.glob(os.path.join(
+     'models_GBM', 'scenario_1', 'GBM_scen1_Tcell*_retrain_meta_*.json')))
+ print('Found meta files:', meta_files[-3:])
+
+ with open(meta_files[-1]) as f:
+     meta = json.load(f)
+ model_path = meta['model_file']
+ print('Retrained model path:', model_path)
+
+ # 3) Load the retrained model and perform a smoke prediction
+ pipe = load(model_path)
+ df = pd.read_csv('data/new_train.csv', index_col=0)
+ with open('models_GBM/scenario_1/GBM_scen1_Tcell_selected_features.json') as f:
+     sel_meta = json.load(f)
+ selected_features = sel_meta.get('selected_features', sel_meta)
+ X = df[selected_features]
+ print('Predict shape', X.shape)
+ probs = pipe.predict_proba(X)[:5]
+ print('Example probs (first 5 rows):', probs)
+ ```
+
+ 4) Notes & troubleshooting
+
+ - The retrain helper expects the following files to exist next to the model prefix:
+   - `{prefix}_selected_features.json` — produced by the training script; contains the `selected_features` list inside the metadata
+   - `{prefix}_svm_params.json` or `{prefix}_ens_params.json` — best-params metadata
+
+ - If parameter keys don't map to the pipeline built in `retrain_helper.py`, `pipe.set_params(**best_params)` may raise; in that case the script prints a warning and fits the pipeline with default parameter values.
+
+ - If you want to continue training from a saved estimator object instead of rebuilding the pipeline, modify the helper to `load` the .joblib file and call `.fit()` on it.
+
+ - YAML support requires PyYAML (`pip install pyyaml`).
+
+ 5) Example minimal workflow to add to your notebook
+
+ - Run your `Scenario_heldout_final_PRECISE.py` training script to produce models and metadata.
+ - Prepare a CSV of new training data with the same column names as the original radiomics/immune features (index column required).
+ - Use `retrain_helper.py` through the CLI or config to retrain.
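The auto-detection rule described above is simple enough to reproduce standalone. This sketch mirrors the behaviour documented for `retrain_helper.py` (prefer SVM when both params files exist); `detect_model_type` is a name invented for this sketch, not the helper's actual function.

```python
# Standalone sketch of the model-type auto-detection rule: look for
# {prefix}_svm_params.json / {prefix}_ens_params.json next to the prefix.
import os

def detect_model_type(model_prefix):
    """Return 'svm' or 'ens'; 'svm' wins the tie when both files exist."""
    svm_exists = os.path.exists(model_prefix + '_svm_params.json')
    ens_exists = os.path.exists(model_prefix + '_ens_params.json')
    if svm_exists:
        return 'svm'  # also the tie-breaker when both params files exist
    if ens_exists:
        return 'ens'
    raise FileNotFoundError(f'No params JSON found next to {model_prefix}')

# e.g. detect_model_type('models_GBM/scenario_1/GBM_scen1_Tcell')
```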
Scenario_heldout_final_PRECISE.py CHANGED
@@ -21,13 +21,13 @@ from sklearn.metrics import (
      accuracy_score, precision_score, recall_score,
      f1_score, balanced_accuracy_score, matthews_corrcoef
  )
- from joblib import dump
+ from joblib import Memory, dump
 
  # -------------------------
  # Logging & warnings
  # -------------------------
  logging.basicConfig(
-     filename='nested_lodo_groupsv1.log',
+     filename='nested_lodo_groups.log',
      level=logging.INFO,
      format='%(asctime)s - %(levelname)s - %(message)s'
  )
@@ -35,22 +35,20 @@ warnings.filterwarnings('ignore', category=UserWarning)
  warnings.filterwarnings('ignore', category=ConvergenceWarning)
 
  # Create directories for saving models if they don't exist
- os.makedirs('models_GBMv1/scenario_1', exist_ok=True)
- os.makedirs('models_GBMv1/scenario_2', exist_ok=True)
- os.makedirs('models_GBMv1/scenario_3', exist_ok=True)
- os.makedirs('models_LM22v1/scenario_1', exist_ok=True)
- os.makedirs('models_LM22v1/scenario_2', exist_ok=True)
- os.makedirs('models_LM22v1/scenario_3', exist_ok=True)
+ os.makedirs('models_GBM/scenario_1', exist_ok=True)
+ os.makedirs('models_GBM/scenario_2', exist_ok=True)
+ os.makedirs('models_GBM/scenario_3', exist_ok=True)
+ os.makedirs('models_LM22/scenario_1', exist_ok=True)
+ os.makedirs('models_LM22/scenario_2', exist_ok=True)
+ os.makedirs('models_LM22/scenario_3', exist_ok=True)
 
  # -------------------------
  # Caching for pipelines
  # -------------------------
- # Joblib.Memory cache disabled to avoid creating cache directories and
- # PermissionError race conditions on Windows when using parallel workers.
- memory = None
- logging.info("Joblib Memory disabled; no pipeline caching will be used")
+ memory = Memory(location='cache_dir', verbose=0)
 
  # Helper: convert numpy scalars/arrays and dicts into JSON-serializable Python types
+ import numpy as _np
 
  def _convert_obj(o):
      """Recursively convert numpy types/arrays to native Python objects for JSON dumping."""
@@ -67,7 +65,7 @@ def _convert_obj(o):
      if isinstance(o, (list, tuple)):
          return [_convert_obj(v) for v in o]
      # numpy scalar -> python native
-     if isinstance(o, (np.integer, np.floating, np.bool_)):
+     if isinstance(o, (_np.integer, _np.floating, _np.bool_)):
          return o.item()
      # otherwise return as-is
      return o
@@ -88,7 +86,7 @@ def _cv_results_to_serializable(cv_dict):
  # -------------------------
  # Utility: two-step Lasso selection
  # -------------------------
- def select_features(X, y, alphas=(0.1, 0.01), cv=5, max_iter=10000, n_jobs=1, random_state=42):
+ def select_features(X, y, alphas=(0.1, 0.01), cv=5, max_iter=10000, n_jobs=-1, random_state=42):
      for alpha in alphas:
          lasso = LassoCV(
              alphas=[alpha], cv=cv,
@@ -217,7 +215,6 @@ for sig_name, scenarios in signature_groups.items():
      y_tr = 1 - y_tr; y_ho = 1 - y_ho
      # save gmm model
      gmm_model_path = f'models_{sig_name}/scenario_{scen_id}/{sig_name}_scen{scen_id}_{col}_gmm_model.joblib'
-     os.makedirs(os.path.dirname(gmm_model_path), exist_ok=True)
      dump(gmm, gmm_model_path)
      logging.info(f"Saved GMM model to {gmm_model_path}")
      logging.info(f"GMM means for {sig_name}:{scen_id}, col {col}: {gmm.means_.flatten().tolist()}")
@@ -239,19 +236,14 @@ for sig_name, scenarios in signature_groups.items():
      json.dump(meta, _f, indent=2)
 
      # SVM nested CV
-     # Avoid using joblib.Memory at the Pipeline level when running parallel CV (n_jobs != 1).
-     # Joblib's Memory can hit race conditions on Windows when multiple workers try to
-     # read/write the same cache files which leads to PermissionError (output.pkl).
-     # We therefore disable pipeline caching here (memory=None). This does NOT affect
-     # saving final models or params (those are written explicitly with dump/json below).
      pipe_svm = Pipeline([
          ('scaler', StandardScaler()),
          ('clf', SVC(class_weight='balanced', probability=True, random_state=42))
-     ], memory=None)
+     ], memory=memory)
      search_svm = RandomizedSearchCV(
          pipe_svm, param_dist_svm, n_iter=5,
          cv=inner_cv, scoring='balanced_accuracy',
-         n_jobs=1, refit=True, error_score='raise'
+         n_jobs=-1, refit=True, error_score='raise'
      )
      search_svm.fit(X_tr_sel, y_tr)
      y_pred_svm = search_svm.predict(X_ho_sel)
@@ -259,7 +251,6 @@ for sig_name, scenarios in signature_groups.items():
          for k, v in search_svm.cv_results_.items()}
      # save SVM model
      svm_model_path = f'models_{sig_name}/scenario_{scen_id}/{sig_name}_scen{scen_id}_{col}_svm_model.joblib'
-     os.makedirs(os.path.dirname(svm_model_path), exist_ok=True)
      dump(search_svm.best_estimator_, svm_model_path)
      logging.info(f"Saved SVM model to {svm_model_path}")
      logging.info(f"SVM best params for {sig_name}:{scen_id}, col {col}: {search_svm.best_params_}")
@@ -287,20 +278,20 @@ for sig_name, scenarios in signature_groups.items():
      base_pipe = Pipeline([
          ('scaler', StandardScaler()),
          ('classifier', SVC(class_weight='balanced', probability=True, random_state=42))
-     ], memory=None)
+     ], memory=memory)
      ensemble = VotingClassifier([
          ('svm', base_pipe),
          ('rf', RandomForestClassifier(class_weight='balanced', random_state=42)),
          ('gb', HistGradientBoostingClassifier(random_state=42))
-     ], voting='soft', weights=[1,1,1], n_jobs=1)
+     ], voting='soft', weights=[1,1,1], n_jobs=-1)
      pipe_ens = Pipeline([
          ('scaler', StandardScaler()),
          ('ensemble', ensemble)
-     ], memory=None)
+     ], memory=memory)
      search_ens = RandomizedSearchCV(
          pipe_ens, param_dist_ensemble, n_iter=3,
          cv=inner_cv, scoring='balanced_accuracy',
-         n_jobs=1, refit=True, error_score='raise'
+         n_jobs=-1, refit=True, error_score='raise'
      )
      search_ens.fit(X_tr_sel, y_tr)
      y_pred_ens = search_ens.predict(X_ho_sel)
@@ -308,7 +299,6 @@ for sig_name, scenarios in signature_groups.items():
          for k, v in search_ens.cv_results_.items()}
      # save Ensemble model
      ens_model_path = f'models_{sig_name}/scenario_{scen_id}/{sig_name}_scen{scen_id}_{col}_ens_model.joblib'
-     os.makedirs(os.path.dirname(ens_model_path), exist_ok=True)
      dump(search_ens.best_estimator_, ens_model_path)
      logging.info(f"Saved Ensemble model to {ens_model_path}")
      logging.info(f"Ensemble best params for {sig_name}:{scen_id}, col {col}: {search_ens.best_params_}")
@@ -347,8 +337,7 @@ for sig_name, scenarios in signature_groups.items():
      scen_cv[col] = {'svm_cv': cv_svm, 'ensemble_cv': cv_ens}
 
  except Exception as e:
-     # log full traceback for easier debugging (written to nested_lodo_groupsv1.log)
-     logging.exception(f"{sig_name}:{scen_id}, col {col}: unexpected error")
+     logging.error(f"{sig_name}:{scen_id}, col {col}: {e}")
      print(f"[ERROR] {sig_name}:{scen_id}, column {col}: {e}")
@@ -358,11 +347,11 @@ for sig_name, scenarios in signature_groups.items():
      logging.info(f"[{sig_name}] {scen_id} done in {time.time()-t0:.1f}s")
 
  # Write group-level JSONs
- with open(f'nestedv1_results111_{sig_name}.json', 'w') as f:
+ with open(f'nested_results111_{sig_name}.json', 'w') as f:
      json.dump(all_results, f, indent=2)
- with open(f'nestedv1_features111_{sig_name}.json', 'w') as f:
+ with open(f'nested_features111_{sig_name}.json', 'w') as f:
      json.dump(all_features, f, indent=2)
- with open(f'nestedv1_cv111_{sig_name}.json', 'w') as f:
+ with open(f'nested_cv111_{sig_name}.json', 'w') as f:
      json.dump(all_cv, f, indent=2)
  print(f"✅ {sig_name} group complete: scenarios={list(all_results.keys())}")
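The numpy-to-JSON conversion that `Scenario_heldout_final_PRECISE.py` relies on when dumping metadata can be illustrated standalone. The sketch below is a simplified re-implementation of the script's `_convert_obj` helper (renamed `to_jsonable` here), not the script's exact code.

```python
# Simplified version of the script's _convert_obj helper: make numpy-laden
# dicts JSON-serializable by converting arrays and scalars to native types.
import json
import numpy as np

def to_jsonable(o):
    if isinstance(o, np.ndarray):
        return o.tolist()
    if isinstance(o, dict):
        return {k: to_jsonable(v) for k, v in o.items()}
    if isinstance(o, (list, tuple)):
        return [to_jsonable(v) for v in o]
    if isinstance(o, (np.integer, np.floating, np.bool_)):
        return o.item()  # numpy scalar -> native Python
    return o

meta = {'best_score': np.float64(0.91), 'n_features': np.int64(12),
        'means': np.array([0.1, 0.9])}
print(json.dumps(to_jsonable(meta)))
```

Without such a pass, `json.dump` raises `TypeError` on numpy scalars and arrays, which is why the script converts metadata before writing its JSON files.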
requirements.txt ADDED
@@ -0,0 +1,8 @@
+ numpy
+ pandas
+ scikit-learn
+ joblib
+ tqdm
+ pyyaml
+ pytest
+
retrain_helper.py ADDED
@@ -0,0 +1,232 @@
+ """retrain_helper.py
+ Small CLI to retrain saved SVM or Ensemble models using saved metadata.
+
+ Enhancements in this version:
+ - Accept a JSON or YAML config file via --config with keys: model_prefix, model_type (optional), train_csv, label_col, out_dir (optional)
+ - If model_type is omitted, auto-detect by checking for *_svm_params.json or *_ens_params.json next to the prefix
+ - CLI arguments override config values
+
+ Usage (from project root):
+     python retrain_helper.py --model-prefix "models_GBM/scenario_1/GBM_scen1_Tcell" --model-type svm --train-csv new_train.csv --label-col label
+ or using config.json/yaml:
+     python retrain_helper.py --config retrain_config.json
+
+ The script expects files with these suffixes next to the prefix:
+ - _selected_features.json (contains metadata.selected_features list)
+ - _svm_params.json or _ens_params.json (contains metadata.best_params)
+
+ It builds pipelines matching the original script, sets the best params, fits on the provided CSV using the selected features, and saves a retrained joblib model and a metadata JSON.
+ """
+
+ import argparse
+ import json
+ import os
+ from datetime import datetime, timezone
+ from joblib import dump
+ import pandas as pd
+ from sklearn.pipeline import Pipeline
+ from sklearn.preprocessing import StandardScaler
+ from sklearn.svm import SVC
+ from sklearn.ensemble import RandomForestClassifier, HistGradientBoostingClassifier, VotingClassifier
+
+ # optional yaml support
+ try:
+     import yaml
+     _HAS_YAML = True
+ except Exception:
+     _HAS_YAML = False
+
+
+ def load_json_meta(path):
+     with open(path, 'r') as f:
+         return json.load(f)
+
+
+ def load_config(path):
+     """Load JSON or YAML config file into a dict."""
+     if path.lower().endswith(('.yaml', '.yml')):
+         if not _HAS_YAML:
+             raise RuntimeError('PyYAML is not installed, cannot read YAML config')
+         with open(path, 'r') as f:
+             return yaml.safe_load(f)
+     else:
+         with open(path, 'r') as f:
+             return json.load(f)
+
+
+ def build_svm_pipeline():
+     pipe = Pipeline([
+         ('scaler', StandardScaler()),
+         ('clf', SVC(class_weight='balanced', probability=True, random_state=42))
+     ])
+     return pipe
+
+
+ def build_ensemble_pipeline():
+     # base pipe inside the voting ensemble should be named and structured like in the training script
+     base_pipe = Pipeline([
+         ('scaler', StandardScaler()),
+         ('classifier', SVC(class_weight='balanced', probability=True, random_state=42))
+     ])
+     ensemble = VotingClassifier([
+         ('svm', base_pipe),
+         ('rf', RandomForestClassifier(class_weight='balanced', random_state=42)),
+         ('gb', HistGradientBoostingClassifier(random_state=42))
+     ], voting='soft', weights=[1, 1, 1])
+     pipe = Pipeline([
+         ('scaler', StandardScaler()),
+         ('ensemble', ensemble)
+     ])
+     return pipe
+
+
+ def _auto_detect_model_type(model_prefix):
+     """Return 'svm' or 'ens' based on presence of params files next to the prefix.
+     If both are present, prefer 'svm' and warn."""
+     svm_path = model_prefix + '_svm_params.json'
+     ens_path = model_prefix + '_ens_params.json'
+     svm_exists = os.path.exists(svm_path)
+     ens_exists = os.path.exists(ens_path)
+     if svm_exists and not ens_exists:
+         return 'svm'
+     if ens_exists and not svm_exists:
+         return 'ens'
+     if svm_exists and ens_exists:
+         print('Warning: both SVM and Ensemble params found; defaulting to SVM')
+         return 'svm'
+     # if neither exists, raise
+     raise FileNotFoundError(f'Neither {svm_path} nor {ens_path} found for auto-detection')
+
+
+ def retrain(model_prefix, model_type=None, train_csv=None, label_col=None, out_dir=None, overwrite=False):
+     """Retrain a saved model using the saved selected-features and best-params metadata.
+
+     model_type can be 'svm' or 'ens' (ensemble). If None, the function will try to auto-detect.
+     """
+     if model_type is None:
+         model_type = _auto_detect_model_type(model_prefix)
+
+     # Resolve file paths
+     sel_path = model_prefix + '_selected_features.json'
+     if model_type.lower() == 'svm':
+         params_path = model_prefix + '_svm_params.json'
+     elif model_type.lower() in ('ens', 'ensemble'):
+         params_path = model_prefix + '_ens_params.json'
+     else:
+         raise ValueError('model_type must be "svm" or "ens"')
+
+     if not os.path.exists(sel_path):
+         raise FileNotFoundError(f'Selected-features file not found: {sel_path}')
+     if not os.path.exists(params_path):
+         raise FileNotFoundError(f'Params file not found: {params_path}')
+     if train_csv is None or not os.path.exists(train_csv):
+         raise FileNotFoundError(f'Train CSV not found: {train_csv}')
+
+     sel_meta = load_json_meta(sel_path)
+     # selected features are stored under the top-level key 'selected_features' (the training script writes metadata)
+     if isinstance(sel_meta, dict) and 'selected_features' in sel_meta:
+         sel_features = sel_meta['selected_features']
+     elif isinstance(sel_meta, list):
+         sel_features = sel_meta
+     else:
+         raise ValueError('Unexpected selected features file format')
+
+     params_meta = load_json_meta(params_path)
+     # params saved under 'best_params' inside metadata
+     if isinstance(params_meta, dict) and 'best_params' in params_meta:
+         best_params = params_meta['best_params']
+     else:
+         # fallback: file may contain bare params
+         best_params = params_meta
+
+     # load training data and subset columns
+     df = pd.read_csv(train_csv, index_col=0)
+
+     missing = [c for c in sel_features if c not in df.columns]
+     if missing:
+         raise ValueError(f'The following selected features are missing from training CSV: {missing}')
+
+     X = df[sel_features].values
+     y = df[label_col].values
+
+     # Build pipeline and set params
+     if model_type.lower() == 'svm':
+         pipe = build_svm_pipeline()
+     else:
+         pipe = build_ensemble_pipeline()
+
+     # set params (keys should match the original training param names)
+     try:
+         pipe.set_params(**best_params)
+     except Exception as e:
+         print('Warning: failed to set all params on pipeline:', e)
+         # continue anyway
+
+     # Fit
+     print(f'Fitting {model_type} on {X.shape[0]} samples with {X.shape[1]} features...')
+     pipe.fit(X, y)
+
+     # Save retrained model
+     ts = datetime.now(timezone.utc).strftime('%Y%m%d_%H%M%S')
+     if out_dir is None:
+         out_dir = os.path.dirname(model_prefix) or '.'
+     os.makedirs(out_dir, exist_ok=True)
+     model_out_path = os.path.join(out_dir, os.path.basename(model_prefix) + f'_{model_type}_retrained_{ts}.joblib')
+
+     # respect overwrite flag
+     if os.path.exists(model_out_path) and not overwrite:
+         raise FileExistsError(f'Model output already exists: {model_out_path}. Use overwrite=True to overwrite.')
+
+     dump(pipe, model_out_path)
+
+     # Save retrain metadata
+     meta = {
+         'retrained_at': datetime.now(timezone.utc).isoformat(),
+         'version': ts,
+         'model_type': model_type,
+         'n_samples': int(X.shape[0]),
+         'n_features': int(X.shape[1]),
+         'selected_features_file': os.path.abspath(sel_path),
+         'params_file': os.path.abspath(params_path),
+         'model_file': os.path.abspath(str(model_out_path))
+     }
+     meta_out = os.path.join(out_dir, os.path.basename(model_prefix) + f'_{model_type}_retrain_meta_{ts}.json')
+     with open(meta_out, 'w') as f:
+         json.dump(meta, f, indent=2)
+
+     print('Retrained model saved to:', model_out_path)
+     print('Retrain metadata saved to:', meta_out)
+     return model_out_path, meta_out
+
+
+ def main():
+     p = argparse.ArgumentParser(description='Retrain a saved model using saved selected features and best params')
+     p.add_argument('--config', required=False, help='Path to JSON or YAML config file with keys: model_prefix, model_type (optional), train_csv, label_col, out_dir')
+     p.add_argument('--model-prefix', required=False, help='Path prefix to model files (without suffix). E.g. models_GBM/scenario_1/GBM_scen1_Tcell')
+     p.add_argument('--model-type', required=False, choices=['svm', 'ens', 'ensemble'], help='svm or ens (if omitted, auto-detect)')
+     p.add_argument('--train-csv', required=False, help='CSV with training data (index column present). Must contain selected features and label column')
+     p.add_argument('--label-col', required=False, help='Name of the label column in train CSV')
+     p.add_argument('--out-dir', default=None, help='Output directory (defaults to model-prefix directory)')
+     p.add_argument('--overwrite', action='store_true', help='Overwrite existing output files')
+     args = p.parse_args()
+
+     cfg = {}
+     if args.config:
+         cfg = load_config(args.config) or {}
+
+     # Merge config and CLI args; CLI takes precedence
+     model_prefix = args.model_prefix or cfg.get('model_prefix')
+     model_type = args.model_type or cfg.get('model_type')
+     train_csv = args.train_csv or cfg.get('train_csv')
+     label_col = args.label_col or cfg.get('label_col')
+     out_dir = args.out_dir or cfg.get('out_dir')
+     overwrite = args.overwrite or cfg.get('overwrite', False)
+
+     if model_prefix is None or train_csv is None or label_col is None:
+         raise ValueError('model_prefix, train_csv and label_col must be provided either via --config or CLI args')
+
+     retrain(model_prefix, model_type=model_type, train_csv=train_csv, label_col=label_col, out_dir=out_dir, overwrite=overwrite)
+
+
+ if __name__ == '__main__':
+     main()
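The config/CLI merge in `main()` follows a simple precedence rule: a CLI value, when given, wins over the config-file value via `or`-fallback. That pattern can be exercised in isolation; `merge_settings` below is a name invented for this sketch, not a function in `retrain_helper.py`.

```python
# Minimal sketch of the CLI-overrides-config merge used in main():
# for each key, a truthy CLI value wins, otherwise the config value is used.
def merge_settings(cli, cfg):
    """Merge two dicts of settings; truthy CLI values take precedence."""
    keys = ('model_prefix', 'model_type', 'train_csv', 'label_col', 'out_dir')
    return {k: cli.get(k) or cfg.get(k) for k in keys}

cli = {'model_prefix': 'models_GBM/scenario_1/GBM_scen1_Tcell', 'label_col': 'label'}
cfg = {'model_prefix': 'ignored/by/cli', 'train_csv': 'data/new_train.csv'}
print(merge_settings(cli, cfg))
```

Note the `or`-based fallback treats empty strings like missing values, matching the helper's behaviour; use an explicit `is not None` check if empty strings should be allowed to override.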