mschuh committed on
Commit 0dd7279 · verified · 1 Parent(s): 759324e

Delete docs

Files changed (1)
  1. docs/proposed_lightgbm_framework.md +0 -203
docs/proposed_lightgbm_framework.md DELETED
# LightGBM-Based Multitask Workflow for Tox21

This document proposes a stepwise plan to replace the current GIN baseline (`train.py`, `predict.py`, `src/`) with a Gradient Boosting pipeline that remains compatible with the leaderboard I/O contract. Each phase can be validated independently before moving to the next, ensuring we have working training and inference artifacts at all times.

---

## 0. Repository Integration Checklist
- **Entry-points stay the same.** `train.py` must continue to train from `config/config.json` and drop an inference-ready artifact into `checkpoints/`. `predict.py` must keep the `predict(smiles_list)` signature and return the nested `{smiles: {target: score}}` mapping.
- **New modules.** Introduce `src/features.py` (fingerprints & caching), `src/lightgbm_trainer.py` (shared utilities for training/evaluation), and `src/stage_two.py` (cross-task augmentation logic). Keep `src/preprocess.py` for SMILES standardization + RDKit `Mol` construction so inference stays aligned with training.
- **Dependencies.** Add `lightgbm`, `optuna`, and `rdkit` (the legacy `rdkit-pypi` wheel is deprecated) to `requirements.txt`, plus optionally the `map4` package or vendored MAP4 reference code. Verify any native dependencies are supported by the Spaces environment.
- **Artifacts.** Store per-task boosters as `checkpoints/stage1_{task}.txt` and `checkpoints/stage2_{task}.txt` (LightGBM text dumps). Derived predictions (e.g., stage-1 OOF matrices) should live under `checkpoints/cache/` or `/tmp` during training, but inference must rely only on checkpoint files generated by `train.py`.

---

## 1. Phase 1 — Baseline LightGBM with Optuna

### 1.1 Data handling
1. Load the Hugging Face dataset inside `train.py` exactly as today (`load_dataset("ml-jku/tox21", token=TOKEN)`).
2. Keep the same per-split segmentation (train/validation/test) to remain comparable with the GIN baseline.
3. Convert SMILES strings to RDKit `Mol` objects using the existing cleaners in `src/preprocess.py`. For the baseline, featurize molecules with a minimal descriptor set (e.g., RDKit physicochemical descriptors) while fingerprints are being implemented.

### 1.2 Baseline features
Use easily computed descriptors such as:
- Molecular weight, logP, TPSA, number of H-bond donors/acceptors, rotatable bonds, aromatic proportion, etc.
- One-hot encodings for atom-count bins (C, N, O, halogens), concatenated to the descriptor vector.

This gives a quick tabular vector per SMILES while fingerprint work is in progress.
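
A minimal featurizer along these lines might look as follows (the descriptor choice and the `baseline_descriptors` name are illustrative, not repo code):

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import Descriptors

def baseline_descriptors(smiles):
    """Small physicochemical descriptor vector for one SMILES string.

    Returns None when RDKit cannot parse the input.
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    heavy = mol.GetNumHeavyAtoms() or 1
    aromatic = sum(atom.GetIsAromatic() for atom in mol.GetAtoms())
    return np.array([
        Descriptors.MolWt(mol),
        Descriptors.MolLogP(mol),
        Descriptors.TPSA(mol),
        Descriptors.NumHDonors(mol),
        Descriptors.NumHAcceptors(mol),
        Descriptors.NumRotatableBonds(mol),
        aromatic / heavy,  # aromatic proportion of heavy atoms
    ], dtype=np.float32)
```

The atom-count one-hot bins would be appended to this vector in the same function once the bin edges are chosen.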

### 1.3 Training objective
- **Task granularity:** Train one LightGBM binary classifier per Tox21 task (12 total). Targets remain the provided binary toxicity labels.
- **Metric:** ROC-AUC per task, with the macro-average for reporting (mirrors the leaderboard metric).
- **Data split:** For each task, drop rows with missing labels and perform K-fold CV (e.g., 5 folds) inside Optuna to make the best use of labeled data.

### 1.4 Optuna search space
Within `src/lightgbm_trainer.py`, expose an `objective(trial, task_name)` that:
1. Samples:
   - `learning_rate ∈ [1e-3, 0.2]` (log scale)
   - `num_leaves ∈ [16, 256]`
   - `max_depth ∈ [-1, 12]`
   - `min_data_in_leaf ∈ [10, 200]`
   - `feature_fraction ∈ [0.5, 1.0]`
   - `bagging_fraction ∈ [0.5, 1.0]` with `bagging_freq ∈ [1, 10]`
   - `lambda_l1`, `lambda_l2` (10^-8 to 10^1, log scale)
2. Trains the LightGBM model on each CV split and averages ROC-AUC.
3. Returns the negative mean ROC-AUC (an Optuna study minimizes by default; equivalently, return the mean ROC-AUC and create the study with `direction="maximize"`).

Persist the best hyperparameters per task into the config (or a JSON artifact) so `predict.py` can instantiate the booster with exact values. Fix the sampler's random seed for reproducibility (`src/seed.py` can be reused).

### 1.5 Deliverables for Phase 1
- Updated `train.py` calling into `src/lightgbm_trainer.train_single_task(task_name, features, labels, config)`.
- `checkpoints/stage1_{task}.txt` boosters (even though they are “stage 1”, they form the baseline deliverable).
- Validation report (per-task ROC-AUC) saved to `checkpoints/metrics_stage1.json`.
- `predict.py` loads each per-task LightGBM model, computes baseline descriptors on the fly, and returns predictions.

---

## 2. Phase 2 — Fingerprint-Based Representations

### 2.1 Feature computation
Implement `src/features.py` with methods:
- `compute_ecfp(mol, radius=2, n_bits=1024)` using RDKit's `GetMorganFingerprintAsBitVect`.
- `compute_map4(mol)` via the MAP4 reference implementation (MinHashed Atom-Pair fingerprint). Because MAP4 is computationally heavier, cache features to disk (e.g., `cache/fingerprints_{split}.npz`).
- `fingerprint_pipeline(smiles_list, fingerprint_type)` that accepts sanitized SMILES, constructs `Mol` objects, and returns a dense `np.ndarray`.
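
A sketch of the ECFP path (MAP4 omitted; the zero-vector fallback for unparsable SMILES is an assumption, not repo behavior):

```python
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def compute_ecfp(mol, radius=2, n_bits=1024):
    """Morgan/ECFP bit fingerprint as a dense float32 vector."""
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
    arr = np.zeros(n_bits, dtype=np.float32)
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr

def fingerprint_pipeline(smiles_list, fingerprint_type="ecfp", n_bits=1024):
    """Stack one fingerprint row per input SMILES into an (N, n_bits) array."""
    rows = []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            # Placeholder row for unparsable SMILES; keeps row alignment.
            rows.append(np.zeros(n_bits, dtype=np.float32))
            continue
        if fingerprint_type == "ecfp":
            rows.append(compute_ecfp(mol, n_bits=n_bits))
        else:
            raise ValueError(f"unsupported fingerprint type: {fingerprint_type}")
    return np.vstack(rows)
```

The disk cache would wrap this with an `np.savez`/`np.load` layer keyed by split name.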

### 2.2 Integration
- Update `train.py` to choose the fingerprint type from config (e.g., `config["features"]["type"] = "ecfp"`).
- Align `predict.py` to call the same fingerprint builder on incoming SMILES.
- Maintain metadata describing fingerprint dimensionality and type in a manifest (e.g., `checkpoints/features.json`) so inference knows how to parse the stored LightGBM feature order.
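
For example, `checkpoints/features.json` might record (field names are a suggestion, mirroring the config keys below):

```json
{
  "type": "ecfp",
  "radius": 2,
  "n_bits": 1024,
  "feature_order": "fp_0 … fp_1023"
}
```

`predict.py` can then refuse to load a booster whose manifest disagrees with the configured featurizer.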

### 2.3 Training flow
Apart from the enriched features, Phase 2 reuses the Phase 1 training loop. If resource constraints exist, we can:
- Run Optuna once on a representative task (e.g., NR-AhR) and reuse its best hyperparameters for all tasks; or
- Run Optuna briefly per task (e.g., 30 trials) and share results.

### 2.4 Deliverables
- Fingerprint cache builders + unit tests (small set of SMILES).
- Configurable training/inference that toggles between baseline descriptors and fingerprint vectors.
- Updated metrics comparing descriptors vs. ECFP vs. MAP4.

---

## 3. Phase 3 — Cross-Task Label Augmentation

### 3.1 Motivation
By incorporating predictions from other tasks, we expose LightGBM to shared toxicity patterns without building a fully joint model. This is especially valuable for underrepresented tasks where correlated labels provide additional signal.

### 3.2 Feature construction
Given `T = 12` tasks and fingerprint dimension `D`, the augmented features for task `k` are:

```
X_k = [fingerprint_vector (D dims), ŷ_1, …, ŷ_{k-1}, ŷ_{k+1}, …, ŷ_T]
```

where `ŷ_t` are the stage-1 predictions for task `t` on the same molecule. Use floats instead of hard labels to preserve uncertainty.
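
Assembling `X_k` reduces to a single column drop plus concatenation (function and argument names assumed):

```python
import numpy as np

def augment_features(fingerprints, stage1_preds, task_idx):
    """Build X_k: fingerprints + stage-1 scores for every task except task_idx.

    fingerprints: (N, D) array; stage1_preds: (N, T) array of float scores.
    Returns an (N, D + T - 1) array.
    """
    other = np.delete(stage1_preds, task_idx, axis=1)  # drop column k
    return np.hstack([fingerprints, other])
```

Keeping this as one shared helper guarantees training and inference drop the same column.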

### 3.3 Implementation details
1. **Collect stage-1 predictions.**
   - After Phase 2 training, run inference with each stage-1 model on every molecule in the train/val/test splits.
   - Store the `N × T` prediction matrix in `checkpoints/stage1_predictions_{split}.npz`.
2. **Align missing data.**
   - If task `t` lacks a label for a molecule, mask it during stage-1 training but still compute predictions for other tasks so the feature matrix stays dense.
3. **Data leakage prevention.**
   - During training, use out-of-fold (OOF) predictions for the stage-1 features so models do not see their own ground-truth labels through the augmented vector.
   - Implementation: for each fold, train stage-1 LightGBM on K−1 folds, predict on the held-out fold, and concatenate predictions.
4. **Config surface.**
   - `config["multitask"]["use_stage1_predictions"] = true/false`
   - `config["multitask"]["prediction_source"] = "oof" | "full_train"` to switch between strict OOF features and simpler (but leakier) full-train predictions for debugging.
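
The OOF loop in step 3 can be sketched generically; `fit_predict` is a hypothetical hook that trains a stage-1 model on the training folds and scores the held-out fold:

```python
import numpy as np
from sklearn.model_selection import KFold

def oof_predictions(X, y, fit_predict, n_folds=5, seed=42):
    """Out-of-fold stage-1 scores: each row is predicted by a model that
    never saw that row during training.

    fit_predict(X_train, y_train, X_valid) -> 1-D scores for X_valid.
    """
    oof = np.zeros(len(X))
    cv = KFold(n_splits=n_folds, shuffle=True, random_state=seed)
    for tr, va in cv.split(X):
        oof[va] = fit_predict(X[tr], y[tr], X[va])
    return oof
```

In the real pipeline `fit_predict` would wrap `lgb.train` with the task's tuned parameters; the resulting columns are stacked into the `N × T` matrix from step 1.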

### 3.4 Training
Once augmented features are ready, rerun the single-task LightGBM training per target (`stage2`). The hyperparameter search can be narrower because fingerprints already provide a strong baseline; focus on `num_leaves`, `feature_fraction`, and regularization strength.

### 3.5 Deliverables
- Scripts that generate OOF prediction matrices.
- Updated `train.py` orchestration:
  1. Train Stage 1 models.
  2. Materialize the cross-task prediction cache.
  3. Train Stage 2 models from the augmented features.
- Metrics comparing Stage 1 vs. Stage 2 per task.

---

## 4. Phase 4 — Two-Stage Training & Inference

### 4.1 Training orchestration
Pseudo-flow for `train.py`:

```python
def train(config):
    ds = load_dataset(...)
    mols = preprocess.standardize(ds["train"]["smiles"])
    fp_cache = features.fingerprint_pipeline(mols, config["features"])

    stage1 = StageOneTrainer(config)
    stage1.train_all_tasks(fp_cache, labels, splits)
    stage1.save_models("checkpoints/stage1_*.txt")

    pred_cache = stage1.generate_predictions(fp_cache, splits, use_oof=True)

    stage2 = StageTwoTrainer(config)
    stage2.train_all_tasks(fp_cache, pred_cache, labels)
    stage2.save_models("checkpoints/stage2_*.txt")

    dump_metrics(stage1.metrics, stage2.metrics)
```

### 4.2 Inference pipeline (`predict.py`)
1. **Fingerprint computation:** identical to training (deterministic sanitization).
2. **Stage-1 pass:** Load every `stage1_{task}.txt`, predict on the incoming SMILES batch, and collect predictions.
3. **Stage-2 pass:** For each task `k`, build `[fingerprint, predicted_labels_except_k]` on the fly and evaluate the corresponding stage-2 booster.
4. **Output:** Return the stage-2 predictions for leaderboard submission. Optionally include stage-1 scores in the response if needed for debugging (but the official output should stick to stage-2 values).
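
The stage-1/stage-2 passes reduce to the sketch below. Names are assumptions; each model object only needs a `.predict(ndarray) -> ndarray` method, e.g. a `lightgbm.Booster` loaded from the `checkpoints/stage{1,2}_{task}.txt` dumps:

```python
import numpy as np

def predict_two_stage(fingerprints, stage1_models, stage2_models):
    """Two-stage inference.

    fingerprints: (N, D) array for the incoming SMILES batch.
    stage1_models / stage2_models: {task_name: booster-like}, same keys;
    dict insertion order fixes the task (column) order for both stages.
    Returns {task_name: (N,) score array}.
    """
    tasks = list(stage1_models)
    # Stage 1: one score column per task.
    stage1 = np.column_stack([stage1_models[t].predict(fingerprints) for t in tasks])
    out = {}
    for k, task in enumerate(tasks):
        # Stage 2: fingerprint + all stage-1 scores except this task's own.
        other = np.delete(stage1, k, axis=1)
        out[task] = stage2_models[task].predict(np.hstack([fingerprints, other]))
    return out
```

`predict.py` would then map these arrays back into the `{smiles: {target: score}}` response.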

### 4.3 Failure modes & mitigations
- **Unrecognized SMILES:** fall back to zeros or 0.5 predictions like the current baseline, but log warnings so we can monitor failure rates.
- **Missing checkpoint:** raise an informative exception instructing users to rerun `train.py`.
- **Performance drift:** store SHA or timestamp metadata with checkpoints to trace which training configuration produced a given model.

---

## 5. Configuration & Experiment Tracking
Proposed structure for `config/config.json`:

```json
{
  "seed": 42,
  "features": {
    "type": "ecfp",
    "radius": 2,
    "n_bits": 1024,
    "use_counts": false
  },
  "training": {
    "n_folds": 5,
    "n_optuna_trials": 50,
    "lightgbm_params": {
      "objective": "binary",
      "metric": "auc",
      "verbosity": -1
    }
  },
  "multitask": {
    "enabled": true,
    "use_stage1_predictions": true,
    "prediction_source": "oof"
  }
}
```

Track experiment results in `checkpoints/experiments.csv` with columns `[timestamp, fingerprint, stage, task, auc, params_hash]`.

---

## 6. Testing & Validation
- **Unit tests:** Ensure fingerprint builders reproduce known vectors (compare with an RDKit reference) and that cross-task feature assembly drops the correct task column.
- **Integration tests:** A small toy dataset (3 tasks, <50 samples) to run the full Stage1→Stage2 pipeline quickly. Assert the shapes of caches and that inference matches training predictions.
- **Performance tracking:** Plot per-task ROC-AUC improvements by phase to confirm each enhancement adds value.

---

## 7. Suggested Implementation Milestones
1. **M1:** Skeleton LightGBM trainer + Optuna integration (Phase 1). ✓
2. **M2:** Fingerprint computation module with caching + updated training/inference (Phase 2).
3. **M3:** Stage-1 prediction cache + feature augmentation (Phase 3).
4. **M4:** End-to-end Stage1→Stage2 orchestration, packaging of checkpoints, and inference updates (Phase 4).
5. **M5:** Documentation + automated tests to guard against regressions.

This phased roadmap keeps the leaderboard interface intact while progressively increasing the modeling capacity from simple descriptors to multitask-enhanced fingerprints.