Maximilian Schuh committed
Commit 759324e
Parent: e1aa0ed

Added new files and updated learning
README.md CHANGED
@@ -20,30 +20,6 @@ MultiTaskTox is a two-stage Gradient Boosting workflow purpose-built for the [To
 - **Multitask enhancement** – stage two augments the fingerprint vector with the predictions of the other tasks, capturing label correlations without building a fully joint model.
 - **Leaderboard-ready interface** – `train.py` produces checkpoints and metadata under `checkpoints/`, while `predict.py` exposes the required `predict(smiles_list)` signature.
 
-## Installation
-
-```bash
-git clone https://huggingface.co/spaces/ml-jku/tox21_gin_classifier
-cd tox21_gin_classifier
-python -m venv .venv && source .venv/bin/activate
-pip install --upgrade pip
-pip install -r requirements.txt
-```
-
-The requirements include RDKit, LightGBM, Optuna, and the MAP4 fingerprint package so you can switch feature types via the config.
-
-## Training
-
-1. Create a `.env` file (all Hugging Face Spaces support secrets) with your dataset token:
-   ```
-   TOKEN=hf_xxx
-   ```
-2. Adjust `config/config.json` if needed (fingerprint type, Optuna trial count, etc.).
-3. Run:
-   ```bash
-   python train.py
-   ```
-
 ### What `train.py` does
 
 1. Loads the predefined `train` and `validation` splits from the Tox21 dataset.
@@ -88,7 +64,7 @@ The function:
   },
   "training": {
     "optuna_trials": 40,
-    "boosting_rounds": 1500,
+    "n_estimators": [50, 500, 1000],
     "early_stopping_rounds": 100,
     "lightgbm_params": {
       "objective": "binary",
@@ -104,6 +80,7 @@ The function:
 - Switch `features.type` to `"map4"` to use MAP4 fingerprints (installed by default).
 - Disable multitask behavior by setting `"multitask": {"enabled": false}`.
 - Increase `optuna_trials` for a more exhaustive search if compute allows.
+- Set `training.n_estimators` to either a single integer or a list of candidate values (default `[50, 500, 1000]`) to control the Optuna search space for the `n_estimators` hyperparameter.
 
 ## Repository Layout
 
@@ -116,11 +93,3 @@ The function:
 - `src/constants.py`, `src/seed.py` – shared utilities.
 - `docs/proposed_lightgbm_framework.md` – detailed design notes for the workflow.
 - `checkpoints/` – default output directory containing models, metrics, caches, and the training manifest used at inference time.
-
-## Tips
-
-- Training relies on the `TOKEN` environment variable to access the Tox21 dataset on Hugging Face. Locally you can omit it if the dataset is public for your account.
-- MAP4 fingerprints are more expensive to compute; enable the cache directory to avoid recomputation across runs.
-- Use the saved metrics files to compare stage-one vs. stage-two AUCs and to trace which configuration produced a set of checkpoints.
-
-Happy modeling! If you extend MultiTaskTox (new fingerprints, alternative learners, etc.), keep the `predict(smiles)` contract intact so your Space remains leaderboard compatible.
checkpoints/cache/stage1_train_predictions.npz ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:019864a99cd005c40bc979da6ec6594398a02fdeaaedb943ea336b7603efab49
+size 565279

checkpoints/cache/stage1_validation_predictions.npz ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:eab5701f117dbe4f1bd0755d870e1f5518e9fb99eae505c459af3576534fb843
+size 15007

checkpoints/cache/train_ecfp.npz ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:dfc0c773eb523d256b84fec1cc314f290bcc5b9288073aff576712437450bf5a
+size 48726858

checkpoints/cache/validation_ecfp.npz ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:52f576b159c6e09d1ca22f0e260256cd46deacb137dfe77f127c756117671830
+size 1224294
checkpoints/metrics_stage1.json ADDED
@@ -0,0 +1,62 @@
+{
+  "NR-AhR": {
+    "val_auc": 0.8789764868603043,
+    "n_train_samples": 8165,
+    "n_val_samples": 271
+  },
+  "NR-AR": {
+    "val_auc": 0.8547453703703702,
+    "n_train_samples": 9358,
+    "n_val_samples": 291
+  },
+  "NR-AR-LBD": {
+    "val_auc": 0.9717741935483871,
+    "n_train_samples": 8595,
+    "n_val_samples": 252
+  },
+  "NR-Aromatase": {
+    "val_auc": 0.8483560090702948,
+    "n_train_samples": 7222,
+    "n_val_samples": 214
+  },
+  "NR-ER": {
+    "val_auc": 0.7974683544303797,
+    "n_train_samples": 7694,
+    "n_val_samples": 264
+  },
+  "NR-ER-LBD": {
+    "val_auc": 0.8347826086956521,
+    "n_train_samples": 8749,
+    "n_val_samples": 286
+  },
+  "NR-PPAR-gamma": {
+    "val_auc": 0.8077025232403718,
+    "n_train_samples": 8180,
+    "n_val_samples": 266
+  },
+  "SR-ARE": {
+    "val_auc": 0.8194921070693205,
+    "n_train_samples": 7165,
+    "n_val_samples": 233
+  },
+  "SR-ATAD5": {
+    "val_auc": 0.8749593495934959,
+    "n_train_samples": 9087,
+    "n_val_samples": 271
+  },
+  "SR-HSE": {
+    "val_auc": 0.92421875,
+    "n_train_samples": 8147,
+    "n_val_samples": 266
+  },
+  "SR-MMP": {
+    "val_auc": 0.9280613594287226,
+    "n_train_samples": 7317,
+    "n_val_samples": 237
+  },
+  "SR-p53": {
+    "val_auc": 0.8038690476190476,
+    "n_train_samples": 8630,
+    "n_val_samples": 268
+  }
+}
checkpoints/metrics_stage2.json ADDED
@@ -0,0 +1,50 @@
+{
+  "NR-AhR": {
+    "val_auc": 0.8740663900414938,
+    "best_iteration": 5
+  },
+  "NR-AR": {
+    "val_auc": 0.8854166666666666,
+    "best_iteration": 204
+  },
+  "NR-AR-LBD": {
+    "val_auc": 0.9828629032258065,
+    "best_iteration": 23
+  },
+  "NR-Aromatase": {
+    "val_auc": 0.8572845804988662,
+    "best_iteration": 8
+  },
+  "NR-ER": {
+    "val_auc": 0.8123144241287701,
+    "best_iteration": 2
+  },
+  "NR-ER-LBD": {
+    "val_auc": 0.857608695652174,
+    "best_iteration": 4
+  },
+  "NR-PPAR-gamma": {
+    "val_auc": 0.8739707835325365,
+    "best_iteration": 13
+  },
+  "SR-ARE": {
+    "val_auc": 0.857698467169984,
+    "best_iteration": 4
+  },
+  "SR-ATAD5": {
+    "val_auc": 0.8530081300813008,
+    "best_iteration": 145
+  },
+  "SR-HSE": {
+    "val_auc": 0.9201171875,
+    "best_iteration": 2
+  },
+  "SR-MMP": {
+    "val_auc": 0.9312351229833377,
+    "best_iteration": 395
+  },
+  "SR-p53": {
+    "val_auc": 0.8722470238095239,
+    "best_iteration": 3
+  }
+}
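The two metrics files above support the README's tip of comparing stage-one vs. stage-two validation AUCs. A minimal sketch of such a comparison (the `auc_deltas` helper is illustrative, not part of the repo; paths are the ones written by `train.py` in this commit):

```python
import json
from pathlib import Path


def auc_deltas(stage1: dict, stage2: dict) -> dict:
    """Per-task change in validation AUC from stage one to stage two."""
    return {
        task: stage2[task]["val_auc"] - stage1[task]["val_auc"]
        for task in stage1
        if task in stage2 and "val_auc" in stage1.get(task, {})
    }


if __name__ == "__main__":
    stage1 = json.loads(Path("checkpoints/metrics_stage1.json").read_text())
    stage2 = json.loads(Path("checkpoints/metrics_stage2.json").read_text())
    # Sort by delta so tasks hurt by the multitask stage surface first.
    for task, delta in sorted(auc_deltas(stage1, stage2).items(), key=lambda kv: kv[1]):
        print(f"{task}: {delta:+.4f}")
```

On the values shown here, for example, `NR-AR` improves in stage two while `NR-AhR` slips slightly.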
checkpoints/stage1/NR-AR-LBD.pkl ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:3d1b4030d07f948b2b549fc6426587a05c269648fbc40131d6e99241c8972d28
+size 627492

checkpoints/stage1/NR-AR.pkl ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:901eb451790a2607cc3e3c6eb6328f947899b557ec9887adc2481aea5a95433e
+size 24772

checkpoints/stage1/NR-AhR.pkl ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:5d2afa34f2143ab1d38247ae82d9c859ea74c37799134c571e30a3ec3e0cfe10
+size 1001188

checkpoints/stage1/NR-Aromatase.pkl ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:cf3aef5acc8411f33bbfe184a45ecb8b292eee93260f1b688a82e2ba26e83ae6
+size 428964

checkpoints/stage1/NR-ER-LBD.pkl ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:0d42aa1195696a8bab89cf0a6ad5914cbfb198d8e946d316dbec3df22621181b
+size 44628

checkpoints/stage1/NR-ER.pkl ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:8be3a13fc68d2c96b827dab8771026de3944e400150b3b173cc77aa23eec328e
+size 113444

checkpoints/stage1/NR-PPAR-gamma.pkl ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:c6cbc9dce92de09a74df3df581de742044f1ae7d522a5018d8006474d089768d
+size 58420

checkpoints/stage1/SR-ARE.pkl ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:47fc92884f6ed03d8738f58b809a6c616d13dc7322249008941fbd2506f834bf
+size 392756

checkpoints/stage1/SR-ATAD5.pkl ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:d00260b6422786faed1efafb99deca01be4021bc7bf4b6da8e807beacac6e14a
+size 375092

checkpoints/stage1/SR-HSE.pkl ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:a7b67c8aba900b7036a4d41e1171d3dabee2423fb1a08a26accf5b3d533065da
+size 237860

checkpoints/stage1/SR-MMP.pkl ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:528a7efd2fe9748037da657d88959821f994dfbe35cdbec0a4c415bd3a9ed898
+size 606164

checkpoints/stage1/SR-p53.pkl ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:9c2a2dceb70d39f5d72cf68f104be68c763dcf25da33b41bd45abcbcc2e07906
+size 1101540
checkpoints/stage1_params.json ADDED
@@ -0,0 +1,242 @@
+{
+  "NR-AhR": {
+    "objective": "binary",
+    "metric": "auc",
+    "verbosity": -1,
+    "learning_rate": 0.04596965586436891,
+    "num_leaves": 26,
+    "max_depth": 11,
+    "min_child_samples": 14,
+    "feature_fraction": 0.9910514912963474,
+    "bagging_fraction": 0.5143055600954589,
+    "bagging_freq": 10,
+    "reg_alpha": 1.514871105815587e-07,
+    "reg_lambda": 1.0837426184192575e-05,
+    "n_estimators": 500,
+    "boosting_type": "gbdt",
+    "n_jobs": -1,
+    "random_state": 42,
+    "best_iteration": 321,
+    "val_auc": 0.8789764868603043
+  },
+  "NR-AR": {
+    "objective": "binary",
+    "metric": "auc",
+    "verbosity": -1,
+    "learning_rate": 0.17083936308641903,
+    "num_leaves": 80,
+    "max_depth": 3,
+    "min_child_samples": 91,
+    "feature_fraction": 0.9278238464266182,
+    "bagging_fraction": 0.6395695127332895,
+    "bagging_freq": 7,
+    "reg_alpha": 5.1826386659948485,
+    "reg_lambda": 0.1806421496400551,
+    "n_estimators": 500,
+    "boosting_type": "gbdt",
+    "n_jobs": -1,
+    "random_state": 42,
+    "best_iteration": 1,
+    "val_auc": 0.8547453703703702
+  },
+  "NR-AR-LBD": {
+    "objective": "binary",
+    "metric": "auc",
+    "verbosity": -1,
+    "learning_rate": 0.1992149524787967,
+    "num_leaves": 180,
+    "max_depth": 12,
+    "min_child_samples": 49,
+    "feature_fraction": 0.5650870342995183,
+    "bagging_fraction": 0.5114405817018007,
+    "bagging_freq": 2,
+    "reg_alpha": 0.00012622761264931666,
+    "reg_lambda": 1.2770493065015112e-08,
+    "n_estimators": 500,
+    "boosting_type": "gbdt",
+    "n_jobs": -1,
+    "random_state": 42,
+    "best_iteration": 210,
+    "val_auc": 0.9717741935483871
+  },
+  "NR-Aromatase": {
+    "objective": "binary",
+    "metric": "auc",
+    "verbosity": -1,
+    "learning_rate": 0.07977276592360394,
+    "num_leaves": 225,
+    "max_depth": 10,
+    "min_child_samples": 10,
+    "feature_fraction": 0.9900155440745377,
+    "bagging_fraction": 0.8119652252471269,
+    "bagging_freq": 2,
+    "reg_alpha": 2.594339538659424e-06,
+    "reg_lambda": 5.04193406028159e-07,
+    "n_estimators": 1000,
+    "boosting_type": "gbdt",
+    "n_jobs": -1,
+    "random_state": 42,
+    "best_iteration": 62,
+    "val_auc": 0.8483560090702948
+  },
+  "NR-ER": {
+    "objective": "binary",
+    "metric": "auc",
+    "verbosity": -1,
+    "learning_rate": 0.011154410439266932,
+    "num_leaves": 193,
+    "max_depth": 0,
+    "min_child_samples": 13,
+    "feature_fraction": 0.976937312627059,
+    "bagging_fraction": 0.8101945461275061,
+    "bagging_freq": 10,
+    "reg_alpha": 3.712452767192828e-05,
+    "reg_lambda": 2.4101531861104065e-07,
+    "n_estimators": 500,
+    "boosting_type": "gbdt",
+    "n_jobs": -1,
+    "random_state": 42,
+    "best_iteration": 4,
+    "val_auc": 0.7974683544303797
+  },
+  "NR-ER-LBD": {
+    "objective": "binary",
+    "metric": "auc",
+    "verbosity": -1,
+    "learning_rate": 0.002057546388429118,
+    "num_leaves": 46,
+    "max_depth": 7,
+    "min_child_samples": 183,
+    "feature_fraction": 0.9180640687080166,
+    "bagging_fraction": 0.5266816009973788,
+    "bagging_freq": 2,
+    "reg_alpha": 0.4206611876821779,
+    "reg_lambda": 0.024870165101451694,
+    "n_estimators": 500,
+    "boosting_type": "gbdt",
+    "n_jobs": -1,
+    "random_state": 42,
+    "best_iteration": 18,
+    "val_auc": 0.8347826086956521
+  },
+  "NR-PPAR-gamma": {
+    "objective": "binary",
+    "metric": "auc",
+    "verbosity": -1,
+    "learning_rate": 0.05900664732601199,
+    "num_leaves": 235,
+    "max_depth": 10,
+    "min_child_samples": 73,
+    "feature_fraction": 0.9286215997058814,
+    "bagging_fraction": 0.6099533117515117,
+    "bagging_freq": 8,
+    "reg_alpha": 0.0002410679912248092,
+    "reg_lambda": 2.2374585644452814e-05,
+    "n_estimators": 50,
+    "boosting_type": "gbdt",
+    "n_jobs": -1,
+    "random_state": 42,
+    "best_iteration": 16,
+    "val_auc": 0.8077025232403718
+  },
+  "SR-ARE": {
+    "objective": "binary",
+    "metric": "auc",
+    "verbosity": -1,
+    "learning_rate": 0.11683900433079997,
+    "num_leaves": 25,
+    "max_depth": 6,
+    "min_child_samples": 121,
+    "feature_fraction": 0.9155819290034379,
+    "bagging_fraction": 0.6064835300540737,
+    "bagging_freq": 6,
+    "reg_alpha": 0.031558826962596216,
+    "reg_lambda": 0.49750603290125384,
+    "n_estimators": 1000,
+    "boosting_type": "gbdt",
+    "n_jobs": -1,
+    "random_state": 42,
+    "best_iteration": 313,
+    "val_auc": 0.8194921070693205
+  },
+  "SR-ATAD5": {
+    "objective": "binary",
+    "metric": "auc",
+    "verbosity": -1,
+    "learning_rate": 0.1398668165886807,
+    "num_leaves": 71,
+    "max_depth": 12,
+    "min_child_samples": 96,
+    "feature_fraction": 0.86899957688569,
+    "bagging_fraction": 0.7807007020967096,
+    "bagging_freq": 10,
+    "reg_alpha": 1.010171177179027e-08,
+    "reg_lambda": 2.3747514557444565e-07,
+    "n_estimators": 500,
+    "boosting_type": "gbdt",
+    "n_jobs": -1,
+    "random_state": 42,
+    "best_iteration": 126,
+    "val_auc": 0.8749593495934959
+  },
+  "SR-HSE": {
+    "objective": "binary",
+    "metric": "auc",
+    "verbosity": -1,
+    "learning_rate": 0.15091207817136804,
+    "num_leaves": 246,
+    "max_depth": 0,
+    "min_child_samples": 19,
+    "feature_fraction": 0.7867053613239711,
+    "bagging_fraction": 0.7013484568271124,
+    "bagging_freq": 9,
+    "reg_alpha": 0.0006360962863973946,
+    "reg_lambda": 6.440534124809522e-05,
+    "n_estimators": 500,
+    "boosting_type": "gbdt",
+    "n_jobs": -1,
+    "random_state": 42,
+    "best_iteration": 14,
+    "val_auc": 0.92421875
+  },
+  "SR-MMP": {
+    "objective": "binary",
+    "metric": "auc",
+    "verbosity": -1,
+    "learning_rate": 0.19884775296113416,
+    "num_leaves": 18,
+    "max_depth": 11,
+    "min_child_samples": 88,
+    "feature_fraction": 0.9550517881195121,
+    "bagging_fraction": 0.7860517484953123,
+    "bagging_freq": 1,
+    "reg_alpha": 0.030296963787402084,
+    "reg_lambda": 0.8044239737357854,
+    "n_estimators": 1000,
+    "boosting_type": "gbdt",
+    "n_jobs": -1,
+    "random_state": 42,
+    "best_iteration": 284,
+    "val_auc": 0.9280613594287226
+  },
+  "SR-p53": {
+    "objective": "binary",
+    "metric": "auc",
+    "verbosity": -1,
+    "learning_rate": 0.1655358096626077,
+    "num_leaves": 85,
+    "max_depth": 0,
+    "min_child_samples": 87,
+    "feature_fraction": 0.8995086723685061,
+    "bagging_fraction": 0.9621288945710826,
+    "bagging_freq": 6,
+    "reg_alpha": 0.001186616750567751,
+    "reg_lambda": 0.00030749373152708483,
+    "n_estimators": 1000,
+    "boosting_type": "gbdt",
+    "n_jobs": -1,
+    "random_state": 42,
+    "best_iteration": 132,
+    "val_auc": 0.8038690476190476
+  }
+}
checkpoints/stage2/NR-AR-LBD.pkl ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:b877924ff923463aeb3d9347cc2d64077de0fd1e86c5ebceb1d6e85e7748cbba
+size 71988

checkpoints/stage2/NR-AR.pkl ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:02e93424e24eb7a4fb704a191ad6d68c05145ab333d434c089e5bb98ff94fb9c
+size 1836148

checkpoints/stage2/NR-AhR.pkl ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:6681c8042c04542aed22e3c7e1bc756d78cbd90035a1717050202e4841a32c40
+size 61716

checkpoints/stage2/NR-Aromatase.pkl ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:58d2b37062490cb61caf6b36ed765493fb17bd1805a2f9242e3c8dda058ed05f
+size 41540

checkpoints/stage2/NR-ER-LBD.pkl ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:be0e41ee6f1e0463073f4eeb983bfb6f7bcedfba2174e18bab55118feec9ef0c
+size 37092

checkpoints/stage2/NR-ER.pkl ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:5aaaec49348aef97d610e5b0dd01ea41728474a5c8fc5cee3e961f0b97a28a1f
+size 28884

checkpoints/stage2/NR-PPAR-gamma.pkl ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:9f95337de1dd3dd937e3e15c74ff80b9cdb4d33b445446c1cd1e4044bcc3cb21
+size 74644

checkpoints/stage2/SR-ARE.pkl ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:9041bf8b713c4ec2793b67a41d2e302b9aa1f615625a2be53c12efdfdec1232c
+size 52196

checkpoints/stage2/SR-ATAD5.pkl ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:aafc9da7d6dc67e322264cfdd34788b2aa509e91c9de58a3369ce366b89fbec6
+size 485092

checkpoints/stage2/SR-HSE.pkl ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:763d559e337eb24b837d7a0cb9a530baa6b3ac8eccd1195e68ba39fbb078d3cb
+size 30452

checkpoints/stage2/SR-MMP.pkl ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:c8565440bb1a9284032924de804d10993efefd71ce00f92f4014cc11509f51a2
+size 344964

checkpoints/stage2/SR-p53.pkl ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:f446318ecd2ac93f685664bf68a0866e23a9275d53a32199f6f74557577c2152
+size 30916
checkpoints/training_manifest.json ADDED
@@ -0,0 +1,41 @@
+{
+  "feature_config": {
+    "type": "ecfp",
+    "radius": 2,
+    "n_bits": 1024,
+    "use_counts": false,
+    "map4_dim": 1024,
+    "cache_dir": "./checkpoints/cache"
+  },
+  "target_names": [
+    "NR-AhR",
+    "NR-AR",
+    "NR-AR-LBD",
+    "NR-Aromatase",
+    "NR-ER",
+    "NR-ER-LBD",
+    "NR-PPAR-gamma",
+    "SR-ARE",
+    "SR-ATAD5",
+    "SR-HSE",
+    "SR-MMP",
+    "SR-p53"
+  ],
+  "dataset": {
+    "name": "ml-jku/tox21"
+  },
+  "stage1": {
+    "model_dir": "checkpoints/stage1",
+    "metrics": "checkpoints/metrics_stage1.json"
+  },
+  "stage2": {
+    "enabled": true,
+    "model_dir": "checkpoints/stage2",
+    "metrics": "checkpoints/metrics_stage2.json"
+  },
+  "multitask": {
+    "enabled": true,
+    "prediction_source": "oof"
+  },
+  "seed": 42
+}
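The manifest ties a set of checkpoints back to the configuration that produced them and is consumed at inference time. A small sanity-check sketch (the `missing_metrics` helper is illustrative, not part of the repo): it flags manifest targets that lack an entry in a loaded metrics file, e.g. tasks skipped during training.

```python
def missing_metrics(manifest: dict, metrics: dict) -> list:
    """Targets declared in the manifest with no entry in a stage metrics dict."""
    return [t for t in manifest.get("target_names", []) if t not in metrics]
```

Running it against `checkpoints/training_manifest.json` and `checkpoints/metrics_stage1.json` from this commit should return an empty list, since all twelve Tox21 tasks trained successfully.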
config/config.json CHANGED
@@ -12,8 +12,8 @@
     "cache_dir": "./checkpoints/cache"
   },
   "training": {
-    "optuna_trials": 40,
-    "boosting_rounds": 1500,
+    "optuna_trials": 1000,
+    "n_estimators": [50, 500, 1000],
     "early_stopping_rounds": 100,
     "lightgbm_params": {
       "objective": "binary",
requirements.txt CHANGED
@@ -11,3 +11,4 @@ lightgbm
 optuna
 joblib
 map4
+tqdm
src/lightgbm_trainer.py CHANGED
@@ -11,6 +11,7 @@ import numpy as np
11
  import optuna
12
  import pandas as pd
13
  from sklearn.metrics import roc_auc_score
 
14
 
15
  from .constants import TARGET_NAMES
16
 
@@ -23,7 +24,29 @@ class TaskTrainingOutput:
23
  best_params: Dict
24
 
25
 
26
- def _sample_hyperparams(trial: optuna.Trial, base_params: Dict) -> Dict:
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
27
  params = dict(base_params)
28
  params.update(
29
  {
@@ -36,6 +59,7 @@ def _sample_hyperparams(trial: optuna.Trial, base_params: Dict) -> Dict:
36
  "bagging_freq": trial.suggest_int("bagging_freq", 1, 10),
37
  "reg_alpha": trial.suggest_float("reg_alpha", 1e-8, 10.0, log=True),
38
  "reg_lambda": trial.suggest_float("reg_lambda", 1e-8, 10.0, log=True),
 
39
  }
40
  )
41
  params.setdefault("objective", "binary")
@@ -52,7 +76,7 @@ def train_lightgbm_task(
52
  X_val: np.ndarray,
53
  y_val: np.ndarray,
54
  base_params: Dict,
55
- boosting_rounds: int,
56
  early_stopping_rounds: int,
57
  n_trials: int,
58
  seed: int,
@@ -61,8 +85,7 @@ def train_lightgbm_task(
61
  return None
62
 
63
  def objective(trial: optuna.Trial) -> float:
64
- params = _sample_hyperparams(trial, base_params)
65
- params["n_estimators"] = boosting_rounds
66
  params["random_state"] = seed
67
  model = lgb.LGBMClassifier(**params)
68
  model.fit(
@@ -77,17 +100,15 @@ def train_lightgbm_task(
77
  verbose=False,
78
  )
79
  ],
80
- verbose=False,
81
  )
82
- best_iter = getattr(model, "best_iteration_", boosting_rounds)
83
  preds = model.predict_proba(X_val, num_iteration=best_iter)[:, 1]
84
  return float(roc_auc_score(y_val, preds))
85
 
86
  study = optuna.create_study(direction="maximize")
87
  study.optimize(objective, n_trials=n_trials, show_progress_bar=False)
88
 
89
- best_params = _sample_hyperparams(study.best_trial, base_params)
90
- best_params["n_estimators"] = boosting_rounds
91
  best_params["random_state"] = seed
92
 
93
  final_model = lgb.LGBMClassifier(**best_params)
@@ -103,10 +124,9 @@ def train_lightgbm_task(
103
  verbose=False,
104
  )
105
  ],
106
- verbose=False,
107
  )
108
 
109
- best_iteration = getattr(final_model, "best_iteration_", boosting_rounds)
110
  val_preds = final_model.predict_proba(X_val, num_iteration=best_iteration)[:, 1]
111
  val_auc = roc_auc_score(y_val, val_preds)
112
 
@@ -139,12 +159,13 @@ def train_stage_one_models(
139
  training_cfg = config.get("training", {})
140
  base_params = training_cfg.get("lightgbm_params", {})
141
  n_trials = training_cfg.get("optuna_trials", 40)
142
- boosting_rounds = training_cfg.get("boosting_rounds", 1500)
143
  early_stopping = training_cfg.get("early_stopping_rounds", 100)
144
  seed = config.get("seed", 42)
145
 
 
146
  n_train = len(train_df)
147
- n_tasks = len(target_names)
148
 
149
  train_preds = np.full((n_train, n_tasks), 0.5, dtype=np.float32)
150
  val_preds = (
@@ -156,72 +177,74 @@ def train_stage_one_models(
156
  metrics: Dict[str, Dict] = {}
157
  params_dump: Dict[str, Dict] = {}
158
 
159
- for task_idx, task_name in enumerate(target_names):
160
- train_mask = train_df[task_name].notna().values
161
- if val_df is None or val_features is None:
162
- metrics[task_name] = {"status": "skipped", "reason": "missing validation split"}
163
- continue
164
-
165
- val_mask = val_df[task_name].notna().values
166
- if train_mask.sum() < 2 or val_mask.sum() < 2:
167
- metrics[task_name] = {"status": "skipped", "reason": "insufficient labeled data"}
168
- continue
169
-
170
- X_train_task = train_features[train_mask]
171
- y_train_task = train_df.loc[train_mask, task_name].astype(float).values
172
- X_val_task = val_features[val_mask]
173
- y_val_task = val_df.loc[val_mask, task_name].astype(float).values
174
-
175
- if len(np.unique(y_train_task)) < 2 or len(np.unique(y_val_task)) < 2:
176
- metrics[task_name] = {"status": "skipped", "reason": "single-class labels"}
177
- continue
178
-
179
- task_result = train_lightgbm_task(
180
- X_train_task,
181
- y_train_task,
182
- X_val_task,
183
- y_val_task,
184
- base_params=base_params,
185
- boosting_rounds=boosting_rounds,
186
- early_stopping_rounds=early_stopping,
187
- n_trials=n_trials,
188
- seed=seed,
189
- )
190
-
191
- if task_result is None:
192
- metrics[task_name] = {"status": "skipped", "reason": "training failed"}
193
- continue
194
 
195
- model = task_result.model
196
- best_iter = task_result.best_iteration
 
197
 
198
- model_path = stage_dir / f"{task_name}.pkl"
199
- joblib.dump(model, model_path)
200
 
201
- params_dump[task_name] = {
202
- **task_result.best_params,
203
- "best_iteration": best_iter,
204
- "val_auc": task_result.val_auc,
205
- }
206
 
207
- full_train_preds = model.predict_proba(
208
- train_features,
209
- num_iteration=best_iter,
210
- )[:, 1]
211
- train_preds[:, task_idx] = full_train_preds.astype(np.float32)
212
 
213
- if val_preds is not None:
214
- full_val_preds = model.predict_proba(
215
- val_features,
216
  num_iteration=best_iter,
217
  )[:, 1]
218
- val_preds[:, task_idx] = full_val_preds.astype(np.float32)
219
-
220
- metrics[task_name] = {
221
- "val_auc": task_result.val_auc,
222
- "n_train_samples": int(train_mask.sum()),
223
- "n_val_samples": int(val_mask.sum()),
224
- }
 
 
 
 
 
 
 
225
 
226
  save_stage_metrics(metrics, checkpoint_dir / "metrics_stage1.json")
227
  params_path = checkpoint_dir / "stage1_params.json"
 
11
  import optuna
12
  import pandas as pd
13
  from sklearn.metrics import roc_auc_score
14
+ from tqdm import tqdm
15
 
16
  from .constants import TARGET_NAMES
17
 
 
24
  best_params: Dict
25
 
26
 
27
+ def resolve_n_estimators(training_cfg: Dict) -> Sequence[int]:
28
+ """Normalize the n_estimators config entry into a non-empty list of ints."""
29
+ if "n_estimators" in training_cfg:
30
+ raw_value = training_cfg["n_estimators"]
31
+ elif "boosting_rounds" in training_cfg:
32
+ raw_value = training_cfg["boosting_rounds"]
33
+ else:
34
+ raw_value = [50, 500, 1000]
35
+
36
+ if isinstance(raw_value, int):
37
+ choices = [int(raw_value)]
38
+ elif isinstance(raw_value, Sequence) and not isinstance(raw_value, (str, bytes)):
39
+ choices = [int(v) for v in raw_value]
40
+ else:
41
+ raise ValueError("training.n_estimators must be an int or a sequence of ints")
42
+
43
+ choices = [v for v in choices if v > 0]
44
+ if not choices:
45
+ raise ValueError("training.n_estimators must contain at least one positive value")
46
+ return choices
47
+
48
+
49
+ def _sample_hyperparams(trial: optuna.Trial, base_params: Dict, n_estimators_choices: Sequence[int]) -> Dict:
50
  params = dict(base_params)
51
  params.update(
52
  {
 
59
  "bagging_freq": trial.suggest_int("bagging_freq", 1, 10),
60
  "reg_alpha": trial.suggest_float("reg_alpha", 1e-8, 10.0, log=True),
61
  "reg_lambda": trial.suggest_float("reg_lambda", 1e-8, 10.0, log=True),
62
+ "n_estimators": trial.suggest_categorical("n_estimators", list(n_estimators_choices)),
63
  }
64
  )
65
  params.setdefault("objective", "binary")
 
76
  X_val: np.ndarray,
77
  y_val: np.ndarray,
78
  base_params: Dict,
79
+ n_estimators_choices: Sequence[int],
80
  early_stopping_rounds: int,
81
  n_trials: int,
82
  seed: int,
 
85
  return None
86
 
87
  def objective(trial: optuna.Trial) -> float:
88
+ params = _sample_hyperparams(trial, base_params, n_estimators_choices)
 
89
  params["random_state"] = seed
90
  model = lgb.LGBMClassifier(**params)
91
  model.fit(
 
100
  verbose=False,
101
  )
102
  ],
 
103
  )
104
+ best_iter = getattr(model, "best_iteration_", params["n_estimators"])
105
  preds = model.predict_proba(X_val, num_iteration=best_iter)[:, 1]
106
  return float(roc_auc_score(y_val, preds))
107
 
108
  study = optuna.create_study(direction="maximize")
109
  study.optimize(objective, n_trials=n_trials, show_progress_bar=False)
110
 
111
+    best_params = _sample_hyperparams(study.best_trial, base_params, n_estimators_choices)
     best_params["random_state"] = seed

     final_model = lgb.LGBMClassifier(**best_params)

                 verbose=False,
             )
         ],
     )

+    best_iteration = getattr(final_model, "best_iteration_", best_params["n_estimators"])
     val_preds = final_model.predict_proba(X_val, num_iteration=best_iteration)[:, 1]
     val_auc = roc_auc_score(y_val, val_preds)

     training_cfg = config.get("training", {})
     base_params = training_cfg.get("lightgbm_params", {})
     n_trials = training_cfg.get("optuna_trials", 40)
+    n_estimators_choices = resolve_n_estimators(training_cfg)
     early_stopping = training_cfg.get("early_stopping_rounds", 100)
     seed = config.get("seed", 42)

+    task_list = list(target_names)
     n_train = len(train_df)
+    n_tasks = len(task_list)

     train_preds = np.full((n_train, n_tasks), 0.5, dtype=np.float32)
     val_preds = (

     metrics: Dict[str, Dict] = {}
     params_dump: Dict[str, Dict] = {}

+    with tqdm(task_list, desc="Stage 1", unit="task") as progress_bar:
+        for task_idx, task_name in enumerate(progress_bar):
+            progress_bar.set_postfix(task=task_name)
+            train_mask = train_df[task_name].notna().values
+            if val_df is None or val_features is None:
+                metrics[task_name] = {"status": "skipped", "reason": "missing validation split"}
+                continue
+
+            val_mask = val_df[task_name].notna().values
+            if train_mask.sum() < 2 or val_mask.sum() < 2:
+                metrics[task_name] = {"status": "skipped", "reason": "insufficient labeled data"}
+                continue
+
+            X_train_task = train_features[train_mask]
+            y_train_task = train_df.loc[train_mask, task_name].astype(float).values
+            X_val_task = val_features[val_mask]
+            y_val_task = val_df.loc[val_mask, task_name].astype(float).values
+
+            if len(np.unique(y_train_task)) < 2 or len(np.unique(y_val_task)) < 2:
+                metrics[task_name] = {"status": "skipped", "reason": "single-class labels"}
+                continue
+
+            task_result = train_lightgbm_task(
+                X_train_task,
+                y_train_task,
+                X_val_task,
+                y_val_task,
+                base_params=base_params,
+                n_estimators_choices=n_estimators_choices,
+                early_stopping_rounds=early_stopping,
+                n_trials=n_trials,
+                seed=seed,
+            )
+
+            if task_result is None:
+                metrics[task_name] = {"status": "skipped", "reason": "training failed"}
+                continue
+
+            model = task_result.model
+            best_iter = task_result.best_iteration
+
+            model_path = stage_dir / f"{task_name}.pkl"
+            joblib.dump(model, model_path)
+
+            params_dump[task_name] = {
+                **task_result.best_params,
+                "best_iteration": best_iter,
+                "val_auc": task_result.val_auc,
+            }
+
+            full_train_preds = model.predict_proba(
+                train_features,
                 num_iteration=best_iter,
             )[:, 1]
+            train_preds[:, task_idx] = full_train_preds.astype(np.float32)
+
+            if val_preds is not None:
+                full_val_preds = model.predict_proba(
+                    val_features,
+                    num_iteration=best_iter,
+                )[:, 1]
+                val_preds[:, task_idx] = full_val_preds.astype(np.float32)
+
+            metrics[task_name] = {
+                "val_auc": task_result.val_auc,
+                "n_train_samples": int(train_mask.sum()),
+                "n_val_samples": int(val_mask.sum()),
+            }

     save_stage_metrics(metrics, checkpoint_dir / "metrics_stage1.json")
     params_path = checkpoint_dir / "stage1_params.json"
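The diff introduces a `resolve_n_estimators` helper but its body is not part of this commit. A minimal sketch of what such a helper might do, assuming it maps the training config to a list of `n_estimators` candidates for Optuna's categorical sampling (the key names and the 1500 fallback are guesses taken from the old `boosting_rounds` default, not the repository's actual code):

```python
from typing import Dict, List


def resolve_n_estimators(training_cfg: Dict) -> List[int]:
    """Return candidate n_estimators values for the hyperparameter search.

    Falls back to the legacy ``boosting_rounds`` key, then to 1500, so old
    configs keep working; a single int becomes a one-element choice list.
    """
    value = training_cfg.get("n_estimators", training_cfg.get("boosting_rounds", 1500))
    if isinstance(value, int):
        return [value]
    return [int(v) for v in value]


choices = resolve_n_estimators({"boosting_rounds": 800})
```

With a list-valued config entry, every trial can then pick its own boosting budget via `trial.suggest_categorical`, which is why the fixed `boosting_rounds` argument disappears from `train_lightgbm_task`.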
src/stage_two.py CHANGED
@@ -6,10 +6,10 @@ from typing import Dict, Optional, Sequence
 import joblib
 import numpy as np
 import pandas as pd
-from sklearn.metrics import roc_auc_score
+from tqdm import tqdm

 from .constants import TARGET_NAMES
-from .lightgbm_trainer import save_stage_metrics, train_lightgbm_task
+from .lightgbm_trainer import resolve_n_estimators, save_stage_metrics, train_lightgbm_task


 def _build_augmented_matrix(base_features: np.ndarray, prediction_matrix: np.ndarray, target_idx: int) -> np.ndarray:
@@ -32,75 +32,75 @@ def train_stage_two_models(
     training_cfg = config.get("training", {})
     base_params = training_cfg.get("lightgbm_params", {})
     n_trials = training_cfg.get("optuna_trials", 40)
-    boosting_rounds = training_cfg.get("boosting_rounds", 1500)
+    n_estimators_choices = resolve_n_estimators(training_cfg)
     early_stopping = training_cfg.get("early_stopping_rounds", 100)
     seed = config.get("seed", 42)

     stage_dir = checkpoint_dir / "stage2"
     stage_dir.mkdir(parents=True, exist_ok=True)

-    n_train = len(train_df)
-    n_val = len(val_df) if val_df is not None else 0
-
     metrics: Dict[str, Dict] = {}
-
-    for task_idx, task_name in enumerate(target_names):
-        mask = train_df[task_name].notna().values
-        if mask.sum() == 0:
-            metrics[task_name] = {"status": "skipped", "reason": "no labels"}
-            continue
-
-        augmented_train_matrix = _build_augmented_matrix(
-            train_features[mask],
-            stage1_train_preds[mask],
-            task_idx,
-        )
-        y_train = train_df.loc[mask, task_name].astype(float).values
-
-        if (
-            val_features is None
-            or val_df is None
-            or stage1_val_preds is None
-            or val_df[task_name].notna().sum() < 2
-        ):
-            metrics[task_name] = {"status": "skipped", "reason": "missing validation data"}
-            continue
-
-        val_mask = val_df[task_name].notna().values
-        augmented_val_matrix = _build_augmented_matrix(
-            val_features[val_mask],
-            stage1_val_preds[val_mask],
-            task_idx,
-        )
-        y_val = val_df.loc[val_mask, task_name].astype(float).values
-
-        if len(np.unique(y_val)) < 2 or len(np.unique(y_train)) < 2:
-            metrics[task_name] = {"status": "skipped", "reason": "single-class labels"}
-            continue
-
-        task_result = train_lightgbm_task(
-            augmented_train_matrix,
-            y_train,
-            augmented_val_matrix,
-            y_val,
-            base_params=base_params,
-            boosting_rounds=boosting_rounds,
-            early_stopping_rounds=early_stopping,
-            n_trials=n_trials,
-            seed=seed,
-        )
-
-        if task_result is None:
-            metrics[task_name] = {"status": "skipped", "reason": "training failed"}
-            continue
-
-        model_path = stage_dir / f"{task_name}.pkl"
-        joblib.dump(task_result.model, model_path)
-
-        metrics[task_name] = {
-            "val_auc": task_result.val_auc,
-            "best_iteration": int(task_result.best_iteration),
-        }
+    task_list = list(target_names)
+
+    with tqdm(task_list, desc="Stage 2", unit="task") as progress_bar:
+        for task_idx, task_name in enumerate(progress_bar):
+            progress_bar.set_postfix(task=task_name)
+            mask = train_df[task_name].notna().values
+            if mask.sum() == 0:
+                metrics[task_name] = {"status": "skipped", "reason": "no labels"}
+                continue
+
+            augmented_train_matrix = _build_augmented_matrix(
+                train_features[mask],
+                stage1_train_preds[mask],
+                task_idx,
+            )
+            y_train = train_df.loc[mask, task_name].astype(float).values
+
+            if (
+                val_features is None
+                or val_df is None
+                or stage1_val_preds is None
+                or val_df[task_name].notna().sum() < 2
+            ):
+                metrics[task_name] = {"status": "skipped", "reason": "missing validation data"}
+                continue
+
+            val_mask = val_df[task_name].notna().values
+            augmented_val_matrix = _build_augmented_matrix(
+                val_features[val_mask],
+                stage1_val_preds[val_mask],
+                task_idx,
+            )
+            y_val = val_df.loc[val_mask, task_name].astype(float).values
+
+            if len(np.unique(y_val)) < 2 or len(np.unique(y_train)) < 2:
+                metrics[task_name] = {"status": "skipped", "reason": "single-class labels"}
+                continue
+
+            task_result = train_lightgbm_task(
+                augmented_train_matrix,
+                y_train,
+                augmented_val_matrix,
+                y_val,
+                base_params=base_params,
+                n_estimators_choices=n_estimators_choices,
+                early_stopping_rounds=early_stopping,
+                n_trials=n_trials,
+                seed=seed,
+            )
+
+            if task_result is None:
+                metrics[task_name] = {"status": "skipped", "reason": "training failed"}
+                continue
+
+            model_path = stage_dir / f"{task_name}.pkl"
+            joblib.dump(task_result.model, model_path)
+
+            metrics[task_name] = {
+                "val_auc": task_result.val_auc,
+                "best_iteration": int(task_result.best_iteration),
+            }

     save_stage_metrics(metrics, checkpoint_dir / "metrics_stage2.json")
     return {"metrics": metrics}
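The body of `_build_augmented_matrix` is outside this diff; only its signature appears. Based on the README's description of stage two (the fingerprint vector is augmented with the predictions of the *other* tasks), a plausible standalone sketch looks like this; the function name, the `float32` cast, and the column-dropping detail are assumptions, not the repository's actual implementation:

```python
import numpy as np


def build_augmented_matrix(base_features: np.ndarray, prediction_matrix: np.ndarray, target_idx: int) -> np.ndarray:
    """Concatenate base features with the stage-1 predictions of the other tasks.

    The target task's own prediction column is dropped so the stage-2 model
    never sees a stage-1 estimate of the very label it is trying to learn.
    """
    other_task_preds = np.delete(prediction_matrix, target_idx, axis=1)
    return np.hstack([base_features, other_task_preds]).astype(np.float32)


# With 12 Tox21 tasks, each stage-2 model would see the fingerprint plus 11 extra columns.
features = np.zeros((5, 2048), dtype=np.float32)       # e.g. 2048-bit ECFP4 fingerprints
stage1_preds = np.full((5, 12), 0.5, dtype=np.float32)  # neutral stage-1 probabilities
augmented = build_augmented_matrix(features, stage1_preds, target_idx=3)
```

This also explains why the stage-2 loop indexes `stage1_train_preds[mask]` with the same row mask as the features: the two matrices must stay row-aligned before concatenation.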