Missing training code
Thank you for your submission to the Tox21 Leaderboard!
To complete the review process, we need to verify your code and reproduce the trained model. We kindly ask you to upload the training code used for your submission.
Hi @antoniaebner,
Thank you for reviewing our submission!
We have uploaded the complete training code to the model repository:
https://huggingface.co/rasayan-labs/rasayan-tox21-snn/tree/main/training
The training/ folder contains:
- train.py - Main 40-fold cross-validation training script
- create_ensemble.py - Script to create top-10 ensemble from fold checkpoints
- src/ - Model architecture, trainer, and feature extraction code
- TRAIN.md - Detailed instructions for reproducing the model
Please let us know if you need any additional information.
Best regards,
Rasayan Labs
Thank you for the quick response and upload!
We will inform you as soon as we have finished reviewing your submission.
Best regards,
Antonia Ebner
I encountered some problems with the 2nd step in "Data Preparation"; for example, I obtained only 11,369 features instead of the 11,377 you describe. Was this a typo in your description?
Furthermore I kindly ask you to:
- add a download link to the specific dataset you used for training your model
- extend enhanced_preprocess.py to include the complete preprocessing pipeline of the downloaded data as outlined in TRAIN.md (e.g. also the split into X_orig.npy, labels_orig.npy, etc.)
Thank you and best regards,
Antonia Ebner
Thank you for testing the reproduction pipeline.
1. Feature Count (11,377 vs 11,369)
11,377 is not a typo. The exact count depends on RDKit version, as
FilterCatalog entry counts vary between releases. Our breakdown with
RDKit 2025.09.3:
┌───────────────────┬────────┐
│ Feature │ Count │
├───────────────────┼────────┤
│ ECFP6 │ 8,192 │
├───────────────────┼────────┤
│ MACCS │ 167 │
├───────────────────┼────────┤
│ RDKit descriptors │ 208 │
├───────────────────┼────────┤
│ Toxicophores │ 1,868 │
├───────────────────┼────────┤
│ RDKit filters │ 815 │
├───────────────────┼────────┤
│ Similarity │ 41 │
├───────────────────┼────────┤
│ Max similarity │ 12 │
├───────────────────┼────────┤
│ DB similarity │ 74 │
├───────────────────┼────────┤
│ Total │ 11,377 │
└───────────────────┴────────┘
The 8-feature difference is likely from different filter catalog sizes
in your RDKit version.
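As a quick sanity check of the breakdown above, here is a minimal sketch that sums the reported counts and, if RDKit is installed, prints the local RDKit version together with the number of entries in its combined filter catalog. The helper names are hypothetical, and the use of `FilterCatalogs.ALL` is an assumption: the pipeline may build its filter features from a different catalog selection, so the local entry count need not equal the 815 reported above.

```python
# Feature counts copied from the table above (reported for RDKit 2025.09.3).
REPORTED_COUNTS = {
    "ECFP6": 8192,
    "MACCS": 167,
    "RDKit descriptors": 208,
    "Toxicophores": 1868,
    "RDKit filters": 815,
    "Similarity": 41,
    "Max similarity": 12,
    "DB similarity": 74,
}


def total_features(counts):
    """Sum the per-group feature counts."""
    return sum(counts.values())


def local_filter_catalog_size():
    """Return (rdkit_version, catalog_entry_count), or None if RDKit is missing.

    Assumption: the combined catalog (FilterCatalogs.ALL) is used here only
    for illustration; the training pipeline may select catalogs differently.
    """
    try:
        import rdkit
        from rdkit.Chem.FilterCatalog import FilterCatalog, FilterCatalogParams
    except ImportError:
        return None
    params = FilterCatalogParams()
    params.AddCatalog(FilterCatalogParams.FilterCatalogs.ALL)
    return rdkit.__version__, FilterCatalog(params).GetNumEntries()


if __name__ == "__main__":
    total = total_features(REPORTED_COUNTS)
    print("reported total:", total)
    print("gap to 11,369:", total - 11369)
    print("local RDKit filter catalog:", local_filter_catalog_size())
```

Comparing the printed catalog size across two RDKit installations is one way to see whether the filter catalog could account for a difference in feature count.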
2. Dataset Download
MoleculeNet Tox21 (7,831 compounds):
3. Complete Preprocessing Script
The complete pipeline is preprocess.py (not enhanced_preprocess.py,
which is a utility module). Running:
python preprocess.py --data_dir ./data --output_dir ./features
Creates X_orig.npy, labels_orig.npy, mask_orig.npy, and valid_orig.npy.
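To confirm that a run produced all four files, something like the following sketch may help. It only reports the shape of each array and makes no assumption about what the shapes should be; the function name and the `./features` default are illustrative and simply mirror the `--output_dir` used in the command above.

```python
from pathlib import Path

import numpy as np

# File names as listed in the reply above.
EXPECTED_FILES = (
    "X_orig.npy",
    "labels_orig.npy",
    "mask_orig.npy",
    "valid_orig.npy",
)


def summarize_outputs(output_dir):
    """Map each expected file name to its array shape, or None if it is missing."""
    shapes = {}
    for name in EXPECTED_FILES:
        path = Path(output_dir) / name
        shapes[name] = np.load(path).shape if path.exists() else None
    return shapes


if __name__ == "__main__":
    for name, shape in summarize_outputs("./features").items():
        print(f"{name}: {shape}")
```

The second dimension of X_orig.npy, if it is the feature matrix, would be the number to compare against the feature count discussed in point 1.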