Missing training code

#1
by antoniaebner - opened

Thank you for your submission to the Tox21 Leaderboard!

To complete the review process, we need to verify your code and reproduce the trained model. We kindly ask you to upload the training code used for your submission.

antoniaebner changed discussion title from Add training code to Missing training code
Rasayan Labs Inc. org

Hi @antoniaebner,
Thank you for reviewing our submission!

We have uploaded the complete training code to the model repository:
https://huggingface.co/rasayan-labs/rasayan-tox21-snn/tree/main/training

The training/ folder contains:

  • train.py - Main 40-fold cross-validation training script

  • create_ensemble.py - Script to create top-10 ensemble from fold checkpoints

  • src/ - Model architecture, trainer, and feature extraction code

  • TRAIN.md - Detailed instructions for reproducing the model

Please let us know if you need any additional information.
Best regards,
Rasayan Labs

Thank you for the quick response and upload!
We will inform you as soon as we have finished reviewing your submission.

Best regards,
Antonia Ebner

I ran into some problems with the second step of "Data Preparation": for example, I obtained only 11,369 features instead of the 11,377 you describe. Is this a typo in your description?
Furthermore, I kindly ask you to:

  1. add a download link to the specific dataset you used for training your model
  2. extend enhanced_preprocess.py to include the complete preprocessing pipeline of the downloaded data as outlined in TRAIN.md (e.g. also the split into X_orig.npy, labels_orig.npy, etc.)

Thank you and best regards,
Antonia Ebner

Rasayan Labs Inc. org

Thank you for testing the reproduction pipeline.

  1. Feature Count (11,377 vs 11,369)
    11,377 is not a typo. The exact count depends on RDKit version, as
    FilterCatalog entry counts vary between releases. Our breakdown with
    RDKit 2025.09.3:
    ┌───────────────────┬────────┐
    │ Feature           │ Count  │
    ├───────────────────┼────────┤
    │ ECFP6             │  8,192 │
    │ MACCS             │    167 │
    │ RDKit descriptors │    208 │
    │ Toxicophores      │  1,868 │
    │ RDKit filters     │    815 │
    │ Similarity        │     41 │
    │ Max similarity    │     12 │
    │ DB similarity     │     74 │
    ├───────────────────┼────────┤
    │ Total             │ 11,377 │
    └───────────────────┴────────┘
    The 8-feature difference is likely from different filter catalog sizes
    in your RDKit version.

  2. Dataset Download
    MoleculeNet Tox21 (7,831 compounds):

  3. Complete Preprocessing Script
    The complete pipeline is preprocess.py (not enhanced_preprocess.py,
    which is a utility module). Running
        python preprocess.py --data_dir ./data --output_dir ./features
    creates X_orig.npy, labels_orig.npy, mask_orig.npy, and valid_orig.npy.
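
For anyone comparing feature counts locally, the arithmetic in point 1 can be checked with a small sketch like the one below. The per-group counts and the 11,369 figure are taken from this thread. The helper `rdkit_filter_catalog_size` is a hypothetical name, and loading `FilterCatalogs.ALL` is an assumption: the thread does not say which filter catalogs the pipeline actually uses, so the number it prints may not correspond to the 815 "RDKit filters" entry. It is only meant to show how the installed RDKit version and its catalog size can be inspected.

```python
# Per-group feature counts as reported in this thread (RDKit 2025.09.3).
REPORTED_COUNTS = {
    "ECFP6": 8192,
    "MACCS": 167,
    "RDKit descriptors": 208,
    "Toxicophores": 1868,
    "RDKit filters": 815,
    "Similarity": 41,
    "Max similarity": 12,
    "DB similarity": 74,
}

# Feature count obtained by the reviewer, as quoted in this thread.
OBSERVED_TOTAL = 11_369


def rdkit_filter_catalog_size():
    """Return (rdkit_version, n_entries), or None if RDKit is not installed.

    Assumption: all built-in catalogs are loaded; the actual pipeline may
    use a different selection.
    """
    try:
        import rdkit
        from rdkit.Chem.FilterCatalog import FilterCatalog, FilterCatalogParams
    except ImportError:
        return None
    params = FilterCatalogParams()
    params.AddCatalog(FilterCatalogParams.FilterCatalogs.ALL)
    return rdkit.__version__, FilterCatalog(params).GetNumEntries()


total = sum(REPORTED_COUNTS.values())
gap = total - OBSERVED_TOTAL
print(f"reported total: {total}, observed: {OBSERVED_TOTAL}, gap: {gap}")
print("local RDKit filter catalog:", rdkit_filter_catalog_size())
```

Running this under both RDKit versions and comparing the printed catalog sizes is one way to see whether the gap of 8 features could come from the filter catalogs, as suggested above.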
