Missing training code

by antoniaebner - opened Feb 5

Feb 5

Thank you for your submission to the Tox21 Leaderboard!

To complete the review process, we need to verify your code and reproduce the trained model. We kindly ask you to upload the training code used for your submission.

antoniaebner changed discussion title from Add training code to Missing training code Feb 5

aarshit-mittal

Rasayan Labs Inc. org Feb 5

Hi @antoniaebner,
Thank you for reviewing our submission!

We have uploaded the complete training code to the model repository:
https://huggingface.co/rasayan-labs/rasayan-tox21-snn/tree/main/training

The training/ folder contains:

train.py - Main 40-fold cross-validation training script
create_ensemble.py - Script to create the top-10 ensemble from fold checkpoints
src/ - Model architecture, trainer, and feature extraction code
TRAIN.md - Detailed instructions for reproducing the model
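
For readers unfamiliar with this setup, here is a minimal sketch of what a top-10 ensemble over 40 fold checkpoints can look like. It assumes each fold has a single validation score and that predictions are combined by plain averaging; both are assumptions made for illustration, and the function names are hypothetical. The actual logic is whatever create_ensemble.py implements.

```python
from statistics import mean


def select_top_folds(fold_scores, k=10):
    """Return the indices of the k folds with the highest validation score."""
    ranked = sorted(range(len(fold_scores)),
                    key=lambda i: fold_scores[i], reverse=True)
    return ranked[:k]


def ensemble_predict(fold_predictions, selected):
    """Average per-sample predictions across the selected folds."""
    n_samples = len(fold_predictions[selected[0]])
    return [mean(fold_predictions[f][s] for f in selected)
            for s in range(n_samples)]


# Hypothetical data: 40 folds, 3 samples per fold.
scores = [0.70 + 0.005 * i for i in range(40)]
preds = [[i / 40.0, 0.5, 1.0 - i / 40.0] for i in range(40)]

top = select_top_folds(scores, k=10)   # folds 39 down to 30
avg = ensemble_predict(preds, top)     # one averaged prediction per sample
```

Averaging only the best folds trades some diversity for a lower chance of including poorly converged checkpoints; other weighting schemes are equally possible.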

Please let us know if you need any additional information.
Best regards,
Rasayan Labs

antoniaebner

Feb 5

Thank you for the quick response and upload!
We will inform you as soon as we have finished reviewing your submission.

Best regards,
Antonia Ebner

antoniaebner

Feb 5

I encountered some problems with the 2nd step in "Data Preparation": for example, I obtained only 11,369 features instead of the 11,377 you describe. Was this a typo in your description?
Furthermore, I kindly ask you to:

add a download link to the specific dataset you used for training your model
extend enhanced_preprocess.py to include the complete preprocessing pipeline of the downloaded data as outlined in TRAIN.md (e.g. also the split into X_orig.npy, labels_orig.npy, etc.)

Thank you and best regards,
Antonia Ebner

aarshit-mittal

Rasayan Labs Inc. org Feb 5

Thank you for testing the reproduction pipeline.

1. Feature Count (11,377 vs 11,369)
11,377 is not a typo. The exact count depends on RDKit version, as
FilterCatalog entry counts vary between releases. Our breakdown with
RDKit 2025.09.3:
┌───────────────────┬────────┐
│ Feature           │  Count │
├───────────────────┼────────┤
│ ECFP6             │  8,192 │
│ MACCS             │    167 │
│ RDKit descriptors │    208 │
│ Toxicophores      │  1,868 │
│ RDKit filters     │    815 │
│ Similarity        │     41 │
│ Max similarity    │     12 │
│ DB similarity     │     74 │
├───────────────────┼────────┤
│ Total             │ 11,377 │
└───────────────────┴────────┘
The 8-feature difference is likely from different filter catalog sizes
in your RDKit version.
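
To compare environments, a small check along these lines may help. It sums the feature groups from the table above and, if RDKit is installed, prints the RDKit version together with the size of its filter catalog. Querying the combined `ALL` catalog is an assumption made for this sketch; the pipeline may build its filter features from a different set of catalogs, so the number printed need not equal 815.

```python
# Feature-group sizes as reported in the table above (RDKit 2025.09.3).
FEATURE_COUNTS = {
    "ECFP6": 8192,
    "MACCS": 167,
    "RDKit descriptors": 208,
    "Toxicophores": 1868,
    "RDKit filters": 815,
    "Similarity": 41,
    "Max similarity": 12,
    "DB similarity": 74,
}


def total_features(counts):
    """Sum the per-group feature counts."""
    return sum(counts.values())


def filter_catalog_size():
    """Return (rdkit_version, n_entries), or None if RDKit is not installed.

    Assumption: the combined ALL catalog is queried here; the actual
    pipeline may use a different selection of catalogs.
    """
    try:
        import rdkit
        from rdkit.Chem.FilterCatalog import FilterCatalog, FilterCatalogParams
    except ImportError:
        return None
    params = FilterCatalogParams()
    params.AddCatalog(FilterCatalogParams.FilterCatalogs.ALL)
    return rdkit.__version__, FilterCatalog(params).GetNumEntries()


expected = total_features(FEATURE_COUNTS)
observed = 11369  # count reported by the reviewer
print(expected, expected - observed)  # 11377 8
print(filter_catalog_size())
```

Running this in both environments and comparing the printed catalog size would show whether the filter catalog accounts for the 8-feature gap.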
2. Dataset Download
MoleculeNet Tox21 (7,831 compounds):

https://huggingface.co/datasets/scikit-fingerprints/MoleculeNet_Tox21

3. Complete Preprocessing Script
The complete pipeline is preprocess.py (not enhanced_preprocess.py,
which is a utility module). Running
    python preprocess.py --data_dir ./data --output_dir ./features
creates X_orig.npy, labels_orig.npy, mask_orig.npy, and valid_orig.npy.
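
After running the script, a quick sanity check like the following can confirm that all four files exist and that the feature matrix has the expected number of columns. This is only a sketch: the helper names are hypothetical, and it assumes X_orig.npy is a 2-D array with one column per feature, which is not stated above.

```python
from pathlib import Path
import tempfile

EXPECTED_OUTPUTS = ("X_orig.npy", "labels_orig.npy",
                    "mask_orig.npy", "valid_orig.npy")


def missing_outputs(output_dir):
    """List the expected output files that are absent from output_dir."""
    out = Path(output_dir)
    return [name for name in EXPECTED_OUTPUTS if not (out / name).is_file()]


def feature_count(output_dir):
    """Return the number of columns in X_orig.npy (assumes a 2-D array)."""
    import numpy as np  # imported lazily so the file check works without NumPy
    X = np.load(Path(output_dir) / "X_orig.npy", mmap_mode="r")
    return X.shape[1]


# Demo on a throwaway directory containing only an empty placeholder file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "X_orig.npy").touch()
    demo_missing = missing_outputs(tmp)
print(demo_missing)  # ['labels_orig.npy', 'mask_orig.npy', 'valid_orig.npy']
```

On a real run, `missing_outputs("./features")` should return an empty list, and `feature_count("./features")` can be compared against 11,377.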
