Upload README.md with huggingface_hub
README.md
ADDED
@@ -0,0 +1,129 @@
---
language: en
license: mit
tags:
- chemistry
- toxicology
- molecular-property-prediction
- tox21
- pytorch
datasets:
- scikit-fingerprints/MoleculeNet_Tox21
metrics:
- roc_auc
---

# Molecular Toxicity Predictor

Multi-task neural network that screens small molecules across the **Tox21** panel of 12 in-vitro toxicity assays.

## Model Description

- **Architecture:** Multi-task MLP, 2048 → 1024 → 512 → 12, with batch normalization, ReLU, and dropout (0.3) after each hidden layer
- **Input:** Morgan fingerprints (ECFP4, radius=2, 2048 bits) via RDKit
- **Output:** One logit per assay; a sigmoid converts each logit to a probability of activity for the 12 toxicity assays
- **Loss:** Masked binary cross-entropy (NaN labels are excluded per task)
- **Metric:** Mean AUC-ROC across tasks

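The masked loss can be illustrated with a small framework-free sketch. It assumes labels are floats with NaN marking untested assays, and it averages the binary cross-entropy over the observed entries only. The name `masked_bce` and the choice to average over all observed entries at once (rather than per task first) are illustrative assumptions, not necessarily what the training script does.

```python
import math


def masked_bce(logits, labels):
    """Mean binary cross-entropy over entries whose label is not NaN.

    logits, labels: lists of rows, one row per molecule, one column per assay.
    """
    total, count = 0.0, 0
    for logit_row, label_row in zip(logits, labels):
        for z, y in zip(logit_row, label_row):
            if math.isnan(y):
                continue  # untested assay: contributes nothing to the loss
            # Numerically stable BCE on a logit:
            # max(z, 0) - z*y + log(1 + exp(-|z|))
            total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
            count += 1
    return total / count if count else 0.0


nan = float("nan")
logits = [[0.0, 2.0], [-1.0, 0.5]]
labels = [[1.0, nan], [0.0, nan]]  # second assay untested for both molecules
loss = masked_bce(logits, labels)
print(round(loss, 4))
```

In PyTorch the same idea is commonly written with `torch.nn.functional.binary_cross_entropy_with_logits(..., reduction="none")` followed by a multiplication with a 0/1 mask; whether this model was trained that way is not stated here.
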
## Tox21 Assays

| Task | Target |
|------|--------|
| NR-AR | Androgen receptor |
| NR-AR-LBD | Androgen receptor ligand-binding domain |
| NR-AhR | Aryl hydrocarbon receptor |
| NR-Aromatase | Aromatase enzyme inhibition |
| NR-ER | Estrogen receptor alpha |
| NR-ER-LBD | Estrogen receptor ligand-binding domain |
| NR-PPAR-gamma | Peroxisome proliferator-activated receptor gamma |
| SR-ARE | Antioxidant response element |
| SR-ATAD5 | Genotoxicity / DNA damage (ATAD5 reporter) |
| SR-HSE | Heat shock response element |
| SR-MMP | Mitochondrial membrane potential |
| SR-p53 | p53 tumour suppressor / DNA damage |

## Training Data

- **Dataset:** [Tox21 via MoleculeNet](https://huggingface.co/datasets/scikit-fingerprints/MoleculeNet_Tox21)
- **Train:** 6,264 molecules | **Val:** 783 | **Test:** 784 (80/10/10 split from 7,831 total)
- Labels are sparse: about 17% are NaN on average; untested entries are treated as missing, not negative

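The split sizes follow from applying the 80/10/10 proportions to 7,831 molecules. A minimal sketch of the arithmetic, assuming the train and validation counts are floored and the test set takes the remainder (the actual splitting code, and any shuffling or seeding, are not shown in this card; `split_sizes` is an illustrative name):

```python
def split_sizes(n_total, train_frac=0.8, val_frac=0.1):
    """Return (train, val, test) counts; test takes whatever is left over."""
    n_train = int(n_total * train_frac)  # floor of 80%
    n_val = int(n_total * val_frac)      # floor of 10%
    n_test = n_total - n_train - n_val   # remainder
    return n_train, n_val, n_test


sizes = split_sizes(7831)
print(sizes)  # (6264, 783, 784)
```
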
## Usage

```python
import torch
import torch.nn as nn
from huggingface_hub import hf_hub_download
from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator

# Assay names, in the order they are paired with the model's 12 outputs.
TASKS = [
    "NR-AR", "NR-AR-LBD", "NR-AhR", "NR-Aromatase",
    "NR-ER", "NR-ER-LBD", "NR-PPAR-gamma",
    "SR-ARE", "SR-ATAD5", "SR-HSE", "SR-MMP", "SR-p53",
]


class ToxMLP(nn.Module):
    """Multi-task MLP: 2048 -> 1024 -> 512 -> 12 logits."""

    def __init__(self, n_inputs=2048, n_outputs=12, dropout=0.3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_inputs, 1024), nn.BatchNorm1d(1024), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(1024, 512), nn.BatchNorm1d(512), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(512, n_outputs),
        )

    def forward(self, x):
        return self.net(x)


# Download the weights and switch to inference mode (dropout is disabled and
# batch norm uses its running statistics).
model_path = hf_hub_download("Hari5115/molecular-toxicity-predictor", "best_model.pt")
model = ToxMLP()
model.load_state_dict(torch.load(model_path, map_location="cpu"))
model.eval()

# Featurize a molecule as a 2048-bit Morgan fingerprint (radius 2).
fp_gen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)
mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
if mol is None:  # MolFromSmiles returns None for SMILES it cannot parse
    raise ValueError("Could not parse SMILES")
fp = torch.tensor([list(fp_gen.GetFingerprint(mol))], dtype=torch.float32)

# The network outputs logits; the sigmoid turns them into probabilities.
with torch.no_grad():
    probs = torch.sigmoid(model(fp)).squeeze(0).numpy()

for task, p in zip(TASKS, probs):
    print(f"{task}: {p:.1%}")
```

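Once per-assay probabilities are available, they can be summarised in a short report. The sketch below is a hypothetical helper: `rank_assays` and the 0.5 cutoff are illustrative assumptions rather than part of the released model, and the probabilities are made up for the demonstration. It sorts assays by predicted probability and marks those at or above the cutoff.

```python
def rank_assays(tasks, probs, cutoff=0.5):
    """Return (task, probability, flagged) tuples, highest probability first."""
    ranked = sorted(zip(tasks, probs), key=lambda pair: pair[1], reverse=True)
    return [(task, float(p), float(p) >= cutoff) for task, p in ranked]


# Made-up probabilities for three assays, purely for illustration.
demo = rank_assays(["NR-AR", "SR-MMP", "SR-p53"], [0.05, 0.62, 0.31])
for task, p, flagged in demo:
    print(f"{task:8s} {p:6.1%} {'flagged' if flagged else ''}")
```

The cutoff is arbitrary; this card reports ranking quality (AUC-ROC) only, not a calibrated decision threshold.
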
## Results

Test set performance (mean AUC-ROC: **0.8170**):

| Task | AUC-ROC |
|------|---------|
| NR-AR | 0.752 |
| NR-AR-LBD | 0.890 |
| NR-AhR | 0.892 |
| NR-Aromatase | 0.792 |
| NR-ER | 0.751 |
| NR-ER-LBD | 0.808 |
| NR-PPAR-gamma | 0.742 |
| SR-ARE | 0.803 |
| SR-ATAD5 | 0.847 |
| SR-HSE | 0.811 |
| SR-MMP | 0.847 |
| SR-p53 | 0.868 |

Validation mean AUC-ROC: Random Forest baseline 0.8306, MLP **0.8544**.

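The headline metric is the unweighted mean of the per-task AUC-ROC values. A dependency-free sketch of that computation follows, assuming NaN marks a missing label and that each task's AUC is computed only over its labelled molecules. `auc_roc` and `mean_auc` are illustrative names, and the toy labels and scores are invented; the evaluation script itself may well use a library routine such as scikit-learn's `roc_auc_score` instead.

```python
import math


def auc_roc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative."""
    pairs = [(y, s) for y, s in zip(labels, scores) if not math.isnan(y)]
    pos = [s for y, s in pairs if y == 1]
    neg = [s for y, s in pairs if y == 0]
    if not pos or not neg:
        return float("nan")  # undefined when only one class is present
    # Ties between a positive and a negative count as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_auc(label_cols, score_cols):
    """Unweighted mean over tasks, skipping tasks where AUC is undefined."""
    aucs = [auc_roc(y, s) for y, s in zip(label_cols, score_cols)]
    valid = [a for a in aucs if not math.isnan(a)]
    return sum(valid) / len(valid) if valid else float("nan")


nan = float("nan")
labels = [[1, 0, nan, 1, 0], [0, nan, 1, 0, 1]]  # one list per task (toy data)
scores = [[0.9, 0.2, 0.5, 0.4, 0.6], [0.1, 0.8, 0.7, 0.3, 0.2]]
result = mean_auc(labels, scores)
print(result)
```

Averaging the twelve rounded per-task values in the table gives about 0.8169, which agrees with the reported 0.8170 up to rounding.
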
## Limitations

- Morgan fingerprints capture local chemical structure but miss 3D conformation and long-range interactions.
- The model is trained on a relatively small dataset (6,264 training molecules); extrapolation to novel chemical classes may be unreliable.
- **For research and educational purposes only. Not a substitute for certified toxicological testing.**

## Dataset Credit

Tox21 data provided by the NIH National Center for Advancing Translational Sciences (NCATS).
Accessed via [scikit-fingerprints/MoleculeNet_Tox21](https://huggingface.co/datasets/scikit-fingerprints/MoleculeNet_Tox21) on Hugging Face.

## License

MIT