Hari5115 committed on
Commit 1b65bf7 · verified · 1 Parent(s): cc82328

Upload README.md with huggingface_hub

  1. README.md +129 -0
README.md ADDED
@@ -0,0 +1,129 @@
---
language: en
license: mit
tags:
- chemistry
- toxicology
- molecular-property-prediction
- tox21
- pytorch
datasets:
- scikit-fingerprints/MoleculeNet_Tox21
metrics:
- roc_auc
---

# Molecular Toxicity Predictor

A multi-task neural network that screens small molecules across the **Tox21** panel of 12 in-vitro toxicity assays.

## Model Description

- **Architecture:** Multi-task MLP, 2048 → 1024 → 512 → 12, with batch normalization, ReLU, and dropout after each hidden layer
- **Input:** Morgan fingerprints (ECFP4, radius=2, 2048 bits) computed with RDKit
- **Output:** Probability of activity for each of the 12 toxicity assays
- **Loss:** Masked binary cross-entropy (NaN labels are excluded per task)
- **Metric:** Mean AUC-ROC across tasks
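
The masked loss listed above can be sketched in NumPy as follows. This is a minimal illustration, not the training code from this repository: the function name `masked_bce_with_logits` is hypothetical, and averaging over all observed (molecule, task) entries rather than per task first is an assumption.

```python
import numpy as np

def masked_bce_with_logits(logits, labels):
    """Binary cross-entropy over observed labels only; NaN labels are skipped.

    Hypothetical helper: averages over every non-NaN (molecule, task) entry.
    """
    logits = np.asarray(logits, dtype=np.float64)
    labels = np.asarray(labels, dtype=np.float64)
    mask = ~np.isnan(labels)         # True where an assay result exists
    y = np.where(mask, labels, 0.0)  # placeholder value for masked entries
    # Numerically stable BCE on logits: max(x, 0) - x*y + log(1 + exp(-|x|))
    per_entry = np.maximum(logits, 0.0) - logits * y + np.log1p(np.exp(-np.abs(logits)))
    n_observed = mask.sum()
    if n_observed == 0:
        return 0.0
    return float((per_entry * mask).sum() / n_observed)

# Two molecules, three tasks; NaN marks an assay that was never run.
logits = np.zeros((2, 3))
labels = np.array([[1.0, np.nan, 0.0], [np.nan, 1.0, np.nan]])
loss = masked_bce_with_logits(logits, labels)
print(round(loss, 4))  # 0.6931, i.e. ln 2 for zero logits
```

Because masked entries are multiplied by zero before the sum, the model's output for an untested assay has no effect on the loss or its gradient.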

## Tox21 Assays

| Task | Target |
|------|--------|
| NR-AR | Androgen receptor |
| NR-AR-LBD | Androgen receptor ligand-binding domain |
| NR-AhR | Aryl hydrocarbon receptor |
| NR-Aromatase | Aromatase enzyme inhibition |
| NR-ER | Estrogen receptor alpha |
| NR-ER-LBD | Estrogen receptor ligand-binding domain |
| NR-PPAR-gamma | Peroxisome proliferator-activated receptor gamma |
| SR-ARE | Antioxidant response element |
| SR-ATAD5 | Genotoxicity / DNA damage (ATAD5 reporter) |
| SR-HSE | Heat shock response element |
| SR-MMP | Mitochondrial membrane potential |
| SR-p53 | p53 tumour suppressor / DNA damage |

## Training Data

- **Dataset:** [Tox21 via MoleculeNet](https://huggingface.co/datasets/scikit-fingerprints/MoleculeNet_Tox21)
- **Train:** 6,264 molecules | **Val:** 783 | **Test:** 784 (80/10/10 split from 7,831 total)
- Labels are sparse: about 17% are NaN on average. Untested entries are treated as missing, not negative.
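
The split sizes above are consistent with flooring the 80% and 10% shares and giving the remainder to the test set. Below is a small sketch of that arithmetic. The helper `split_sizes` is hypothetical, and how molecules were assigned to each split (random, stratified, or scaffold-based) is not stated here.

```python
def split_sizes(n_total, train_frac=0.8, val_frac=0.1):
    """Return (train, val, test) sizes; the test set takes the remainder."""
    n_train = int(n_total * train_frac)  # floor of the training share
    n_val = int(n_total * val_frac)      # floor of the validation share
    n_test = n_total - n_train - n_val   # whatever is left over
    return n_train, n_val, n_test

sizes = split_sizes(7831)
print(sizes)  # (6264, 783, 784)
```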

## Usage

```python
import torch
import torch.nn as nn
from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator
from huggingface_hub import hf_hub_download

# Task names, in the order used to label the model outputs below
TASKS = [
    "NR-AR", "NR-AR-LBD", "NR-AhR", "NR-Aromatase",
    "NR-ER", "NR-ER-LBD", "NR-PPAR-gamma",
    "SR-ARE", "SR-ATAD5", "SR-HSE", "SR-MMP", "SR-p53",
]


class ToxMLP(nn.Module):
    def __init__(self, n_inputs=2048, n_outputs=12, dropout=0.3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_inputs, 1024), nn.BatchNorm1d(1024), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(1024, 512), nn.BatchNorm1d(512), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(512, n_outputs),  # one logit per assay
        )

    def forward(self, x):
        return self.net(x)


# Download the trained weights and switch to inference mode
model_path = hf_hub_download("Hari5115/molecular-toxicity-predictor", "best_model.pt")
model = ToxMLP()
model.load_state_dict(torch.load(model_path, map_location="cpu"))
model.eval()

# Featurize a molecule as a 2048-bit Morgan fingerprint (radius 2)
fp_gen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)
mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
if mol is None:
    raise ValueError("RDKit could not parse the SMILES string")
fp = torch.tensor([list(fp_gen.GetFingerprint(mol))], dtype=torch.float32)

# Convert the 12 logits to per-assay probabilities
with torch.no_grad():
    probs = torch.sigmoid(model(fp)).squeeze(0).numpy()

for task, p in zip(TASKS, probs):
    print(f"{task}: {p:.1%}")
```
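
As a rough size check on the `ToxMLP` definition above, the learnable parameters can be counted by hand. The sketch below is illustrative arithmetic only. The helper names are hypothetical, and it assumes the default affine `BatchNorm1d` layers (one weight and one bias per feature), ignoring their non-learnable running statistics.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

def batchnorm_params(n_features):
    """Learnable scale and shift of an affine batch-norm layer."""
    return 2 * n_features

layer_sizes = [2048, 1024, 512, 12]
total = 0
for i, (n_in, n_out) in enumerate(zip(layer_sizes[:-1], layer_sizes[1:])):
    total += linear_params(n_in, n_out)
    is_hidden = i < len(layer_sizes) - 2  # the output layer has no batch norm
    if is_hidden:
        total += batchnorm_params(n_out)

print(f"{total:,}")  # 2,632,204
```

Most of these parameters sit in the first layer (2048 × 1024 weights), which is typical for models that take wide fingerprint vectors as input.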

## Results

Test set performance (mean AUC-ROC: **0.8170**):

| Task | AUC-ROC |
|------|---------|
| NR-AR | 0.752 |
| NR-AR-LBD | 0.890 |
| NR-AhR | 0.892 |
| NR-Aromatase | 0.792 |
| NR-ER | 0.751 |
| NR-ER-LBD | 0.808 |
| NR-PPAR-gamma | 0.742 |
| SR-ARE | 0.803 |
| SR-ATAD5 | 0.847 |
| SR-HSE | 0.811 |
| SR-MMP | 0.847 |
| SR-p53 | 0.868 |

Validation set comparison: Random Forest baseline 0.8306, MLP **0.8544**.
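
With sparse labels, a mean AUC-ROC is typically computed per task on the molecules that were actually tested, then averaged across tasks. The sketch below shows one way to do that in NumPy. `auc_roc` and `mean_masked_auc` are hypothetical helpers, not the evaluation code behind the numbers above, and skipping tasks that lack both classes is an assumption.

```python
import numpy as np

def auc_roc(y_true, scores):
    """AUC as the chance that a random positive outranks a random negative (ties count 0.5)."""
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]  # all positive/negative score pairs
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (len(pos) * len(neg)))

def mean_masked_auc(labels, probs):
    """Average per-task AUC, using only rows whose label for that task is not NaN."""
    aucs = []
    for t in range(labels.shape[1]):
        observed = ~np.isnan(labels[:, t])
        y = labels[observed, t]
        if (y == 1).any() and (y == 0).any():  # AUC needs both classes
            aucs.append(auc_roc(y, probs[observed, t]))
    return float(np.mean(aucs)) if aucs else float("nan")

# Toy example: four molecules, two tasks, one untested entry.
labels = np.array([[1, 0], [0, np.nan], [1, 1], [0, 0]], dtype=float)
probs = np.array([[0.9, 0.2], [0.1, 0.7], [0.4, 0.8], [0.6, 0.3]])
mean_auc = mean_masked_auc(labels, probs)
print(mean_auc)  # 0.875
```

The pairwise comparison uses memory proportional to positives × negatives, which is fine at Tox21 scale; a rank-based implementation would be the usual choice for much larger sets.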

## Limitations

- Morgan fingerprints capture local chemical structure but miss 3D conformation and long-range interactions.
- The model is trained on a relatively small dataset (6,264 training molecules); extrapolation to novel chemical classes may be unreliable.
- **For research and educational purposes only. Not a substitute for certified toxicological testing.**

## Dataset Credit

Tox21 data provided by the NIH National Center for Advancing Translational Sciences (NCATS).
Accessed via [scikit-fingerprints/MoleculeNet_Tox21](https://huggingface.co/datasets/scikit-fingerprints/MoleculeNet_Tox21) on Hugging Face.

## License

MIT