---
license: cc-by-nc-4.0
license_name: cc-by-nc-4.0
license_link: https://github.com/AppliedScientific/CardioSafe-benchmark/blob/main/LICENSE-WEIGHTS
library_name: pytorch
pipeline_tag: tabular-regression
tags:
- chemistry
- drug-discovery
- cardiotoxicity
- hERG
- ion-channels
- multi-task
- molecules
- QSAR
language:
- en
---
# CardioSafe — paper-snapshot weights
Paper-snapshot weights for **CardioSafe: multi-task prediction of cardiac
ion channel activity with reverse-leak audited benchmarking** (Jovanović
et al., 2026,
[bioRxiv](https://www.biorxiv.org/content/10.64898/2026.05.06.723181v1)).

CardioSafe is a three-branch multi-task neural network that predicts
blocker status and pIC50 for the four CiPA cardiac ion channels (**hERG,
Nav1.5, Cav1.2, and (exploratory) IKs**). It is trained on the largest
publicly reported multi-channel cardiac ion channel dataset (ChEMBL 36 +
hERG Central, 334,444 curated compounds, 8 heads).

This HuggingFace repo is a mirror. The canonical home is
[github.com/AppliedScientific/CardioSafe-benchmark](https://github.com/AppliedScientific/CardioSafe-benchmark),
which ships the curated dataset, splits, supplementary materials, the
reverse-leak audit script, the reference model + training-step code, and
runnable inference (`inference/predict.py`). The continually-updated
deployed ensemble is served at
[platform.appliedscientific.ai/cardiosafe](https://platform.appliedscientific.ai/cardiosafe).

## Files
```
v1.0/                                 # preprint snapshot, 5-seed ensemble
  cardiosafe_v1.0_seed_{42..46}.pt    # 15 MB each
v1.1/                                 # audit-clean snapshot, 5-seed ensemble
  cardiosafe_v1.1_seed_{42..46}.pt    # 15 MB each; RECOMMENDED for new work
l1000/
  l1000_encoder.pt                    # 10 MB; shared by v1.0 + v1.1
  l1000_per_gene_pearson.json         # per-gene test-set Pearson r (diagnostic)
```
Each `.pt` contains `model_state_dict`, descriptor / L1000 / regression-head
scalers, and a clean config dict. The L1000 encoder checkpoint additionally
contains the gene co-expression `edge_index` and per-gene scaler stats.
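
The `{42..46}` brace notation in the listing expands to five seed checkpoints per version. A minimal sketch of that expansion, assuming the paths are relative to the repo root as listed; the helper name `seed_checkpoint_paths` is hypothetical and not part of the CardioSafe code:

```python
from pathlib import PurePosixPath

SEEDS = range(42, 47)  # seeds 42..46 inclusive, per the file listing


def seed_checkpoint_paths(version: str) -> list[str]:
    """Return the five repo-relative checkpoint paths for 'v1.0' or 'v1.1'."""
    if version not in {"v1.0", "v1.1"}:
        raise ValueError(f"unknown version: {version!r}")
    return [
        str(PurePosixPath(version) / f"cardiosafe_{version}_seed_{seed}.pt")
        for seed in SEEDS
    ]
```

For example, `seed_checkpoint_paths("v1.1")` starts with `v1.1/cardiosafe_v1.1_seed_42.pt`.
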
## v1.0 vs v1.1
- **v1.0** is the exact ensemble evaluated in the bioRxiv preprint.
- **v1.1** is an audit-clean retrain. The exhaustive
  O(n_train × n_other) Tanimoto leakage audit flagged 12 train↔val edges
  in tan70 v1.0 at Morgan-r2-2048 Tanimoto ≥ 0.70, all within the
  canonical cardiac-cliff cluster (terfenadine / fexofenadine /
  hydroxymethyl-terfenadine [HMT] analogs). v1.1 force-routes the 2 HMT
  analogs (rows 317153, 331406) to val so the cluster is fully
  audit-clean.
- **The test fold is identical** between v1.0 and v1.1, so the headline test
  metrics (Tables 2 and 3 of the paper) are unchanged. v1.1 only provides an
  audit-clean training set for per-seed selection on the val fold.
- See [Note S3](https://github.com/AppliedScientific/CardioSafe-benchmark/blob/main/data/supplementary/note_s3_v1_1_audit_correction.md)
  for the full audit findings and a re-evaluation of the cardiac-cliff case study.

**Use v1.1 for new work.** v1.0 is retained so the preprint numbers stay
reproducible.
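
The idea behind an exhaustive O(n_train × n_other) Tanimoto audit can be illustrated with a toy sketch. This is not the audit script shipped in the GitHub repo: it assumes each fingerprint is given as a set of on-bit indices, and the names `tanimoto` and `leak_edges` are hypothetical.

```python
def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto similarity |a & b| / |a | b| between two sets of on-bits."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    inter = len(a & b)
    return inter / (len(a) + len(b) - inter)


def leak_edges(train: dict, other: dict, threshold: float = 0.70) -> list:
    """Compare every train fingerprint with every val/test fingerprint.

    Returns (train_id, other_id, similarity) for each pair at or above
    the threshold. Cost is O(n_train * n_other) comparisons.
    """
    return [
        (i, j, sim)
        for i, fp_a in train.items()
        for j, fp_b in other.items()
        if (sim := tanimoto(fp_a, fp_b)) >= threshold
    ]
```

An empty result at the chosen threshold means no train compound has a near neighbor in the other fold. Each returned edge is a pair that a split would need to re-route, as v1.1 does for the two HMT analogs.
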
## Inputs and outputs
The model expects a single flat `float32` tensor of shape `(B, 7526)`:
| dims | block | source |
| --- | --- | --- |
| 0 – 2047 | Morgan radius-2 2048-bit binary fingerprint | RDKit `GetMorganGenerator(radius=2, fpSize=2048)` |
| 2048 – 4095 | AtomPair 2048-bit binary fingerprint | RDKit `GetAtomPairGenerator(fpSize=2048)` |
| 4096 – 6143 | TopologicalTorsion 2048-bit binary fingerprint | RDKit `GetTopologicalTorsionGenerator(fpSize=2048)` |
| 6144 – 6163 | 20-descriptor block, training-fold z-scored | Spec in `data/supplementary/table_s0_descriptor_spec.*` |
| 6164 – 6547 | ChemBERTa-77M-MTR mean-pooled embedding (384) | `model/chemberta_encoder.py` |
| 6548 – 7525 | L1000 predicted expression z-scores (978) | `model/l1000_encoder.py` |
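
The offsets in the table follow from the block widths. A small sketch, using only the widths stated above, that derives the half-open slice for each block and confirms the total of 7526 (the block names and the helper `block_slices` are illustrative labels, not identifiers from the CardioSafe code):

```python
from itertools import accumulate

# (name, width) in the order the blocks are concatenated.
BLOCK_WIDTHS = [
    ("morgan", 2048),
    ("atompair", 2048),
    ("torsion", 2048),
    ("descriptors", 20),
    ("chemberta", 384),
    ("l1000", 978),
]


def block_slices(widths=BLOCK_WIDTHS) -> dict:
    """Map each block name to the half-open slice it occupies in the flat vector."""
    ends = list(accumulate(w for _, w in widths))
    starts = [0] + ends[:-1]
    return {name: slice(s, e) for (name, _), s, e in zip(widths, starts, ends)}


SLICES = block_slices()
INPUT_DIM = SLICES["l1000"].stop  # total feature dimension
```

The table lists inclusive index ranges, while Python slices are half-open, so `SLICES["descriptors"]` is `slice(6144, 6164)` for dims 6144 to 6163.
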

`forward(x)` returns a `dict[str, Tensor]` with 8 keys; each value is a `(B,)` tensor:

| Head | Output | Channel |
| --- | --- | --- |
| `herg_pchembl` | regression — raw pIC50 | hERG |
| `herg_blocker_10um` | logit (apply sigmoid for P) | hERG |
| `herg_blocker_1um` | logit | hERG |
| `nav15_pchembl` | regression — raw pIC50 | Nav1.5 |
| `nav15_blocker` | logit | Nav1.5 |
| `cav12_pchembl` | regression — raw pIC50 | Cav1.2 |
| `cav12_blocker` | logit | Cav1.2 |
| `iks_blocker` | logit | IKs |

IKs has no regression head: with only 115 labelled compounds, it is treated
as exploratory. See the
[full model card](https://github.com/AppliedScientific/CardioSafe-benchmark/blob/main/model/MODEL_CARD.md)
for architecture details.
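
Since the blocker heads emit logits and the `*_pchembl` heads emit raw pIC50, a post-processing step might look like the sketch below. It uses plain Python lists in place of tensors so that it runs without PyTorch; with real outputs, `torch.sigmoid` would be the natural equivalent. The function names are hypothetical.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def postprocess(outputs: dict) -> dict:
    """Convert blocker logits to probabilities; pass pIC50 heads through unchanged."""
    return {
        head: [sigmoid(v) for v in values] if "_blocker" in head else list(values)
        for head, values in outputs.items()
    }
```

Selecting heads by the `_blocker` substring relies on the naming in the table above. An explicit list of the five logit heads would be more robust if head names ever change.
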
## Usage
The recommended path is the runnable inference shipped in the GitHub
repo. It handles all featurization (RDKit + ChemBERTa + L1000 encoder)
and the ensemble forward pass:
```bash
git clone https://github.com/AppliedScientific/CardioSafe-benchmark
cd CardioSafe-benchmark
pip install -e ".[inference]"  # quoted so shells such as zsh do not glob the brackets
# CSV in / CSV out — auto-downloads weights from GitHub Releases on first call
python -m inference.predict --in inference/example_smiles.csv \
--out predictions.csv \
--version v1.1
```
To download these weight files from the HuggingFace mirror instead:
```python
from huggingface_hub import snapshot_download
local = snapshot_download(repo_id="appliedscientific/cardiosafe")
# v1.0/, v1.1/, l1000/ subdirectories under `local`
```
The repo's `inference.ensemble` module loads the seed checkpoints; see
[`inference/README.md`](https://github.com/AppliedScientific/CardioSafe-benchmark/blob/main/inference/README.md)
for the loader API and a Python example.
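
If only the recommended snapshot is needed, the mirror download can probably be narrowed with the `allow_patterns` argument of `snapshot_download`. A hedged sketch, assuming the directory layout listed under Files; the wrapper `download_v11` and the constant `V11_PATTERNS` are hypothetical names:

```python
# Glob patterns for the v1.1 ensemble plus the L1000 encoder it shares with v1.0.
V11_PATTERNS = ["v1.1/*", "l1000/*"]


def download_v11(repo_id: str = "appliedscientific/cardiosafe") -> str:
    """Download only v1.1/ and l1000/ and return the local snapshot directory."""
    # Imported lazily so this module still loads where huggingface_hub is absent.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id, allow_patterns=V11_PATTERNS)
```

Calling `download_v11()` needs network access and returns the path to the local snapshot. The v1.0 checkpoints are skipped, which saves roughly 75 MB going by the sizes in the file listing.
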
## Verified
Loading the v1.1 weights into the public
`model.cross_attn.CrossAttnIonChannelPredictor` and running the
cardiac-cliff anchors reproduces the published v1.1 case-study values to
within 0.02 pIC50 units: terfenadine pIC50 6.258 (published 6.247),
fexofenadine pIC50 4.505 (4.512), cliff 1.754 (1.736).
## License
[CC-BY-NC-4.0](https://github.com/AppliedScientific/CardioSafe-benchmark/blob/main/LICENSE-WEIGHTS).
Academic, educational, and non-profit research use is permitted with
attribution. Commercial use requires a separate license — contact the
authors (`lukas@appliedscientific.ai`, `mihailo@appliedscientific.ai`).
The code in the GitHub repository is MIT; the dataset there is CC-BY-4.0.
Only the model weights distributed here and in the GitHub Releases are
CC-BY-NC-4.0.
## Citation
```bibtex
@article{cardiosafe2026,
title = {CardioSafe: multi-task prediction of cardiac ion channel
activity with reverse-leak audited benchmarking},
author = {Jovanović, Mihailo and Weidener, Lukas and Brkić, Marko and
Ulgac, Emre and Meduri, Aakaash},
year = {2026},
journal = {bioRxiv},
doi = {10.64898/2026.05.06.723181},
url = {https://www.biorxiv.org/content/10.64898/2026.05.06.723181v1}
}
```