---
license: cc-by-nc-4.0
license_name: cc-by-nc-4.0
license_link: https://github.com/AppliedScientific/CardioSafe-benchmark/blob/main/LICENSE-WEIGHTS
library_name: pytorch
pipeline_tag: tabular-regression
tags:
- chemistry
- drug-discovery
- cardiotoxicity
- hERG
- ion-channels
- multi-task
- molecules
- QSAR
language:
- en
---

# CardioSafe: paper-snapshot weights

Paper-snapshot weights for **CardioSafe: multi-task prediction of cardiac
ion channel activity with reverse-leak audited benchmarking** (Jovanović
et al., 2026,
[bioRxiv](https://www.biorxiv.org/content/10.64898/2026.05.06.723181v1)).

CardioSafe is a three-branch multi-task neural network with 8 output heads.
It predicts blocker status for four CiPA cardiac ion channels (**hERG,
Nav1.5, Cav1.2, and, on an exploratory basis, IKs**) and pIC50 for the
first three. It was trained on the largest publicly reported multi-channel
cardiac ion channel dataset (ChEMBL 36 + hERG Central, 334,444 curated
compounds).

This HuggingFace repo is a mirror. The canonical home is
[github.com/AppliedScientific/CardioSafe-benchmark](https://github.com/AppliedScientific/CardioSafe-benchmark),
which ships the curated dataset, splits, supplementary materials, the
reverse-leak audit script, the reference model and training-step code, and
runnable inference (`inference/predict.py`). The continually updated
deployed ensemble is served at
[platform.appliedscientific.ai/cardiosafe](https://platform.appliedscientific.ai/cardiosafe).

## Files

```
v1.0/                                  # preprint snapshot, 5-seed ensemble
  cardiosafe_v1.0_seed_{42..46}.pt     # 15 MB each
v1.1/                                  # audit-clean snapshot, 5-seed ensemble
  cardiosafe_v1.1_seed_{42..46}.pt     # 15 MB each; RECOMMENDED for new work
l1000/
  l1000_encoder.pt                     # 10 MB, shared by v1.0 + v1.1
  l1000_per_gene_pearson.json          # per-gene test-set Pearson r (diagnostic)
```

Each seed checkpoint (`.pt`) contains `model_state_dict`, the descriptor,
L1000, and regression-head scalers, and a clean config dict. The L1000
encoder checkpoint additionally contains the gene co-expression
`edge_index` and per-gene scaler stats.
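
As a minimal sketch of how the listing above maps to file paths, the helper below enumerates the five seed checkpoints for one version. The function name `seed_checkpoint_paths` and the `root` argument are hypothetical; only the directory names, the file-name pattern, and the seed range 42 to 46 come from the listing. Loading with PyTorch is shown in comments only, and any key other than `model_state_dict` should be checked against the file itself.

```python
from pathlib import Path

SEEDS = range(42, 47)  # seeds 42..46 inclusive, per the file listing


def seed_checkpoint_paths(root, version="v1.1"):
    """Return the five per-seed checkpoint paths for `version` under `root`."""
    if version not in ("v1.0", "v1.1"):
        raise ValueError(f"unknown version: {version!r}")
    base = Path(root) / version
    return [base / f"cardiosafe_{version}_seed_{seed}.pt" for seed in SEEDS]


paths = seed_checkpoint_paths("weights", "v1.1")
print(len(paths), paths[0].name)

# Inspecting one checkpoint (requires PyTorch; not run here):
#   import torch
#   # weights_only=False may be needed if the scalers are pickled objects
#   # (an assumption); only use it on files you trust.
#   ckpt = torch.load(paths[0], map_location="cpu", weights_only=False)
#   print(sorted(ckpt))  # expect "model_state_dict" plus scaler and config entries
```

Here `root` would typically be the directory returned by a download of this repo, with `v1.0/` and `v1.1/` directly beneath it.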

## v1.0 vs v1.1

- **v1.0** is the exact ensemble evaluated in the bioRxiv preprint.
- **v1.1** is an audit-clean retrain. The exhaustive
  O(n_train × n_other) Tanimoto leakage audit flagged 12 train↔val edges
  in tan70 v1.0 at Morgan-r2-2048 Tanimoto ≥ 0.70, all within the
  canonical cardiac-cliff cluster (terfenadine / fexofenadine /
  hydroxymethyl-terfenadine analogs). v1.1 force-routes the 2
  hydroxymethyl-terfenadine (HMT) analogs (rows 317153, 331406) to val so
  the cluster is fully audit-clean.
- **The test fold is identical** between v1.0 and v1.1, so the headline test
  metrics (Tables 2 and 3 of the paper) are unchanged. v1.1 only provides an
  audit-clean training set for the per-seed val-fold selection.
- See [Note S3](https://github.com/AppliedScientific/CardioSafe-benchmark/blob/main/data/supplementary/note_s3_v1_1_audit_correction.md)
  for the full audit findings and the re-evaluation of the cardiac-cliff case study.

**Use v1.1 for new work.** v1.0 is retained so the preprint numbers stay
reproducible.
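
To make the idea of an exhaustive O(n_train × n_other) Tanimoto audit concrete, here is a minimal pure-Python sketch. It is not the repository's audit script: the function names, the representation of a fingerprint as a set of on-bit indices, and the toy data are all assumptions for illustration. With RDKit, the on-bits would come from a Morgan radius-2, 2048-bit fingerprint.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two sets of on-bit indices."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def leakage_edges(train, other, threshold=0.70):
    """Compare every train fingerprint with every fingerprint in another fold.

    `train` and `other` map a row id to a frozenset of on-bit indices.
    Returns (train_id, other_id, similarity) for pairs at or above `threshold`.
    """
    edges = []
    for t_id, t_fp in train.items():
        for o_id, o_fp in other.items():
            sim = tanimoto(t_fp, o_fp)
            if sim >= threshold:
                edges.append((t_id, o_id, sim))
    return edges


# Toy fingerprints, made up for illustration (not real compounds).
train = {"t1": frozenset({1, 2, 3, 4}), "t2": frozenset({10, 11})}
val = {"v1": frozenset({1, 2, 3, 4, 5}), "v2": frozenset({20, 21})}
print(leakage_edges(train, val))
```

In this toy case only the pair `t1`/`v1` is flagged: it shares 4 of 5 distinct bits, a similarity of 0.8. Each flagged pair corresponds to one train↔val edge in the sense used above.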

## Inputs and outputs

The model expects a single flat `float32` tensor of shape `(B, 7526)`:

| Dims | Block | Source |
| --- | --- | --- |
| 0 – 2047 | Morgan radius-2 2048-bit binary fingerprint | RDKit `GetMorganGenerator(radius=2, fpSize=2048)` |
| 2048 – 4095 | AtomPair 2048-bit binary fingerprint | RDKit `GetAtomPairGenerator(fpSize=2048)` |
| 4096 – 6143 | TopologicalTorsion 2048-bit binary fingerprint | RDKit `GetTopologicalTorsionGenerator(fpSize=2048)` |
| 6144 – 6163 | 20-descriptor block, training-fold z-scored | Spec in `data/supplementary/table_s0_descriptor_spec.*` |
| 6164 – 6547 | ChemBERTa-77M-MTR mean-pooled embedding (384) | `model/chemberta_encoder.py` |
| 6548 – 7525 | L1000 predicted expression z-scores (978) | `model/l1000_encoder.py` |
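
The offsets in the table follow from concatenating the six blocks in order. The sketch below derives them from the block widths and checks that per-block features add up to 7526 values. The block names (`morgan`, `atom_pair`, and so on) and the helpers `block_slices` and `assemble` are hypothetical; the repository's own featurization code is the reference.

```python
# (name, width) in the order given by the table above; names are illustrative.
BLOCKS = [
    ("morgan", 2048),
    ("atom_pair", 2048),
    ("topological_torsion", 2048),
    ("descriptors", 20),
    ("chemberta", 384),
    ("l1000", 978),
]


def block_slices(blocks=BLOCKS):
    """Map each block name to the slice it occupies in the flat input vector."""
    slices, start = {}, 0
    for name, width in blocks:
        slices[name] = slice(start, start + width)
        start += width
    return slices


SLICES = block_slices()
TOTAL_DIM = SLICES["l1000"].stop


def assemble(parts):
    """Concatenate per-block feature lists into one flat vector, checking widths."""
    flat = []
    for name, width in BLOCKS:
        values = list(parts[name])
        if len(values) != width:
            raise ValueError(f"{name}: expected {width} values, got {len(values)}")
        flat.extend(float(v) for v in values)
    return flat


x = assemble({name: [0.0] * width for name, width in BLOCKS})
print(TOTAL_DIM, len(x), SLICES["descriptors"])
```

Slices are half-open, so `slice(6144, 6164)` covers dims 6144 to 6163 as listed in the table. A batch of such vectors, cast to `float32`, gives the `(B, 7526)` input.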

`forward(x)` returns a `dict[str, Tensor]` with 8 keys, each value a `(B,)` tensor:

| Head | Output | Channel |
| --- | --- | --- |
| `herg_pchembl` | regression (raw pIC50) | hERG |
| `herg_blocker_10um` | logit (apply sigmoid for probability) | hERG |
| `herg_blocker_1um` | logit | hERG |
| `nav15_pchembl` | regression (raw pIC50) | Nav1.5 |
| `nav15_blocker` | logit | Nav1.5 |
| `cav12_pchembl` | regression (raw pIC50) | Cav1.2 |
| `cav12_blocker` | logit | Cav1.2 |
| `iks_blocker` | logit | IKs |

IKs has no regression head: with only n = 115 labelled compounds, it is
treated as exploratory. See the
[full model card](https://github.com/AppliedScientific/CardioSafe-benchmark/blob/main/model/MODEL_CARD.md)
for architecture details.
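
As a sketch of how the raw head outputs might be post-processed for one compound, the code below applies a sigmoid to each logit and converts each pIC50 to an IC50 in µM. It assumes the regression outputs are already in pIC50 units, as the table states. The function names and the `_prob` / `_ic50_um` result keys are hypothetical, and the input numbers are made up.

```python
import math

CLASSIFICATION_HEADS = (
    "herg_blocker_10um", "herg_blocker_1um",
    "nav15_blocker", "cav12_blocker", "iks_blocker",
)
REGRESSION_HEADS = ("herg_pchembl", "nav15_pchembl", "cav12_pchembl")


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def pic50_to_ic50_um(pic50):
    """pIC50 = -log10(IC50 in mol/L), so IC50 in µM is 10 ** (6 - pIC50)."""
    return 10.0 ** (6.0 - pic50)


def postprocess(outputs):
    """Turn one compound's raw head outputs into probabilities and IC50 values."""
    result = {}
    for head in CLASSIFICATION_HEADS:
        result[head + "_prob"] = sigmoid(outputs[head])
    for head in REGRESSION_HEADS:
        result[head] = outputs[head]
        result[head.replace("_pchembl", "_ic50_um")] = pic50_to_ic50_um(outputs[head])
    return result


# Made-up raw outputs for a single compound (not model predictions).
raw = {
    "herg_pchembl": 6.0, "herg_blocker_10um": 2.0, "herg_blocker_1um": 0.0,
    "nav15_pchembl": 5.0, "nav15_blocker": -1.0,
    "cav12_pchembl": 4.0, "cav12_blocker": -3.0,
    "iks_blocker": -2.0,
}
print(postprocess(raw))
```

A pIC50 of 6 corresponds to an IC50 of 1 µM, and a logit of 0 to a probability of 0.5. How probabilities are combined across the five seeds is left to the repository's ensemble code.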

## Usage

The recommended path is the runnable inference shipped in the GitHub
repo. It handles all featurization (RDKit + ChemBERTa + L1000 encoder)
and the ensemble forward pass:

```bash
git clone https://github.com/AppliedScientific/CardioSafe-benchmark
cd CardioSafe-benchmark
# Quote the extras so shells such as zsh do not expand the brackets
pip install -e ".[inference]"

# CSV in / CSV out; auto-downloads weights from GitHub Releases on first call
python -m inference.predict --in inference/example_smiles.csv \
    --out predictions.csv \
    --version v1.1
```

To download these weight files from the HuggingFace mirror instead:

```python
from huggingface_hub import snapshot_download

local = snapshot_download(repo_id="appliedscientific/cardiosafe")
# v1.0/, v1.1/, l1000/ subdirectories under `local`
```

The repo's `inference.ensemble` module loads the seed checkpoints; see
[`inference/README.md`](https://github.com/AppliedScientific/CardioSafe-benchmark/blob/main/inference/README.md)
for the loader API and a Python example.

## Verified

Loading the v1.1 weights into the public
`model.cross_attn.CrossAttnIonChannelPredictor` and running the
cardiac-cliff anchors reproduces the published v1.1 case-study values to
within 0.02 pIC50 units: terfenadine pIC50 6.258 (published 6.247),
fexofenadine pIC50 4.505 (published 4.512), cliff 1.754 (published 1.736).
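
The gaps between the reproduced and published figures can be checked directly from the numbers quoted above. The snippet is plain arithmetic on those six values; the dictionary names are only illustrative.

```python
reproduced = {"terfenadine": 6.258, "fexofenadine": 4.505, "cliff": 1.754}
published = {"terfenadine": 6.247, "fexofenadine": 4.512, "cliff": 1.736}

# Absolute gap per quantity, rounded to the precision of the quoted figures.
gaps = {name: round(abs(reproduced[name] - published[name]), 3) for name in reproduced}
print(gaps)
print("largest gap:", max(gaps.values()))

# The cliff is the terfenadine-minus-fexofenadine pIC50 difference.
print(round(reproduced["terfenadine"] - reproduced["fexofenadine"], 3))
```

The per-compound gaps are 0.011 and 0.007, and the cliff gap is 0.018. The difference of the rounded pIC50 values is 1.753, so the quoted cliff of 1.754 was presumably computed before rounding.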

## License

[CC-BY-NC-4.0](https://github.com/AppliedScientific/CardioSafe-benchmark/blob/main/LICENSE-WEIGHTS).
Academic, educational, and non-profit research use is permitted with
attribution. Commercial use requires a separate license; contact the
authors (`lukas@appliedscientific.ai`, `mihailo@appliedscientific.ai`).

The code in the GitHub repository is MIT; the dataset there is CC-BY-4.0.
Only the model weights distributed here and in the GitHub Releases are
CC-BY-NC-4.0.

## Citation

```bibtex
@article{cardiosafe2026,
  title   = {CardioSafe: multi-task prediction of cardiac ion channel
             activity with reverse-leak audited benchmarking},
  author  = {Jovanović, Mihailo and Weidener, Lukas and Brkić, Marko and
             Ulgac, Emre and Meduri, Aakaash},
  year    = {2026},
  journal = {bioRxiv},
  doi     = {10.64898/2026.05.06.723181},
  url     = {https://www.biorxiv.org/content/10.64898/2026.05.06.723181v1}
}
```