Publish CardioSafe v1.0 + v1.1 paper-snapshot weights and L1000 encoder


	---
	license: cc-by-nc-4.0
	license_name: cc-by-nc-4.0
	license_link: https://github.com/AppliedScientific/CardioSafe-benchmark/blob/main/LICENSE-WEIGHTS
	library_name: pytorch
	pipeline_tag: tabular-regression
	tags:
	- chemistry
	- drug-discovery
	- cardiotoxicity
	- hERG
	- ion-channels
	- multi-task
	- molecules
	- QSAR
	language:
	- en
	---

	# CardioSafe — paper-snapshot weights

	Paper-snapshot weights for **CardioSafe: multi-task prediction of cardiac
	ion channel activity with reverse-leak audited benchmarking** (Jovanović
	et al., 2026,
	[bioRxiv](https://www.biorxiv.org/content/10.64898/2026.05.06.723181v1)).

	CardioSafe is a three-branch multi-task neural network with 8 output
	heads. It predicts blocker status for the four CiPA cardiac ion channels
	(**hERG, Nav1.5, Cav1.2, and, as an exploratory target, IKs**) and pIC50
	for the first three. It was trained on the largest publicly reported
	multi-channel cardiac ion channel dataset (ChEMBL 36 + hERG Central,
	334,444 curated compounds).

	This HuggingFace repo is a mirror. The canonical home is
	[github.com/AppliedScientific/CardioSafe-benchmark](https://github.com/AppliedScientific/CardioSafe-benchmark),
	which ships the curated dataset, splits, supplementary materials, the
	reverse-leak audit script, the reference model + training-step code, and
	runnable inference (`inference/predict.py`). The continually-updated
	deployed ensemble is served at
	[platform.appliedscientific.ai/cardiosafe](https://platform.appliedscientific.ai/cardiosafe).

	## Files

	```
	v1.0/                                 # preprint snapshot, 5-seed ensemble
	  cardiosafe_v1.0_seed_{42..46}.pt    # 15 MB each
	v1.1/                                 # audit-clean snapshot, 5-seed ensemble
	  cardiosafe_v1.1_seed_{42..46}.pt    # 15 MB each; RECOMMENDED for new work
	l1000/
	  l1000_encoder.pt                    # 10 MB, shared by v1.0 + v1.1
	  l1000_per_gene_pearson.json         # per-gene test-set Pearson r (diagnostic)
	```
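The `{42..46}` notation in the listing is shell-style brace expansion, so each version directory holds five per-seed checkpoints. The sketch below expands that pattern into relative paths. The helper name `seed_checkpoint_paths` is hypothetical and not part of the repo.

```python
from pathlib import PurePosixPath

SEEDS = range(42, 47)  # 42..46 inclusive, as in the listing above


def seed_checkpoint_paths(version: str) -> list[PurePosixPath]:
    """Return the five per-seed checkpoint paths for 'v1.0' or 'v1.1'."""
    if version not in ("v1.0", "v1.1"):
        raise ValueError(f"unknown version: {version!r}")
    return [
        PurePosixPath(version) / f"cardiosafe_{version}_seed_{seed}.pt"
        for seed in SEEDS
    ]


for path in seed_checkpoint_paths("v1.1"):
    print(path)
```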

	Each `.pt` contains `model_state_dict`, descriptor / L1000 / regression-head
	scalers, and a clean config dict. The L1000 encoder checkpoint additionally
	contains the gene co-expression `edge_index` and per-gene scaler stats.
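To see what a checkpoint actually holds before wiring it into a model, it can help to list its top-level keys. The sketch below is illustrative only. The sole key name taken from this card is `model_state_dict`, and the other key names are deliberately not guessed. It assumes the scalers may be stored as pickled Python objects, which is why `weights_only=False` is passed. Only use that setting on files you trust. Both helper names are hypothetical.

```python
from collections.abc import Mapping
from typing import Any


def load_checkpoint(path: str) -> Mapping[str, Any]:
    """Load one seed checkpoint on CPU (requires PyTorch)."""
    import torch  # imported lazily so the summary helper works without it

    # Assumption: scalers may be pickled objects, so full unpickling is
    # enabled here. Only do this for checkpoints from a trusted source.
    return torch.load(path, map_location="cpu", weights_only=False)


def summarize_checkpoint(ckpt: Mapping[str, Any]) -> dict[str, str]:
    """Map each top-level key to a short description of its value."""
    summary = {}
    for key, value in ckpt.items():
        if isinstance(value, Mapping):
            summary[key] = f"{type(value).__name__} with {len(value)} entries"
        else:
            summary[key] = type(value).__name__
    return summary


# Stand-in dict for demonstration; a real call would look like
# summarize_checkpoint(load_checkpoint("v1.1/cardiosafe_v1.1_seed_42.pt"))
fake = {"model_state_dict": {"layer.weight": [0.0], "layer.bias": [0.0]}}
print(summarize_checkpoint(fake))
```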

	## v1.0 vs v1.1

	- v1.0 is the exact ensemble evaluated in the bioRxiv preprint.
	- v1.1 is an audit-clean retrain. The exhaustive
	  O(n_train × n_other) Tanimoto leakage audit flagged 12 train↔val edges
	  in tan70 v1.0 at Morgan-r2-2048 Tanimoto ≥ 0.70, all within the
	  canonical cardiac-cliff cluster (terfenadine / fexofenadine /
	  hydroxymethyl-terfenadine (HMT) analogs). v1.1 force-routes the 2 HMT
	  analogs (rows 317153, 331406) to val so that the cluster is fully
	  audit-clean.
	- The test fold is identical in v1.0 and v1.1, so the headline test
	  metrics (Tables 2 / 3 of the paper) are unchanged. What changes in v1.1
	  is that the training set is audit-clean with respect to the val fold
	  used for per-seed selection.
	- See [Note S3](https://github.com/AppliedScientific/CardioSafe-benchmark/blob/main/data/supplementary/note_s3_v1_1_audit_correction.md)
	for the full audit findings + re-evaluation of the cardiac-cliff case study.

	Use v1.1 for new work. v1.0 is retained so the preprint numbers stay
	reproducible.
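The audit described above compares every training compound against every compound in another fold and flags pairs at or above the similarity threshold. The sketch below shows that exhaustive pairwise check on toy data. It is not the repo's audit script: fingerprints are represented as plain sets of on-bit indices rather than RDKit Morgan fingerprints, and the names `tanimoto` and `audit_leaks` are hypothetical.

```python
from itertools import product

Fingerprint = frozenset[int]  # indices of the on-bits in a binary fingerprint


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto similarity: shared on-bits divided by total distinct on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def audit_leaks(
    train: dict[str, Fingerprint],
    other: dict[str, Fingerprint],
    threshold: float = 0.70,
) -> list[tuple[str, str, float]]:
    """Compare every train row with every other-fold row; return flagged edges."""
    edges = []
    for (i, fp_i), (j, fp_j) in product(train.items(), other.items()):
        sim = tanimoto(fp_i, fp_j)
        if sim >= threshold:
            edges.append((i, j, sim))
    return edges


# Toy fingerprints: t0 and v0 share 4 of 5 distinct bits (similarity 0.8).
train = {"t0": frozenset({1, 2, 3, 4}), "t1": frozenset({10, 11})}
val = {"v0": frozenset({1, 2, 3, 4, 5}), "v1": frozenset({20, 21})}
edges = audit_leaks(train, val)
print(edges)
```

The double loop costs O(n_train × n_other) comparisons, which is what makes the audit exhaustive rather than sampled.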

	## Inputs and outputs

	The model expects a single flat `float32` tensor of shape `(B, 7526)`:

	| Dims | Block | Source |
	| --- | --- | --- |
	| 0 – 2047 | Morgan radius-2 2048-bit binary fingerprint | RDKit `GetMorganGenerator(radius=2, fpSize=2048)` |
	| 2048 – 4095 | AtomPair 2048-bit binary fingerprint | RDKit `GetAtomPairGenerator(fpSize=2048)` |
	| 4096 – 6143 | TopologicalTorsion 2048-bit binary fingerprint | RDKit `GetTopologicalTorsionGenerator(fpSize=2048)` |
	| 6144 – 6163 | 20-descriptor block, training-fold z-scored | Spec in `data/supplementary/table_s0_descriptor_spec.*` |
	| 6164 – 6547 | ChemBERTa-77M-MTR mean-pooled embedding (384) | `model/chemberta_encoder.py` |
	| 6548 – 7525 | L1000 predicted expression z-scores (978) | `model/l1000_encoder.py` |
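The offsets in the table follow from concatenating the six blocks in order. The sketch below recomputes them from the block widths and checks that they sum to 7526. The block labels (`morgan`, `atom_pair`, and so on) are hypothetical names chosen for illustration. Note that Python slices are half-open, so the table's inclusive range 6548 – 7525 appears as `slice(6548, 7526)`.

```python
BLOCK_SIZES = {
    "morgan": 2048,
    "atom_pair": 2048,
    "topological_torsion": 2048,
    "descriptors": 20,
    "chemberta": 384,
    "l1000": 978,
}


def block_slices(sizes: dict[str, int]) -> dict[str, slice]:
    """Turn consecutive block widths into half-open slices over the vector."""
    slices, start = {}, 0
    for name, width in sizes.items():
        slices[name] = slice(start, start + width)
        start += width
    return slices


SLICES = block_slices(BLOCK_SIZES)
TOTAL = sum(BLOCK_SIZES.values())

row = [0.0] * TOTAL  # stand-in for one featurized molecule
print(TOTAL, SLICES["l1000"], len(row[SLICES["chemberta"]]))
```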

	`forward(x)` returns a `dict[str, Tensor]` with 8 keys, each value a `(B,)` tensor:

	| Head | Output | Channel |
	| --- | --- | --- |
	| `herg_pchembl` | regression: raw pIC50 | hERG |
	| `herg_blocker_10um` | logit (apply sigmoid for P) | hERG |
	| `herg_blocker_1um` | logit | hERG |
	| `nav15_pchembl` | regression: raw pIC50 | Nav1.5 |
	| `nav15_blocker` | logit | Nav1.5 |
	| `cav12_pchembl` | regression: raw pIC50 | Cav1.2 |
	| `cav12_blocker` | logit | Cav1.2 |
	| `iks_blocker` | logit | IKs |

	IKs has no regression head (n = 115 labelled compounds; treated as
	exploratory). See the
	[full model card](https://github.com/AppliedScientific/CardioSafe-benchmark/blob/main/model/MODEL_CARD.md)
	for architecture details.
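Because the blocker heads return logits, they need a sigmoid before they can be read as probabilities, and the five seed models have to be combined somehow. The sketch below shows one plausible post-processing step for a single molecule, using plain floats in place of tensors. It assumes the ensemble is aggregated by averaging per-seed probabilities and per-seed pIC50 values. The repo's `inference.ensemble` module may aggregate differently, and `ensemble_predict` is a hypothetical name.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def ensemble_predict(per_seed: list[dict[str, float]]) -> dict[str, float]:
    """Average per-seed outputs for one molecule.

    Heads ending in '_pchembl' are treated as regression values and averaged
    directly; all other heads are treated as logits, converted to
    probabilities, and then averaged (an assumption, see the text above).
    """
    merged = {}
    for head in per_seed[0]:
        values = [out[head] for out in per_seed]
        if head.endswith("_pchembl"):
            merged[head] = sum(values) / len(values)
        else:
            merged[head] = sum(sigmoid(v) for v in values) / len(values)
    return merged


# Two made-up seed outputs for two of the eight heads.
outputs = [
    {"herg_pchembl": 6.0, "herg_blocker_10um": 0.0},
    {"herg_pchembl": 6.5, "herg_blocker_10um": 0.0},
]
merged = ensemble_predict(outputs)
print(merged)
```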

	## Usage

	The recommended path is the runnable inference shipped in the GitHub
	repo. It handles all featurization (RDKit + ChemBERTa + L1000 encoder)
	and the ensemble forward pass:

	```bash
	git clone https://github.com/AppliedScientific/CardioSafe-benchmark
	cd CardioSafe-benchmark
	pip install -e ".[inference]"  # quoted so shells such as zsh do not glob the brackets

	# CSV in / CSV out — auto-downloads weights from GitHub Releases on first call
	python -m inference.predict --in inference/example_smiles.csv \
	    --out predictions.csv \
	    --version v1.1
	```

	To download these weight files from the HuggingFace mirror instead:

	```python
	from huggingface_hub import snapshot_download

	local = snapshot_download(repo_id="appliedscientific/cardiosafe")
	# v1.0/, v1.1/, l1000/ subdirectories under `local`
	```

	The repo's `inference.ensemble` module loads the seed checkpoints; see
	[`inference/README.md`](https://github.com/AppliedScientific/CardioSafe-benchmark/blob/main/inference/README.md)
	for the loader API and a Python example.
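If only one snapshot is needed, the download can be restricted to that version's directory plus the shared L1000 files, using the `allow_patterns` argument of `snapshot_download`. The sketch below is a minimal example of that. The helper names `allow_patterns_for` and `download` are hypothetical, and the directory names are those listed under Files.

```python
def allow_patterns_for(version: str = "v1.1", with_l1000: bool = True) -> list[str]:
    """Build glob patterns selecting one snapshot and, optionally, l1000/."""
    if version not in ("v1.0", "v1.1"):
        raise ValueError(f"unknown version: {version!r}")
    patterns = [f"{version}/*"]
    if with_l1000:
        patterns.append("l1000/*")
    return patterns


def download(version: str = "v1.1") -> str:
    """Download only the selected files; returns the local snapshot path."""
    from huggingface_hub import snapshot_download  # lazy import

    return snapshot_download(
        repo_id="appliedscientific/cardiosafe",
        allow_patterns=allow_patterns_for(version),
    )


print(allow_patterns_for("v1.1"))
```

Calling `download("v1.1")` would then fetch roughly the five v1.1 checkpoints and the L1000 files instead of the whole repo.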

	## Verified

	Loading the v1.1 weights into the public
	`model.cross_attn.CrossAttnIonChannelPredictor` and running the
	cardiac-cliff anchors reproduces the published v1.1 case-study values to
	within 0.02 pIC50 units: terfenadine pIC50 6.258 (published 6.247),
	fexofenadine pIC50 4.505 (published 4.512), cliff 1.754 (published 1.736).
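The comparison can be checked with a few lines of arithmetic. The sketch below computes the absolute gap between each reproduced and published value quoted above. It also converts the reproduced cliff into an IC50 fold-change, assuming "cliff" denotes the pIC50 difference between terfenadine and fexofenadine. A difference in pIC50 is a log10 potency ratio.

```python
# (reproduced, published) pairs as quoted in this section
ANCHORS = {
    "terfenadine pIC50": (6.258, 6.247),
    "fexofenadine pIC50": (4.505, 4.512),
    "cliff": (1.754, 1.736),
}

# Absolute gap per anchor, rounded to the precision of the quoted values.
gaps = {name: round(abs(got - pub), 3) for name, (got, pub) in ANCHORS.items()}
for name, gap in gaps.items():
    print(f"{name}: |reproduced - published| = {gap:.3f}")

# A pIC50 difference d corresponds to a 10**d fold difference in IC50.
fold = 10 ** ANCHORS["cliff"][0]
print(f"cliff as IC50 fold-change: {fold:.1f}x")
```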

	## License

	[CC-BY-NC-4.0](https://github.com/AppliedScientific/CardioSafe-benchmark/blob/main/LICENSE-WEIGHTS).
	Academic, educational, and non-profit research use is permitted with
	attribution. Commercial use requires a separate license — contact the
	authors (`lukas@appliedscientific.ai`, `mihailo@appliedscientific.ai`).

	The code in the GitHub repository is MIT; the dataset there is CC-BY-4.0.
	Only the model weights distributed here and in the GitHub Releases are
	CC-BY-NC-4.0.

	## Citation

	```bibtex
	@article{cardiosafe2026,
	  title   = {CardioSafe: multi-task prediction of cardiac ion channel
	             activity with reverse-leak audited benchmarking},
	  author  = {Jovanović, Mihailo and Weidener, Lukas and Brkić, Marko and
	             Ulgac, Emre and Meduri, Aakaash},
	  year    = {2026},
	  journal = {bioRxiv},
	  doi     = {10.64898/2026.05.06.723181},
	  url     = {https://www.biorxiv.org/content/10.64898/2026.05.06.723181v1}
	}
	```