Multi-task DDI Prediction Model (Research Artifact)

Do not deploy this model in any clinical context. This is a research artifact from a methodology audit. A simpler MLP baseline on the same features outperforms it by 9.3 accuracy points and 14.8 macro-F1 points. The checkpoints are released solely so the audit results can be reproduced. See the GitHub repository and the manuscript for full context.

What this is

A multi-task graph neural network for drug-drug interaction (DDI) prediction, trained on a benchmark constructed from DrugBank 5.1.15 and DDInter. It predicts:

Severity — 4 classes: none, Minor, Moderate, Major
Mechanism — 7 binary heads: CYP induction, CYP inhibition, QT prolongation, additive toxicity, absorption interference, protein binding, renal excretion

Architecture: GATv2 encoder (3 layers, 4 heads, hidden 128) + JumpingKnowledge max-pooling + mean-max readout, plus a 2-layer MLP branch over a 10-dimensional leakage-corrected pharmacokinetic vector. Per-drug embeddings are concatenated, fed into an interaction trunk over [h_a; h_b; h_a ⊙ h_b; |h_a − h_b|], then split into severity (4 logits) and mechanism (7 logits) heads. Full architectural details in Section 3.1 of the manuscript.
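
The pairwise input to the interaction trunk can be sketched in plain Python. This is an illustration only: the function name `interaction_features` and the toy 3-dimensional embeddings are assumptions, and the released model works on learned embeddings rather than Python lists.

```python
from typing import List


def interaction_features(h_a: List[float], h_b: List[float]) -> List[float]:
    """Build [h_a; h_b; h_a * h_b; |h_a - h_b|] for one drug pair."""
    if len(h_a) != len(h_b):
        raise ValueError("embeddings must have the same dimension")
    product = [a * b for a, b in zip(h_a, h_b)]        # element-wise (Hadamard) product
    abs_diff = [abs(a - b) for a, b in zip(h_a, h_b)]  # element-wise absolute difference
    return h_a + h_b + product + abs_diff              # concatenation, length 4 * d


# Toy 3-dimensional embeddings (hypothetical values).
h_a = [1.0, -2.0, 0.5]
h_b = [0.0, 4.0, 0.5]
features = interaction_features(h_a, h_b)
print(len(features))  # 4 * 3 = 12 entries
```

The product and absolute-difference blocks are unchanged when the two drugs are swapped; the first two blocks are not, so (a, b) and (b, a) give different input vectors.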

Why this exists

This model was trained as one of seven variants serving as instruments for a benchmark validation audit, not as a DDI screening tool. The audit found three patterns by which standard benchmark assembly choices systematically disadvantage drugs central to global health:

1. Cross-database name resolution silently drops drugs. An exact-string DrugBank ↔ DDInter merge initially excluded 1,825 documented interactions (including all 284 rifampin pairs) because DDInter uses the WHO INN (rifampicin) while DrugBank uses the US generic name (rifampin). Alias correction recovers them.
2. Random pair splits make cold-start evaluation structurally impossible. 99.96% of test pairs have both drugs in the training set, so the benchmark cannot tell you whether the model generalizes to novel agents.
3. Aggregate accuracy masks architecture-specific failure modes. This GNN over-alerts on the rare Minor severity class (precision 0.36, recall 0.88); the MLP baseline under-alerts (precision 0.87, recall 0.68). Bootstrap CIs on per-class precision do not overlap. These figures imply a Minor-class F1 of roughly 0.51 for the GNN and 0.76 for the MLP, but the more consequential difference is the direction of the errors, which gives the two models qualitatively different clinical alert behavior.
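
The name-resolution failure in the first finding is easy to reproduce on a toy example. The sketch below is a minimal illustration, not the audit's pipeline: the `ALIASES` table, the `canonical` helper, and the three drug names are assumptions chosen to show the rifampin/rifampicin case.

```python
from typing import Dict, Set

# Hypothetical alias table: INN spelling -> name used on the DrugBank side.
ALIASES: Dict[str, str] = {"rifampicin": "rifampin"}


def canonical(name: str) -> str:
    """Lower-case, strip whitespace, and map known aliases to one name."""
    key = name.strip().lower()
    return ALIASES.get(key, key)


drugbank_names: Set[str] = {"Rifampin", "Isoniazid", "Efavirenz"}
ddinter_names: Set[str] = {"Rifampicin", "Isoniazid", "Efavirenz"}

# Exact-string merge: the rifampin/rifampicin row is dropped without any error.
exact = drugbank_names & ddinter_names

# Alias-aware merge: compare canonical forms instead of raw strings.
resolved = {canonical(n) for n in drugbank_names} & {canonical(n) for n in ddinter_names}

# Names on one side with no partner under exact matching; logging these makes the drop visible.
unmatched = ddinter_names - exact

print(sorted(exact))
print(sorted(resolved))
print(sorted(unmatched))
```

Reporting the unmatched names after every cross-database join is one simple way to notice this kind of silent exclusion before it reaches the benchmark.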

A bonus contribution surfaced during the audit: indirect label leakage in the pharmacokinetic features. CYP-inducer flags carried 2.5× lift on CYP-induction labels (both originate from the same DrugBank curation). 20 of 30 PK columns were dropped to fix this. All checkpoints here use the leakage-corrected 10-column vector.
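
As a rough illustration of how such leakage can be screened for, the sketch below computes lift as P(label = 1 | flag = 1) divided by P(label = 1). This definition is an assumption; the manuscript's exact formula may differ. The `lift` function and the toy data are hypothetical.

```python
from typing import Sequence


def lift(flags: Sequence[int], labels: Sequence[int]) -> float:
    """P(label = 1 | flag = 1) divided by P(label = 1)."""
    if len(flags) != len(labels) or not flags:
        raise ValueError("flags and labels must be non-empty and equally long")
    base_rate = sum(labels) / len(labels)
    flagged = [y for f, y in zip(flags, labels) if f == 1]
    if base_rate == 0 or not flagged:
        raise ValueError("lift is undefined without positives and flagged rows")
    return (sum(flagged) / len(flagged)) / base_rate


# Toy data (hypothetical): 10 pairs, 4 carry the feature flag, 4 have the label.
flags = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
labels = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
print(round(lift(flags, labels), 3))
```

A lift well above 1 for a feature that shares its curation source with the label is a signal to inspect that feature before training on it.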

Intended use

Primary use: Reproducing the audit results in the accompanying paper. The checkpoints exist so reviewers can rerun the evaluation suite (evaluate.py, ablation eval, agreement analysis, cold-start partitioning, statistical tests, TB-tier breakdown) without retraining from scratch — full retraining takes 5–8 hours on Apple M4 Max.

Out of scope (do not do any of this):

Clinical decision support
Patient-facing DDI alerts
Prescription validation
Pharmacy software integration
Any downstream task where false negatives could harm patients
Generalization to drugs not in the training distribution (the cold-start evaluation is structurally empty; see finding 2)

Training data

Source	Used for
DrugBank 5.1.15	Drug structures (SMILES), enzyme/transporter annotations, free-text interaction descriptions (mechanism regex extraction)
DDInter	Severity labels (Major/Moderate/Minor)
Sampled negatives	The `none` class — 1:1 ratio against documented positives, sampled from valid drug combinations absent from DrugBank
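
A minimal sketch of 1:1 negative sampling follows. It treats every unordered pair of known drugs as a valid candidate, which is an assumption: the benchmark's own validity criteria are not stated here. The function name `sample_negatives`, the seed default, and the placeholder identifiers are hypothetical.

```python
import itertools
import random
from typing import List, Set, Tuple

Pair = Tuple[str, str]


def sample_negatives(drugs: List[str], positives: Set[Pair], seed: int = 0) -> List[Pair]:
    """Draw as many unrecorded pairs as there are positives (1:1 ratio)."""
    known = {tuple(sorted(p)) for p in positives}  # order-insensitive lookup
    candidates = [p for p in itertools.combinations(sorted(drugs), 2) if p not in known]
    if len(candidates) < len(known):
        raise ValueError("not enough unrecorded pairs for a 1:1 ratio")
    return random.Random(seed).sample(candidates, len(known))


drugs = ["A", "B", "C", "D", "E"]                 # placeholder drug identifiers
positives = {("A", "B"), ("C", "A"), ("D", "E")}  # placeholder documented pairs
negatives = sample_negatives(drugs, positives)
print(negatives)
```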

Final dataset: 130,014 pairs (65,007 documented positive + 65,007 sampled negative), severity-stratified 80/10/10 split with seed 42.
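
Because the split is made over pairs rather than over drugs, it is worth counting how many test pairs contain a drug the model has never seen. The sketch below shows one way to do that; `cold_start_partition` and the placeholder pairs are hypothetical and not taken from the repository.

```python
from typing import Dict, Iterable, Set, Tuple

Pair = Tuple[str, str]


def cold_start_partition(train: Iterable[Pair], test: Iterable[Pair]) -> Dict[str, int]:
    """Count test pairs by how many of their two drugs appear in training."""
    seen: Set[str] = {drug for pair in train for drug in pair}
    counts = {"both_seen": 0, "one_seen": 0, "none_seen": 0}
    for a, b in test:
        n_seen = (a in seen) + (b in seen)  # 0, 1, or 2
        counts[("none_seen", "one_seen", "both_seen")[n_seen]] += 1
    return counts


train = [("A", "B"), ("B", "C"), ("C", "D")]  # placeholder pairs
test = [("A", "C"), ("A", "D"), ("B", "E")]
counts = cold_start_partition(train, test)
print(counts)
```

If the `one_seen` and `none_seen` buckets are nearly empty, as reported for this benchmark, no metric computed on the test set says anything about unseen drugs.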

Positive-unlabeled caveat: any pair not recorded in DrugBank is treated as non-interacting during training. Some sampled negatives may be real but undocumented interactions, so the reported false-negative rate may be underestimated, especially for less-studied drug pairs.

Mechanism label provenance: mechanism labels come from regex pattern-matching over DrugBank free-text descriptions. 19.1% of positive pairs match no mechanism keyword and are excluded from the mechanism head's training signal. Reported mechanism-head performance reflects the regex labels, not ground-truth pharmacology.
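
The sketch below shows the general shape of keyword-based mechanism labeling. The three patterns and the example sentences are hypothetical and are not the audit's actual regexes; they only illustrate how a description that matches no pattern ends up without a mechanism label.

```python
import re
from typing import Dict

# Hypothetical patterns for illustration only (3 of the 7 mechanisms).
MECHANISM_PATTERNS = {
    "qt_prolongation": re.compile(r"\bQTc?\b.*prolong", re.IGNORECASE),
    "renal_excretion": re.compile(r"\bexcretion\b", re.IGNORECASE),
    "protein_binding": re.compile(r"\bprotein binding\b", re.IGNORECASE),
}


def mechanism_labels(description: str) -> Dict[str, int]:
    """Return one binary label per mechanism based on keyword matches."""
    return {name: int(bool(p.search(description))) for name, p in MECHANISM_PATTERNS.items()}


def has_any_mechanism(description: str) -> bool:
    """False means the pair contributes no signal to the mechanism head."""
    return any(mechanism_labels(description).values())


examples = [
    "The risk or severity of QTc prolongation can be increased when drug A is combined with drug B.",
    "Drug A may decrease the excretion rate of drug B.",
    "The therapeutic efficacy of drug B can be decreased when used in combination with drug A.",
]
for text in examples:
    print(mechanism_labels(text), has_any_mechanism(text))
```

The third sentence matches nothing, which is the situation described above for 19.1% of positive pairs.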

Evaluation

Single fixed test set (n = 13,002). The GNN+PK row is this checkpoint; the MLP row is the 5-fold CV ensemble (Morgan FP + PK), which outperforms this model on every aggregate metric except calibration (ECE, lower is better).

Model	Accuracy	Macro-F1	Sev AUROC	Mech AUROC	ECE
GNN+PK (this model)	0.862	0.748	0.973	0.948	0.008
MLP baseline (5-fold ensemble)	0.955	0.896	0.991	0.984	0.011

McNemar p < 10⁻⁶ for the MLP-vs-GNN comparison; paired-bootstrap 95% CI of the accuracy gap is [0.087, 0.099]. The GNN's only edge over the MLP is on rare-class Minor recall (0.88 vs 0.68) and overall calibration (ECE 0.008 vs 0.011) — both at the cost of severe over-alerting (Minor precision 0.36 vs 0.87).
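
For readers unfamiliar with McNemar's test, the sketch below computes the exact two-sided version from the two discordant counts. Whether the audit used the exact or the chi-square form is not stated here, so treat this as an assumption; the counts in the example are hypothetical and unrelated to the reported results.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the discordant counts.

    b: examples model A classifies correctly and model B does not
    c: examples model B classifies correctly and model A does not
    """
    n = b + c
    if n == 0:
        return 1.0  # the models never disagree on correctness
    k = min(b, c)
    # P(X <= k) for X ~ Binomial(n, 0.5), using exact integer arithmetic
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Hypothetical discordant counts, not taken from the audit.
print(mcnemar_exact(b=15, c=5))
```

Only the examples on which the two models disagree enter the test; pairs both models get right or both get wrong carry no information about which model is better.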

On TB-relevant subsets (Section 5.5 of the manuscript), all models degrade. GNN+PK ECE rises from 0.008 on the full test set to 0.091 on first-line TB pairs and 0.135 on ARV co-administration pairs. The model that looks well calibrated overall is poorly calibrated on the specific subgroups most relevant for TB-HIV prescribing.
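
A minimal sketch of expected calibration error (ECE) with equal-width confidence bins is given below. The bin count of 10 and the toy predictions are assumptions; the audit's binning scheme may differ. To get a subgroup ECE, the same function would be called on only the rows belonging to that subgroup.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(
    confidences: Sequence[float], correct: Sequence[bool], n_bins: int = 10
) -> float:
    """Equal-width-bin ECE: sample-weighted mean of |accuracy - confidence| per bin."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and equally long")
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / total * abs(accuracy - avg_conf)
    return ece


# Toy predictions (hypothetical): top-class confidence and whether it was right.
confidences = [0.95, 0.95, 0.95, 0.95, 0.55, 0.55]
correct = [True, True, True, False, True, False]
print(round(expected_calibration_error(confidences, correct), 4))
```

Because each bin is weighted by its share of the samples, a small subgroup with poor calibration can be nearly invisible in the full-test-set figure.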

The model reached perfect precision and recall (16/16) on CYP induction for the rifampin pairs that survived alias correction, but n = 16 is far too small to validate the model for clinical use.

Files

File	Description
`ddi_best.pt`	Full multi-task GNN+PK, best validation checkpoint
`ablation_gnn_only.pt`	Same architecture without the PK branch (severity acc 0.849)
`ablation_pk_only.pt`	PK branch only, no molecular graph (severity acc 0.635)
`ablation_single_task.pt`	Full architecture trained on severity only, mechanism loss disabled (severity acc 0.851)
`manuscript.pdf`	Full audit write-up (Section 3 has architecture; Section 4 has data pipeline including the leakage correction; Section 5 has full results)

The baseline (RF, MLP, XGB) checkpoints live in the GitHub repository's models/ directory if regenerated locally; they are not mirrored here because they're sklearn .pkl files and not the focus of the model card.

How to load

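import torch
from gnn import DDIModel        # from the GitHub repo
from config import ProjectConfig

# Rebuild the architecture first, then load the released weights into it.
cfg = ProjectConfig()
model = DDIModel(cfg.atom, cfg.gnn)
# weights_only=True restricts unpickling to tensors and basic containers.
state = torch.load("ddi_best.pt", map_location="cpu", weights_only=True)
model.load_state_dict(state)
model.eval()  # inference mode: disables dropout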

You will need the code from the GitHub repository — the checkpoint is a state_dict that requires the DDIModel class to instantiate.

Limitations

Single-benchmark audit; cross-benchmark validation is needed before generalizing the three findings.
Mechanism labels are regex-derived from DrugBank free text.
The positive-unlabeled assumption on negatives may underestimate false negatives.
Cold-start evaluation is structurally empty on this benchmark (finding 2).
The TB cohort includes only 18 drugs that passed the alias-corrected merge with sufficient test coverage; newer agents (delamanid, pretomanid, dolutegravir) lack test pairs.
The encoder is a moderate-depth GATv2 without pretraining. Pretrained encoders may shift absolute metrics but are unlikely to overturn the audit's conclusions, because the fundamental asymmetry remains: Morgan fingerprints encode 2,048 bits of substructural information directly, while the GNN must learn it from 104,011 pairs.

Citation

@misc{keith_ddi_audit_2026,
  author = {Arlene Keith},
  title  = {Validation Gaps in Drug-Drug Interaction Prediction Benchmarks for Tuberculosis-HIV Pharmacotherapy},
  year   = {2026},
  note   = {Capstone project, AI in Healthcare, Johns Hopkins University, Spring 2026, unpublished},
  url    = {https://github.com/jsf3467v/multi-task-ddi-audit}
}

Acknowledgements

DrugBank (Wishart et al., 2018) and DDInter (Xiong et al., 2022) are the data sources. The audit framing builds on prior clinical ML benchmark critiques by Wong et al. (2021), Kapoor and Narayanan (2023), Huang et al. (2021), and Shen et al. (2025). Full references are in the manuscript.
