Multi-task DDI Prediction Model (Research Artifact)
Do not deploy this model in any clinical context. This is a research artifact from a methodology audit. A simpler MLP baseline on the same features outperforms it by 9.3 accuracy points and 14.8 macro-F1 points. The checkpoints are released solely so the audit results can be reproduced. See the GitHub repository and the manuscript for full context.
What this is
Multi-task graph neural network for drug-drug interaction prediction, trained on a benchmark constructed from DrugBank 5.1.15 and DDInter. Predicts:
- Severity (4 classes): none, Minor, Moderate, Major
- Mechanism (7 binary heads): CYP induction, CYP inhibition, QT prolongation, additive toxicity, absorption interference, protein binding, renal excretion
Architecture: GATv2 encoder (3 layers, 4 heads, hidden 128) + JumpingKnowledge max-pooling + mean-max readout, plus a 2-layer MLP branch over a 10-dimensional leakage-corrected pharmacokinetic vector. Per-drug embeddings are concatenated, fed into an interaction trunk over [h_a; h_b; h_a ⊙ h_b; |h_a − h_b|], then split into severity (4 logits) and mechanism (7 logits) heads. Full architectural details are in Section 3.1 of the manuscript.
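The pairwise input to the interaction trunk can be illustrated in plain Python. This is a minimal sketch, not the repository's code: the function name `interaction_features` is hypothetical, the toy embeddings are 3-dimensional, and the third block is assumed to be an element-wise product.

```python
from typing import List


def interaction_features(h_a: List[float], h_b: List[float]) -> List[float]:
    """Build [h_a; h_b; h_a * h_b; |h_a - h_b|] for one drug pair.

    Hypothetical helper; assumes the third block is an element-wise product.
    """
    if len(h_a) != len(h_b):
        raise ValueError("embeddings must have the same dimension")
    product = [a * b for a, b in zip(h_a, h_b)]
    abs_diff = [abs(a - b) for a, b in zip(h_a, h_b)]
    # Concatenate the four blocks into one flat feature vector.
    return h_a + h_b + product + abs_diff


# Toy per-drug embeddings (the real ones come from the GNN and PK branches).
h_a = [1.0, -2.0, 0.5]
h_b = [3.0, 2.0, 0.5]
features = interaction_features(h_a, h_b)
print(len(features))  # 4 * embedding dimension
```

The product and absolute-difference blocks are symmetric in the two drugs, while the leading [h_a; h_b] blocks depend on pair order.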
Why this exists
This model was trained as one of seven variants serving as instruments for a benchmark validation audit, not as a DDI screening tool. The audit found three patterns by which standard benchmark assembly choices systematically disadvantage drugs central to global health:
- Cross-database name resolution silently drops drugs. An exact-string merge of DrugBank and DDInter initially excluded 1,825 documented interactions (including all 284 rifampin pairs) because DDInter uses WHO INN (rifampicin) and DrugBank uses American generic names (rifampin). Alias correction recovers them.
- Random pair splits make cold-start evaluation structurally impossible. 99.96% of test pairs share both drugs with the training set. The benchmark cannot tell you whether the model generalizes to novel agents.
- Aggregate accuracy masks architecture-specific failure modes. This GNN over-alerts on the rare Minor severity class (precision 0.36, recall 0.88); the MLP baseline under-alerts (precision 0.87, recall 0.68). Bootstrap CIs on per-class precision don't overlap. Both have similar F1 but qualitatively different clinical alert behavior.
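The name-resolution failure in the first finding can be reproduced on toy data. The sketch below is illustrative only: `INN_TO_DRUGBANK`, `canonical`, and `merge_pairs` are hypothetical names, the input pairs are made up, and only the rifampicin/rifampin alias comes from the text above. Matching is case-insensitive here, which is an assumption about the merge.

```python
# Hypothetical alias table mapping WHO INN spellings to DrugBank names.
# Only the rifampicin/rifampin pair comes from the audit description.
INN_TO_DRUGBANK = {"rifampicin": "rifampin"}


def canonical(name: str) -> str:
    """Lower-case, strip, and map known INN spellings to DrugBank names."""
    key = name.strip().lower()
    return INN_TO_DRUGBANK.get(key, key)


def merge_pairs(ddinter_pairs, drugbank_names, use_aliases):
    """Keep DDInter pairs whose two drugs both resolve to a DrugBank name."""
    known = {n.strip().lower() for n in drugbank_names}
    kept = []
    for a, b in ddinter_pairs:
        if use_aliases:
            a, b = canonical(a), canonical(b)
        else:
            a, b = a.strip().lower(), b.strip().lower()
        if a in known and b in known:
            kept.append((a, b))
    return kept


# Toy inputs, not records from either database.
drugbank_names = ["Rifampin", "Isoniazid", "Efavirenz"]
ddinter_pairs = [("Rifampicin", "Efavirenz"), ("Isoniazid", "Efavirenz")]

exact = merge_pairs(ddinter_pairs, drugbank_names, use_aliases=False)
fixed = merge_pairs(ddinter_pairs, drugbank_names, use_aliases=True)
print(len(exact), len(fixed))
```

Without the alias table the rifampicin pair is dropped silently: no error is raised, the merged set is just smaller.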
A bonus contribution surfaced during the audit: indirect label leakage in the pharmacokinetic features. CYP-inducer flags carried 2.5× lift on CYP-induction labels (both originate from the same DrugBank curation). 20 of 30 PK columns were dropped to fix this. All checkpoints here use the leakage-corrected 10-column vector.
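A lift check of this kind can be sketched as follows. The definition used here, P(label | flag) / P(label), is an assumption about how the audit computed lift; the function name and the toy data are hypothetical.

```python
def lift(flags, labels):
    """Return P(label=1 | flag=1) / P(label=1) for two binary sequences."""
    if len(flags) != len(labels) or not flags:
        raise ValueError("inputs must be non-empty and of equal length")
    base_rate = sum(labels) / len(labels)
    flagged = [y for f, y in zip(flags, labels) if f == 1]
    if base_rate == 0 or not flagged:
        raise ValueError("lift is undefined without positives or flagged rows")
    return (sum(flagged) / len(flagged)) / base_rate


# Toy data: 10 pairs, 4 carry the feature flag, 4 carry the label.
flags = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
labels = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
print(round(lift(flags, labels), 3))
```

A lift well above 1 for a feature that shares its curation source with the label is a signal to inspect the column for leakage before training on it.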
Intended use
Primary use: Reproducing the audit results in the accompanying paper. The checkpoints exist so reviewers can rerun the evaluation suite (evaluate.py, ablation eval, agreement analysis, cold-start partitioning, statistical tests, TB-tier breakdown) without retraining from scratch; full retraining takes 5 to 8 hours on an Apple M4 Max.
Out of scope (do not do any of this):
- Clinical decision support
- Patient-facing DDI alerts
- Prescription validation
- Pharmacy software integration
- Any downstream task where false negatives could harm patients
- Generalization to drugs not in the training distribution (the cold-start evaluation is structurally empty; see finding 2)
Training data
| Source | Used for |
|---|---|
| DrugBank 5.1.15 | Drug structures (SMILES), enzyme/transporter annotations, free-text interaction descriptions (mechanism regex extraction) |
| DDInter | Severity labels (Major/Moderate/Minor) |
| Sampled negatives | The none class, at a 1:1 ratio against documented positives, sampled from valid drug combinations absent from DrugBank |
Final dataset: 130,014 pairs (65,007 documented positive + 65,007 sampled negative), severity-stratified 80/10/10 split with seed 42.
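Because the split is made over pairs rather than over drugs, it is worth checking how many test pairs contain a drug the model has never seen. The sketch below shows one way to partition test pairs by training-set coverage; the function name and the toy pairs are hypothetical and not taken from the repository's cold-start script.

```python
from collections import Counter


def cold_start_partition(train_pairs, test_pairs):
    """Count test pairs by how many of their two drugs appear in training."""
    seen = {drug for pair in train_pairs for drug in pair}
    names = {2: "both_seen", 1: "one_seen", 0: "neither_seen"}
    counts = Counter()
    for a, b in test_pairs:
        n_seen = (a in seen) + (b in seen)
        counts[names[n_seen]] += 1
    return counts


# Toy pairs with placeholder drug identifiers.
train_pairs = [("A", "B"), ("B", "C"), ("C", "D")]
test_pairs = [("A", "C"), ("B", "D"), ("A", "E"), ("F", "G")]

counts = cold_start_partition(train_pairs, test_pairs)
share_both = counts["both_seen"] / len(test_pairs)
print(dict(counts), share_both)
```

On a benchmark where the both_seen share is close to 1, the one_seen and neither_seen buckets are too small to support any claim about novel drugs.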
Positive-unlabeled caveat: any pair not recorded in DrugBank is treated as non-interacting during training. This may underestimate false negatives, especially for less-studied drug pairs.
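A minimal sketch of 1:1 negative sampling, assuming negatives are drawn uniformly from unordered pairs that are not in the positive set. The helper name, the placeholder drug identifiers, and the reuse of seed 42 are illustrative assumptions; the repository's sampler may apply additional validity filters.

```python
import random
from itertools import combinations


def sample_negatives(drugs, positives, seed=42):
    """Sample as many unordered non-positive pairs as there are positives."""
    positive_set = {frozenset(p) for p in positives}
    # Enumerating all pairs is fine for a toy example; at benchmark scale
    # rejection sampling would avoid materialising every combination.
    candidates = [
        pair for pair in combinations(sorted(drugs), 2)
        if frozenset(pair) not in positive_set
    ]
    if len(candidates) < len(positive_set):
        raise ValueError("not enough unlabeled pairs for a 1:1 ratio")
    rng = random.Random(seed)
    return rng.sample(candidates, len(positive_set))


drugs = ["A", "B", "C", "D", "E", "F"]
positives = [("A", "B"), ("C", "D"), ("E", "F"), ("B", "A")]  # last is a duplicate
negatives = sample_negatives(drugs, positives)
print(negatives)
```

Every sampled pair is only unlabeled, not verified as non-interacting, which is exactly the positive-unlabeled caveat stated above.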
Mechanism label provenance: mechanism labels come from regex pattern-matching over DrugBank free-text descriptions. 19.1% of positive pairs match no mechanism keyword and are excluded from the mechanism head's training signal. Reported mechanism-head performance reflects the regex labels, not ground-truth pharmacology.
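The labeling step can be pictured with a small regex sketch. The patterns, label keys, and example sentences below are hypothetical stand-ins written for illustration; the audit's actual regexes and DrugBank descriptions are not reproduced here.

```python
import re

# Hypothetical keyword patterns covering three of the seven mechanism heads.
MECHANISM_PATTERNS = {
    "qt_prolongation": re.compile(r"\bQTc?[- ]prolong", re.IGNORECASE),
    "renal_excretion": re.compile(r"\b(?:renal|excretion)\b", re.IGNORECASE),
    "protein_binding": re.compile(r"\bprotein[- ]binding\b", re.IGNORECASE),
}


def mechanism_labels(description: str) -> dict:
    """Return a 0/1 label per mechanism based on keyword matches."""
    return {
        name: int(bool(pattern.search(description)))
        for name, pattern in MECHANISM_PATTERNS.items()
    }


# Made-up sentences in the general style of interaction descriptions.
descriptions = [
    "The risk of QTc prolongation can be increased when Drug A is combined with Drug B.",
    "Drug A may decrease the excretion rate of Drug B.",
    "The metabolism of Drug B can be increased when combined with Drug A.",
]

labelled = [mechanism_labels(d) for d in descriptions]
# Pairs with no keyword match carry no mechanism training signal.
unmatched = sum(1 for row in labelled if not any(row.values()))
print(unmatched / len(descriptions))
```

The third sentence matches nothing, so a pair like it would fall into the unmatched share that is excluded from the mechanism head's loss.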
Evaluation
Single fixed test set (n = 13,002). The GNN row is this checkpoint; the MLP row is the 5-fold CV ensemble (Morgan FP + PK), which outperforms this model on every aggregate metric except calibration.
| Model | Accuracy | Macro-F1 | Sev AUROC | Mech AUROC | ECE |
|---|---|---|---|---|---|
| GNN+PK (this model) | 0.862 | 0.748 | 0.973 | 0.948 | 0.008 |
| MLP baseline (5-fold ensemble) | 0.955 | 0.896 | 0.991 | 0.984 | 0.011 |
McNemar p < 10⁻⁶ for the MLP-vs-GNN comparison; paired-bootstrap 95% CI of the accuracy gap is [0.087, 0.099]. The GNN's only edge over the MLP is on rare-class Minor recall (0.88 vs 0.68) and overall calibration (ECE 0.008 vs 0.011), both at the cost of severe over-alerting (Minor precision 0.36 vs 0.87).
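For readers who want to see what the paired test measures, here is a minimal McNemar sketch on toy predictions. It assumes the continuity-corrected chi-square form of the test; the manuscript may use a different variant (for example the exact binomial version), and the function name and data are hypothetical.

```python
import math


def mcnemar(y_true, pred_a, pred_b):
    """Continuity-corrected McNemar test on paired predictions.

    Returns (b, c, statistic, p_value), where b counts items only model A
    gets right and c counts items only model B gets right.
    """
    rows = list(zip(y_true, pred_a, pred_b))
    b = sum(1 for t, a, m in rows if a == t and m != t)
    c = sum(1 for t, a, m in rows if a != t and m == t)
    if b + c == 0:
        return b, c, 0.0, 1.0
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return b, c, stat, p_value


# Toy data: 20 items, all with true class 1.
y_true = [1] * 20
pred_a = [1] * 18 + [0] * 2            # model A: wrong on the last 2
pred_b = [1] * 8 + [0] * 10 + [1] * 2  # model B: wrong on items 8..17

b, c, stat, p_value = mcnemar(y_true, pred_a, pred_b)
print(b, c, round(stat, 3), round(p_value, 4))
```

Only the discordant items (b and c) enter the statistic, so two models with the same accuracy on different items are not automatically equivalent.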
On TB-relevant subsets (Section 5.5 of the manuscript), all models degrade. GNN+PK ECE rises from 0.008 on the full test set to 0.091 on first-line TB pairs and 0.135 on ARV co-administration pairs. The model that looks well-calibrated overall is poorly calibrated on the specific subgroups most relevant for TB-HIV prescribing.
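The gap between overall and subgroup calibration is easy to reproduce on toy data. The sketch below assumes ECE is computed from top-label confidence with 10 equal-width bins; the binning used in the manuscript is not stated here, and the function name and numbers are hypothetical.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: sum over bins of (|B| / n) * |acc(B) - conf(B)|."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, hit))
    n = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += len(members) / n * abs(accuracy - avg_conf)
    return ece


# Toy data: 20 well-calibrated majority items plus a 5-item subgroup.
confidences = [0.95] * 20 + [0.85] * 5
correct = [1] * 19 + [0] + [1, 1, 1, 0, 0]

full_ece = expected_calibration_error(confidences, correct)
subgroup_ece = expected_calibration_error(confidences[20:], correct[20:])
print(round(full_ece, 3), round(subgroup_ece, 3))
```

The subgroup's miscalibration is diluted by its small weight in the full-set average, which is why a low overall ECE says little about a small clinical subgroup.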
Precision and recall for rifampin CYP induction were perfect (16/16) on the rifampin pairs that survived alias correction, but n = 16 is far too small to validate the model for clinical use.
Files
| File | Description |
|---|---|
| ddi_best.pt | Full multi-task GNN+PK, best validation checkpoint |
| ablation_gnn_only.pt | Same architecture without the PK branch (severity acc 0.849) |
| ablation_pk_only.pt | PK branch only, no molecular graph (severity acc 0.635) |
| ablation_single_task.pt | Full architecture trained on severity only, mechanism loss disabled (severity acc 0.851) |
| manuscript.pdf | Full audit write-up (Section 3 has architecture; Section 4 has data pipeline including the leakage correction; Section 5 has full results) |
The baseline (RF, MLP, XGB) checkpoints live in the GitHub repository's models/ directory if regenerated locally; they are not mirrored here because they're sklearn .pkl files and not the focus of the model card.
How to load
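```python
import torch

# DDIModel and ProjectConfig come from the GitHub repo.
from gnn import DDIModel
from config import ProjectConfig

cfg = ProjectConfig()
model = DDIModel(cfg.atom, cfg.gnn)

# The checkpoint is a plain state_dict, so it loads with weights_only=True.
state = torch.load("ddi_best.pt", map_location="cpu", weights_only=True)
model.load_state_dict(state)
model.eval()  # switch to evaluation mode
```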
You will need the code from the GitHub repository: the checkpoint is a state_dict that requires the DDIModel class to instantiate.
Limitations
Single-benchmark audit; cross-benchmark validation is needed before generalizing the three findings. Mechanism labels are regex-derived from DrugBank free text. Positive-unlabeled assumption on negatives may underestimate false negatives. Cold-start evaluation is structurally empty on this benchmark (finding 2). The TB cohort includes only 18 drugs that passed the alias-corrected merge with sufficient test coverage; newer agents (delamanid, pretomanid, dolutegravir) lack test pairs. Encoder is a moderate-depth GATv2 without pretraining; pretrained encoders may shift absolute metrics but are unlikely to overturn the audit's conclusions, since the fundamental asymmetry remains: Morgan fingerprints encode 2,048 bits of substructural information directly, while the GNN must learn it from 104,011 pairs.
Citation
@misc{keith_ddi_audit_2026,
author = {Arlene Keith},
title = {Validation Gaps in Drug-Drug Interaction Prediction Benchmarks for Tuberculosis-HIV Pharmacotherapy},
year = {2026},
note = {Capstone project, AI in Healthcare, Johns Hopkins University, Spring 2026, unpublished},
url = {https://github.com/jsf3467v/multi-task-ddi-audit}
}
Acknowledgements
DrugBank (Wishart et al., 2018) and DDInter (Xiong et al., 2022) are the data sources. The audit framing builds on prior clinical ML benchmark critiques by Wong et al. (2021), Kapoor and Narayanan (2023), Huang et al. (2021), and Shen et al. (2025). Full references are in the manuscript.