Responsible ML Analysis
DDI Checker: DrugBank interaction graph (4,795 drugs · 824,249 pairs)
We address all 4 required Responsible ML topics (RM1–RM4).
RM2 and RM4 are fully automated; results are regenerated by pipeline/responsible_ml.py.
RM1: Explainability
Dict Lookup (primary path)
The DrugBank dictionary lookup is fully interpretable by design. When a pair is found, the system returns the exact natural-language description stored in DrugBank:
"Acetylsalicylic acid may increase the anticoagulant activities of Warfarin resulting in
an increased risk of bleeding and haemorrhage."
No black-box reasoning is involved. The user can trace the result directly to its primary source.
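A minimal sketch of such a lookup, assuming an in-memory table keyed by unordered drug pairs. The table name `DDI_DESCRIPTIONS` and the function `lookup_documented` are hypothetical and are not the project's actual loader:

```python
from typing import Optional

# Hypothetical in-memory table. Keys are frozensets so that (A, B) and (B, A)
# resolve to the same entry; the value is the stored description, unmodified.
DDI_DESCRIPTIONS = {
    frozenset({"Acetylsalicylic acid", "Warfarin"}): (
        "Acetylsalicylic acid may increase the anticoagulant activities of "
        "Warfarin resulting in an increased risk of bleeding and haemorrhage."
    ),
}


def lookup_documented(drug_a: str, drug_b: str) -> Optional[str]:
    """Return the stored description verbatim, or None if the pair is absent."""
    return DDI_DESCRIPTIONS.get(frozenset({drug_a, drug_b}))


print(lookup_documented("Warfarin", "Acetylsalicylic acid"))
print(lookup_documented("Warfarin", "Epinephrine"))  # absent from this toy table
```

Because the function only reads from the table, the returned text is always traceable to a stored entry.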
Logistic Regression (non-graph baseline, TM10G)
Logistic Regression produces coefficient-level feature attribution. The top features (by
|coefficient|) are available in data/evaluation/baselines_results.json and displayed on the
/results page. Key findings:
| Feature prefix | Meaning | Top example |
|---|---|---|
| prod_cat_* | Both drugs share a pharmacological category | prod_cat_central_nervous_system_depre → +1.06 |
| diff_* | Absolute difference in a continuous property | diff_monoisotopic_mass → +2.17 |
| prod_* (continuous) | Element-wise product of continuous features | prod_monoisotopic_mass → −1.10 |
The prod_cat_* features are directly clinically meaningful: if both drugs are "CYP3A4 substrates",
they compete for the same metabolic pathway, directly causing pharmacokinetic interactions.
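The three feature families could be built roughly as follows. This is a sketch under assumptions: the per-drug layout (a `categories` set and a `continuous` dict) and both function names are hypothetical, and the coefficients in the usage lines are the three values quoted in the table, not the full model.

```python
def pair_features(a: dict, b: dict) -> dict:
    """Build symmetric pair features from two per-drug feature dicts.

    Assumed (hypothetical) layout: "categories" is a set of category names,
    "continuous" is a dict of numeric properties.
    """
    feats = {}
    # prod_cat_*: 1.0 only when both drugs carry the category.
    for cat in sorted(a["categories"] | b["categories"]):
        both = cat in a["categories"] and cat in b["categories"]
        feats[f"prod_cat_{cat}"] = 1.0 if both else 0.0
    # diff_* and prod_*: symmetric in (a, b), so pair order does not matter.
    for name in sorted(a["continuous"].keys() & b["continuous"].keys()):
        x, y = a["continuous"][name], b["continuous"][name]
        feats[f"diff_{name}"] = abs(x - y)
        feats[f"prod_{name}"] = x * y
    return feats


def top_by_abs_coefficient(coefs: dict, k: int = 3) -> list:
    """Rank features by |coefficient|, largest first."""
    return sorted(coefs.items(), key=lambda kv: abs(kv[1]), reverse=True)[:k]


coefs = {
    "prod_cat_central_nervous_system_depre": 1.06,
    "diff_monoisotopic_mass": 2.17,
    "prod_monoisotopic_mass": -1.10,
}
print(top_by_abs_coefficient(coefs, k=2))
```

Ranking by absolute value keeps strongly negative coefficients, such as prod_monoisotopic_mass, visible next to the positive ones.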
GNN β feature ablation
SAGEConv is non-linear, so LR-style coefficient attribution does not apply directly. We use feature-group ablation and decoder component ablation to quantify what drives each prediction.
Feature ablation (warm AUROC drop when feature group is removed):
| Group removed | AUROC | Drop |
|---|---|---|
| Nothing (full 980-dim) | 0.9738 | n/a |
| Structural features [0:212] | 0.8993 | −0.0745 |
| PubMedBERT embeddings [212:980] | 0.9661 | −0.0077 |
Structural features dominate: physicochemical and pharmacological dimensions carry the strongest signal; BERT embeddings add a small but consistent improvement.
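One way to run such a feature-group ablation is to zero a column slice at inference time and re-score. Whether the project zeroes features or retrains without them is not stated here, so the sketch below is only an assumption. The helper names and the toy scorer are hypothetical, and the AUROC is a plain pairwise implementation for small inputs.

```python
def auroc(labels, scores):
    """Pairwise AUROC: share of (positive, negative) pairs ranked correctly.

    Ties count as half a win. Quadratic in size, intended for small examples.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def ablate(rows, start, stop):
    """Return a copy of the feature rows with columns [start:stop] zeroed."""
    return [r[:start] + [0.0] * (stop - start) + r[stop:] for r in rows]


def ablation_drop(score_fn, rows, labels, start, stop):
    """Return (full AUROC, ablated AUROC, ablated minus full)."""
    full = auroc(labels, [score_fn(r) for r in rows])
    cut = auroc(labels, [score_fn(r) for r in ablate(rows, start, stop)])
    return full, cut, cut - full


# Toy data: 3 features per row, and a stand-in scorer in place of the GNN.
rows = [[1.0, 0.2, 0.1], [0.9, 0.1, 0.3], [0.1, 0.3, 0.2], [0.2, 0.1, 0.4]]
labels = [1, 1, 0, 0]
score = lambda r: 2 * r[0] + r[2]

print(ablation_drop(score, rows, labels, 0, 1))  # remove the first feature group
print(ablation_drop(score, rows, labels, 2, 3))  # remove the last feature group
```

With the real model, the slices would be [0:212] and [212:980], and `score` would wrap the trained predictor.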
CN pooling ablation (which decoder components contribute):
| Decoder variant | AUROC | Drop |
|---|---|---|
| Full NCN pooling | 0.9738 | n/a |
| Remove shared DDI neighbours | 0.9719 | −0.0019 |
| Remove shared protein targets | 0.9717 | −0.0021 |
| Remove both (plain MLP) | 0.9724 | −0.0014 |
Full tables are in docs/model_architecture.md.
Source transparency label
Every prediction carries a source field: "documented" (DrugBank verbatim hit), "gnn_predicted" (GNN score above threshold 0.43), or "not_found". This allows users to calibrate trust: a documented DDI includes a literature-backed description; a GNN-predicted one does not.
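The labelling rule can be sketched as a small function. The function name and signature are hypothetical, and the strict comparison is an assumption based on the wording "above threshold":

```python
from typing import Optional

GNN_THRESHOLD = 0.43  # threshold stated in the text above


def label_source(documented_text: Optional[str], gnn_score: Optional[float]) -> str:
    """Pick the source label; a documented hit always wins over a GNN score."""
    if documented_text is not None:
        return "documented"
    # Assumption: "above threshold" means strictly greater than 0.43.
    if gnn_score is not None and gnn_score > GNN_THRESHOLD:
        return "gnn_predicted"
    return "not_found"


print(label_source("verbatim DrugBank text", 0.10))
print(label_source(None, 0.90))
print(label_source(None, 0.20))
```

Checking the documented path first means a DrugBank hit is never relabelled by the model score.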
"Why this was flagged" box (GNN predictions)
For every gnn_predicted result, the checker UI renders a human-readable evidence summary via gnn_predictor.explain(). The box surfaces:
- Shared protein targets, e.g. "Shares 3 common protein targets: CYP2C9, CYP1A2, VKORC1"
- Shared DDI neighbours, e.g. "12 common DDI neighbours – strong graph connectivity"
- Structural overlap: shared CYP substrate classifications, molecular weight proximity
These reasons derive from the same graph signals that drive the NCN decoder (shared-neighbour pooling + node features), closing the explainability loop from model internals to user-visible output. Users never see a bare probability; they always see why.
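The signature and internals of gnn_predictor.explain() are not shown here. As an illustration only, a function that derives the first two kinds of reasons from set intersections might look like this (`explain_pair` and the dict-of-sets layout are hypothetical):

```python
def explain_pair(drug_a, drug_b, targets, neighbours):
    """Build human-readable reasons from shared targets and shared DDI neighbours.

    targets / neighbours: dicts mapping a drug name to a set of names
    (an assumed layout, not the project's data model).
    """
    reasons = []
    shared_targets = sorted(targets.get(drug_a, set()) & targets.get(drug_b, set()))
    if shared_targets:
        reasons.append(
            f"Shares {len(shared_targets)} common protein targets: "
            + ", ".join(shared_targets)
        )
    # The queried drugs themselves are not counted as common neighbours.
    shared_nbrs = (neighbours.get(drug_a, set()) & neighbours.get(drug_b, set())) - {drug_a, drug_b}
    if shared_nbrs:
        reasons.append(f"{len(shared_nbrs)} common DDI neighbours")
    return reasons


targets = {"A": {"CYP2C9", "CYP1A2", "VKORC1", "ALB"}, "B": {"CYP2C9", "CYP1A2", "VKORC1"}}
neighbours = {"A": {"C", "D", "E"}, "B": {"D", "E", "F"}}
for reason in explain_pair("A", "B", targets, neighbours):
    print(reason)
```

An empty list would signal that no graph evidence is available for the pair, which the UI would need to handle separately.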
RM2: Bias / Fairness
Script: pipeline/responsible_ml.py --section bias
Output: data/evaluation/responsible_ml_bias.json
Live: /responsible page, RM2 section
Finding
DrugBank interaction data is heavily skewed toward certain drug classes:
| Category | Mean Degree | Isolated % |
|---|---|---|
| NERVOUS SYSTEM (best) | 864 | 3.7 % |
| CARDIOVASCULAR SYSTEM | 725 | 4.1 % |
| DERMATOLOGICALS | 274 | 29.5 % |
| VARIOUS (worst) | 213 | 43.2 % |
Coverage ratio: 4.1× (mean degree, best vs. worst category: 864 / 213).
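The per-category statistics can be reproduced with a short aggregation. The sketch below assumes a per-drug degree map and a per-drug category map (both hypothetical layouts) and treats the coverage ratio as the ratio of mean degrees, which matches the figures in the table (864 / 213 ≈ 4.06):

```python
from collections import defaultdict


def coverage_by_category(degree, category):
    """Mean degree and share of isolated (degree 0) drugs per category."""
    groups = defaultdict(list)
    for drug, deg in degree.items():
        groups[category[drug]].append(deg)
    return {
        cat: {
            "mean_degree": sum(degs) / len(degs),
            "isolated_pct": 100.0 * sum(d == 0 for d in degs) / len(degs),
        }
        for cat, degs in groups.items()
    }


def coverage_ratio(stats):
    """Best mean degree divided by worst mean degree."""
    means = [s["mean_degree"] for s in stats.values()]
    return float("inf") if min(means) == 0 else max(means) / min(means)


# Toy graph: four drugs in two categories.
degree = {"a": 4, "b": 0, "c": 1, "d": 1}
category = {"a": "N", "b": "N", "c": "V", "d": "V"}
stats = coverage_by_category(degree, category)
print(stats)
print(coverage_ratio(stats))
```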
Hub drugs
Ten drugs have > 1,700 documented interaction partners (e.g. Clozapine: 1,907). These hub drugs are largely CNS agents and CYP3A4 substrates / inhibitors. Their pharmacological profile is heavily over-represented in the training set.
Consequences for the GNN
- Well-studied drugs (nervous system, cardiovascular): GNN link predictor has dense neighbourhood signals → high expected AUC.
- Sparse categories (VARIOUS, DERMATOLOGICALS): many drugs have degree 0 or low degree. The GNN falls back to node features (similar to LR), not graph topology.
- Cold-start drugs (novel drugs with no training edges): heuristics score 0; LR and GNN must rely entirely on node features.
Per-category GNN AUC (TM6 error analysis)
Script: pipeline/responsible_ml.py --section gnn_auc
Output: data/evaluation/responsible_ml_gnn_auc.json
Live: /responsible page, RM2 section
run_per_category_gnn_auc() loads the warm test split (edge_split.npz), scores every
test pair via gnn_predictor.predict(), groups results by the drug's top-level ATC category
(l4_name), and computes AUC-ROC + average precision per category. Δ vs overall AUC
directly exposes which drug classes the GNN underperforms on, confirming the training-data bias.
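A simplified sketch of the grouping step, not the actual run_per_category_gnn_auc() code. It assumes that a test pair counts toward the category of each of its two drugs, skips categories that lack both classes, and omits average precision for brevity. All names are hypothetical.

```python
from collections import defaultdict


def auroc(labels, scores):
    """Pairwise AUROC with ties counted as half; fine for small examples."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def per_category_auc(pairs, labels, scores, category):
    """Return the overall AUROC and a per-category report with the delta."""
    overall = auroc(labels, scores)
    buckets = defaultdict(lambda: ([], []))
    for (a, b), y, s in zip(pairs, labels, scores):
        # Assumption: a pair contributes to the category of either drug.
        for cat in {category[a], category[b]}:
            buckets[cat][0].append(y)
            buckets[cat][1].append(s)
    report = {}
    for cat, (ys, ss) in buckets.items():
        if 0 < sum(ys) < len(ys):  # AUROC needs positives and negatives
            auc = auroc(ys, ss)
            report[cat] = {"auc": auc, "delta": auc - overall, "n": len(ys)}
    return overall, report


pairs = [("a", "b"), ("a", "c"), ("d", "e"), ("d", "f")]
category = {"a": "N", "b": "N", "c": "N", "d": "V", "e": "V", "f": "V"}
labels = [1, 0, 1, 0]
scores = [0.9, 0.2, 0.4, 0.6]
overall, report = per_category_auc(pairs, labels, scores, category)
print(overall, report)
```

A negative delta marks a category where the model ranks pairs worse than it does on the test set as a whole.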
Other mitigations
- Degree-stratified evaluation (low / medium / high degree buckets) is a natural next step.
- For production, track data freshness: interactions discovered after the DrugBank v5.1 snapshot are not captured.
RM3: Privacy / Data Leakage
Data Sources
All data originates from DrugBank Full Database v5.1 (CC BY-NC 4.0 licence). DrugBank is a publicly available curated database of approved drugs and their interactions.
| Data element | Source | Identifiability |
|---|---|---|
| Drug structures, properties | DrugBank XML | Public |
| DDI descriptions | DrugBank XML | Public |
| ATC codes, pathways, MeSH categories | DrugBank XML | Public |
| Text embeddings (PubMedBERT) | Computed from above | Derived; no new personal data |
| GNN model weights | Trained on DrugBank graph | Derived; no new personal data |
No patient data
The system contains zero patient records, zero clinical trial participant data, and zero electronic health records. There is no re-identification risk.
Data leakage in evaluation
The train/test split (data/evaluation/edge_split.npz) is performed on edges (DDI pairs),
not on drugs. Node features are available for all drugs in both train and test β this is the
standard transductive graph learning setting and is not leakage.
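An edge-level split can be sketched as below. The test fraction, seed, and function name are assumptions, and the real split is stored in edge_split.npz rather than computed this way.

```python
import random


def split_edges(edges, test_fraction=0.1, seed=0):
    """Split undirected edges into train/test; drugs may appear in both sets."""
    # Canonicalise (A, B) and (B, A) to one edge so no pair lands in both splits.
    unique = sorted({tuple(sorted(e)) for e in edges})
    rng = random.Random(seed)
    rng.shuffle(unique)
    n_test = max(1, int(len(unique) * test_fraction))
    return unique[n_test:], unique[:n_test]


edges = [("A", "B"), ("B", "A"), ("A", "C"), ("B", "C"), ("C", "D")]
train, test = split_edges(edges, test_fraction=0.25)
print(train, test)
```

Deduplicating before the shuffle matters: without it, the reversed copy of a training edge could end up in the test set, which would be genuine leakage.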
Negative pairs for evaluation are sampled uniformly at random from pairs not in DrugBank. Because DrugBank is incomplete (absence ≠ no interaction), some "negatives" may be undocumented true interactions. This is acknowledged as open-world assumption noise and is standard in the DDI literature.
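Uniform negative sampling can be implemented by rejection sampling, as in this sketch. The function name is hypothetical, and it assumes every positive pair involves two distinct drugs from the given list.

```python
import random


def sample_negatives(drugs, positives, n, seed=0):
    """Draw n distinct pairs uniformly from pairs not in the positive set."""
    known = {frozenset(p) for p in positives}
    pool = sorted(drugs)
    max_neg = len(pool) * (len(pool) - 1) // 2 - len(known)
    if n > max_neg:
        raise ValueError(f"only {max_neg} negative pairs exist")
    rng = random.Random(seed)
    negatives = set()
    while len(negatives) < n:
        pair = frozenset(rng.sample(pool, 2))  # two distinct drugs
        if pair not in known:  # reject documented interactions
            negatives.add(pair)
    return sorted(tuple(sorted(p)) for p in negatives)


drugs = ["A", "B", "C", "D"]
positives = [("A", "B"), ("C", "D")]
print(sample_negatives(drugs, positives, 4))
```

The sampler can only exclude documented pairs, so any undocumented true interaction remains eligible as a "negative", which is the label noise described above.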
Licence compliance
- DrugBank: CC BY-NC 4.0, academic / non-commercial use only.
- PubMedBERT (pritamdeka/S-PubMedBert-MS-MARCO): Apache 2.0.
- Groq API (llama-3.3-70b-versatile): usage subject to Groq ToS, free tier.
- No proprietary data is included or distributed.
RM4: Robustness / Distribution Shift
Script: pipeline/responsible_ml.py --section robustness
Output: data/evaluation/responsible_ml_robust.json
Live: /responsible page, RM4 section
Test battery (20 cases)
The resolve_drug() function (pipeline/ddi_query.py) is tested against realistic input variations.
| Category | Cases | Result |
|---|---|---|
| Case variants (lower, upper, mixed) | 4 | All PASS |
| Common non-brand synonyms (aspirin, adrenaline) | 2 | PASS |
| DrugBank ID input | 1 | PASS |
| Whitespace variants (trailing, leading) | 2 | PASS |
| 1-character typo (warrfarin) | 1 | PASS (correctly rejected) |
| Nonsense / empty / numeric | 3 | PASS (correctly rejected) |
| Brand names (Tylenol, Advil, Prozac, Lipitor) | 4 | PASS (via products.csv) |
| Drug class as input (anticoagulant) | 1 | PASS (correctly rejected) |
| Hydrochloride suffix | 1 | PASS (via products.csv) |
Pass rate: 100 % (20 of 20 cases pass; single-character typos and empty strings correctly rejected by design).
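The behaviour in the table is consistent with an exact-match resolver over a normalised synonym map. The sketch below illustrates that idea only and is not the actual resolve_drug() implementation. The miniature tables, the function name, and the acceptance of lower-case DrugBank IDs are assumptions.

```python
import re
from typing import Optional

# Hypothetical miniature synonym map: case-folded input -> canonical name.
# The real map is built from DrugBank synonyms and products.csv.
SYNONYMS = {
    "warfarin": "Warfarin",
    "aspirin": "Acetylsalicylic acid",
    "acetylsalicylic acid": "Acetylsalicylic acid",
    "adrenaline": "Epinephrine",
    "epinephrine": "Epinephrine",
    "tylenol": "Acetaminophen",
}
IDS = {"DB00682": "Warfarin"}  # illustrative single entry
DRUGBANK_ID = re.compile(r"^DB\d{5}$", re.IGNORECASE)


def resolve_drug_sketch(text: str) -> Optional[str]:
    """Resolve user input to a canonical name; None means rejected."""
    cleaned = text.strip()
    if not cleaned:
        return None
    if DRUGBANK_ID.match(cleaned):
        return IDS.get(cleaned.upper())
    # Exact match only: no fuzzy matching, so typos are rejected by design.
    return SYNONYMS.get(cleaned.casefold())


for query in ["  WARFARIN ", "aspirin", "Tylenol", "DB00682", "warrfarin", "banana", ""]:
    print(repr(query), "->", resolve_drug_sketch(query))
```

Returning None for anything that is not an exact key is what makes inputs such as "warrfarin", "banana", and "anticoagulant" fail closed.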
Key findings
Correctly handled:
- Case-insensitive matching for all canonical names
- Common INN synonyms (aspirin → Acetylsalicylic acid, adrenaline → Epinephrine)
- DrugBank IDs as primary keys
- Whitespace stripping
- Brand names (Tylenol, Advil, Prozac, Lipitor), resolved via data/step3_approved/products.csv (473,660 product entries across 4,109 drugs). The synonym map now covers INN names, DrugBank synonyms, and commercial brand names in a single lookup.
By design, not handled:
- Single-character typos are rejected (conservative design choice). In a clinical safety context a false positive ("banana" → some drug) is more dangerous than a false negative.
Distribution shift (data currency):
DrugBank v5.1 contains 4,795 approved drugs. Any drug approved after this snapshot will not
be found by resolve_drug(). This is an inherent data currency limitation, not a model failure.
The GNN flag (source: "gnn_predicted") only partially mitigates this: it can surface undocumented
interactions between drugs already in the snapshot, but not drugs added after it.
Generated by pipeline/responsible_ml.py.
Live pages: /responsible (all RM sections) · /results (model performance).