
Responsible ML Analysis

DDI Checker: DrugBank interaction graph (4,795 drugs · 824,249 pairs)

We address all 4 required Responsible ML topics (RM1–RM4). RM2 and RM4 are fully automated; results are regenerated by pipeline/responsible_ml.py.


RM1: Explainability

Dict Lookup (primary path)

The DrugBank dictionary lookup is fully interpretable by design. When a pair is found, the system returns the exact natural-language description stored in DrugBank:

"Acetylsalicylic acid may increase the anticoagulant activities of Warfarin resulting in
an increased risk of bleeding and haemorrhage."

No black-box reasoning is involved. The user can trace the result directly to its primary source.
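A minimal sketch of such a lookup, assuming pairs are stored order-independently. The INTERACTIONS table and the lookup_interaction name are hypothetical and are not taken from the pipeline code.

```python
from typing import Optional

# Hypothetical in-memory store: unordered drug pair -> DrugBank description.
# The single entry is the example quoted in this document.
INTERACTIONS = {
    frozenset({"Acetylsalicylic acid", "Warfarin"}): (
        "Acetylsalicylic acid may increase the anticoagulant activities of "
        "Warfarin resulting in an increased risk of bleeding and haemorrhage."
    ),
}


def lookup_interaction(drug_a: str, drug_b: str) -> Optional[str]:
    """Return the stored description verbatim, or None if the pair is absent."""
    # A frozenset key makes (A, B) and (B, A) resolve to the same entry.
    return INTERACTIONS.get(frozenset({drug_a, drug_b}))
```

Because the function returns the stored text unchanged, every hit can be checked against the DrugBank record it came from.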

Logistic Regression (non-graph baseline, TM10G)

Logistic Regression produces coefficient-level feature attribution. The top features (by |coefficient|) are available in data/evaluation/baselines_results.json and displayed on the /results page. Key findings:

| Feature prefix | Meaning | Top example |
|---|---|---|
| prod_cat_* | Both drugs share a pharmacological category | prod_cat_central_nervous_system_depre → +1.06 |
| diff_* | Absolute difference in a continuous property | diff_monoisotopic_mass → +2.17 |
| prod_* (continuous) | Element-wise product of continuous features | prod_monoisotopic_mass → −1.10 |

The prod_cat_* features are directly clinically meaningful: if both drugs are "CYP3A4 substrates", they compete for the same metabolic pathway, which is a common cause of pharmacokinetic interactions.
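The three feature families can be sketched as below. This is an illustration under assumptions: the pair_features helper, the per-drug dictionaries and the property values are hypothetical, and the real pipeline may scale or encode features differently.

```python
def pair_features(a: dict, b: dict, continuous: list, categories: list) -> dict:
    """Build symmetric pair features from two per-drug feature dicts."""
    feats = {}
    for name in continuous:
        # diff_*: absolute difference in a continuous property.
        feats[f"diff_{name}"] = abs(a[name] - b[name])
        # prod_*: product of the same continuous property for both drugs.
        feats[f"prod_{name}"] = a[name] * b[name]
    for cat in categories:
        # prod_cat_*: 1 only when both drugs carry the category flag.
        feats[f"prod_cat_{cat}"] = a[cat] * b[cat]
    return feats


# Toy values for illustration only, not DrugBank data.
drug_a = {"monoisotopic_mass": 180.0, "cyp3a4_substrate": 1}
drug_b = {"monoisotopic_mass": 308.0, "cyp3a4_substrate": 1}
features = pair_features(drug_a, drug_b, ["monoisotopic_mass"], ["cyp3a4_substrate"])
```

All three operations are symmetric, so the feature vector does not depend on the order in which the two drugs are given.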

GNN: feature ablation

SAGEConv is non-linear, so LR-style coefficient attribution does not apply directly. We use feature-group ablation and decoder component ablation to quantify what drives each prediction.

Feature ablation (warm AUROC drop when feature group is removed):

| Group removed | AUROC | Drop |
|---|---|---|
| Nothing (full 980-dim) | 0.9738 | n/a |
| Structural features [0:212] | 0.8993 | −0.0745 |
| PubMedBERT embeddings [212:980] | 0.9661 | −0.0077 |

Structural features dominate: physicochemical and pharmacological dimensions carry the strongest signal; BERT embeddings add a small but consistent improvement.
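A hedged sketch of how one feature group can be removed, assuming "removed" means zeroing the corresponding slice of the 980-dimensional node vector at evaluation time. The pipeline may instead retrain without the group; the ablate and auroc_drop helpers are hypothetical.

```python
# Slice boundaries as listed in the ablation table.
STRUCTURAL = slice(0, 212)
TEXT_EMBEDDING = slice(212, 980)


def ablate(vector, group):
    """Return a copy of the node vector with one feature group zeroed."""
    out = list(vector)
    out[group] = [0.0] * len(out[group])
    return out


def auroc_drop(full_auroc, ablated_auroc):
    """Signed change in AUROC relative to the full model."""
    return round(ablated_auroc - full_auroc, 4)


node_vector = [1.0] * 980
without_structural = ablate(node_vector, STRUCTURAL)
```

For example, auroc_drop(0.9738, 0.8993) reproduces the −0.0745 reported for the structural group.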

CN pooling ablation (which decoder components contribute):

| Decoder variant | AUROC | Drop |
|---|---|---|
| Full NCN pooling | 0.9738 | n/a |
| Remove shared DDI neighbours | 0.9719 | −0.0019 |
| Remove shared protein targets | 0.9717 | −0.0021 |
| Remove both (plain MLP) | 0.9724 | −0.0014 |

Full tables are in docs/model_architecture.md.

Source transparency label

Every prediction carries a source field: "documented" (DrugBank verbatim hit), "gnn_predicted" (GNN score above threshold 0.43), or "not_found". This allows users to calibrate trust: a documented DDI includes a literature-backed description; a GNN-predicted one does not.
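The labelling rule can be sketched as follows. The source_label function is hypothetical, and treating a score exactly equal to 0.43 as below the threshold is an assumption based on the wording "above threshold".

```python
from typing import Optional

GNN_THRESHOLD = 0.43  # threshold stated in this document


def source_label(description: Optional[str], gnn_score: Optional[float]) -> str:
    """Map a lookup result and an optional GNN score to a source label."""
    if description is not None:
        # A verbatim DrugBank hit always takes precedence over the model.
        return "documented"
    if gnn_score is not None and gnn_score > GNN_THRESHOLD:
        return "gnn_predicted"
    return "not_found"
```

Checking the dictionary hit first means a documented pair is never relabelled as a model prediction.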

"Why this was flagged" box (GNN predictions)

For every gnn_predicted result, the checker UI renders a human-readable evidence summary via gnn_predictor.explain(). The box surfaces:

  • Shared protein targets, e.g. "Shares 3 common protein targets: CYP2C9, CYP1A2, VKORC1"
  • Shared DDI neighbours, e.g. "12 common DDI neighbours, strong graph connectivity"
  • Structural overlap: shared CYP substrate classifications, molecular weight proximity

These reasons derive from the same graph signals that feed the NCN decoder (shared-neighbour pooling + node features), closing the explainability loop from model internals to user-visible output. Users never see a bare probability; they always see why.
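A rough sketch of how such an evidence summary could be assembled from set intersections. It is not the actual gnn_predictor.explain() implementation; the explain_pair name, its arguments and the exact wording of the messages are assumptions.

```python
def explain_pair(targets_a, targets_b, neighbours_a, neighbours_b):
    """Build human-readable reasons from shared targets and shared neighbours."""
    reasons = []

    shared_targets = sorted(set(targets_a) & set(targets_b))
    if shared_targets:
        noun = "target" if len(shared_targets) == 1 else "targets"
        reasons.append(
            f"Shares {len(shared_targets)} common protein {noun}: "
            + ", ".join(shared_targets)
        )

    shared_neighbours = set(neighbours_a) & set(neighbours_b)
    if shared_neighbours:
        noun = "neighbour" if len(shared_neighbours) == 1 else "neighbours"
        reasons.append(f"{len(shared_neighbours)} common DDI {noun}")

    return reasons


reasons = explain_pair(
    ["CYP2C9", "CYP1A2", "VKORC1"],
    ["VKORC1", "CYP1A2", "CYP2C9", "CYP3A4"],
    ["drug_x", "drug_y"],
    ["drug_y", "drug_z"],
)
```

An empty list signals that no shared-neighbour evidence exists, in which case the UI would have to fall back to feature-based reasons.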


RM2: Bias / Fairness

Script: pipeline/responsible_ml.py --section bias · Output: data/evaluation/responsible_ml_bias.json · Live: /responsible page, RM2 section

Finding

DrugBank interaction data is heavily skewed toward certain drug classes:

| Category | Mean degree | Isolated % |
|---|---|---|
| NERVOUS SYSTEM (best) | 864 | 3.7 % |
| CARDIOVASCULAR SYSTEM | 725 | 4.1 % |
| DERMATOLOGICALS | 274 | 29.5 % |
| VARIOUS (worst) | 213 | 43.2 % |

Coverage ratio: 4.1× (mean degree of the best vs. the worst category, 864 / 213).
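The per-category statistics can be computed along these lines. This is a sketch that assumes the coverage ratio is the ratio of mean degrees; the category_coverage and coverage_ratio helpers and their input dictionaries are hypothetical.

```python
from collections import defaultdict


def category_coverage(degrees, categories):
    """degrees: drug -> number of partners; categories: drug -> category name."""
    grouped = defaultdict(list)
    for drug, category in categories.items():
        # Drugs with no documented interaction count as degree 0 (isolated).
        grouped[category].append(degrees.get(drug, 0))
    return {
        category: {
            "mean_degree": sum(values) / len(values),
            "isolated_pct": 100.0 * sum(1 for v in values if v == 0) / len(values),
        }
        for category, values in grouped.items()
    }


def coverage_ratio(stats):
    """Best mean degree divided by worst (undefined if the worst is 0)."""
    means = [s["mean_degree"] for s in stats.values()]
    return max(means) / min(means)


# Toy graph for illustration only.
toy_stats = category_coverage(
    {"a": 4, "c": 1, "d": 1},
    {"a": "X", "b": "X", "c": "Y", "d": "Y"},
)
```

With the table values, 864 / 213 rounds to the reported 4.1×.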

Hub drugs

Ten drugs have > 1,700 documented interaction partners (e.g. Clozapine: 1,907). These hub drugs are largely CNS agents and CYP3A4 substrates / inhibitors. Their pharmacological profile is heavily over-represented in the training set.

Consequences for the GNN

  1. Well-studied drugs (nervous system, cardiovascular): GNN link predictor has dense neighbourhood signals, so AUC is expected to be high.
  2. Sparse categories (VARIOUS, DERMATOLOGICALS): many drugs have degree 0 or low degree. The GNN falls back to node features (similar to LR), not graph topology.
  3. Cold-start drugs (novel drugs with no training edges): heuristics score 0; LR and GNN must rely entirely on node features.

Per-category GNN AUC (TM6 error analysis)

Script: pipeline/responsible_ml.py --section gnn_auc · Output: data/evaluation/responsible_ml_gnn_auc.json · Live: /responsible page, RM2 section

run_per_category_gnn_auc() loads the warm test split (edge_split.npz), scores every test pair via gnn_predictor.predict(), groups results by the drug's top-level ATC category (l4_name), and computes AUC-ROC + average precision per category. The Δ vs. overall AUC shows which drug classes the GNN underperforms on, making the effect of the training-data bias measurable.
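The grouping and Δ computation can be sketched in plain Python. This is not the code of run_per_category_gnn_auc(); the row layout (category, label, score) is an assumption, and the pairwise AUC below is a slow reference version of what a library routine such as scikit-learn's roc_auc_score computes.

```python
from collections import defaultdict


def auc_roc(labels, scores):
    """Probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None  # AUC is undefined without both classes
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def per_category_auc(rows):
    """rows: list of (category, label, score) for scored test pairs."""
    overall = auc_roc([r[1] for r in rows], [r[2] for r in rows])
    grouped = defaultdict(list)
    for category, label, score in rows:
        grouped[category].append((label, score))
    result = {}
    for category, items in grouped.items():
        auc = auc_roc([y for y, _ in items], [s for _, s in items])
        known = auc is not None and overall is not None
        result[category] = {"auc": auc, "delta": auc - overall if known else None}
    return overall, result


# Toy scores for illustration only.
overall, by_category = per_category_auc(
    [("A", 1, 0.9), ("A", 0, 0.1), ("B", 1, 0.4), ("B", 0, 0.6)]
)
```

A negative delta marks a category where the model ranks pairs worse than it does on the test set as a whole.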

Other mitigations

  • Degree-stratified evaluation (low / medium / high degree buckets) is a natural next step.
  • For production, track data freshness: interactions discovered after the DrugBank v5.1 snapshot are not captured.
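For the degree-stratified evaluation mentioned above, one possible bucketing is sketched here. The cut-offs (10 and 100) and the choice to bucket a pair by its lower-degree drug are illustrative assumptions, not values from the pipeline.

```python
def degree_bucket(degree, low_max=10, high_min=100):
    """Assign a drug to a low / medium / high degree bucket."""
    if degree <= low_max:
        return "low"
    if degree >= high_min:
        return "high"
    return "medium"


def pair_bucket(degree_a, degree_b, low_max=10, high_min=100):
    """Bucket a test pair by its sparser endpoint."""
    return degree_bucket(min(degree_a, degree_b), low_max, high_min)
```

Bucketing by the sparser endpoint keeps a pair made of one hub drug and one isolated drug in the "low" bucket, which is where the graph signal is weakest.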

RM3: Privacy / Data Leakage

Data Sources

All data originates from DrugBank Full Database v5.1 (CC BY-NC 4.0 licence). DrugBank is a publicly available curated database of approved drugs and their interactions.

| Data element | Source | Identifiability |
|---|---|---|
| Drug structures, properties | DrugBank XML | Public |
| DDI descriptions | DrugBank XML | Public |
| ATC codes, pathways, MeSH categories | DrugBank XML | Public |
| Text embeddings (PubMedBERT) | Computed from above | Derived; no new personal data |
| GNN model weights | Trained on DrugBank graph | Derived; no new personal data |

No patient data

The system contains zero patient records, zero clinical trial participant data, and zero electronic health records. There is no re-identification risk.

Data leakage in evaluation

The train/test split (data/evaluation/edge_split.npz) is performed on edges (DDI pairs), not on drugs. Node features are available for all drugs in both train and test; this is the standard transductive graph learning setting and is not leakage.

Negative pairs for evaluation are sampled uniformly at random from pairs not in DrugBank. Because DrugBank is incomplete (absence ≠ no interaction), some "negatives" may be undocumented true interactions. This is acknowledged as open-world assumption noise and is standard in the DDI literature.
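Uniform negative sampling can be sketched with rejection sampling. The sample_negatives function, its arguments and the fixed seed are hypothetical; the pipeline's own sampler may differ.

```python
import random


def sample_negatives(drugs, positives, k, seed=0):
    """Draw k distinct unordered pairs that are not documented interactions."""
    rng = random.Random(seed)
    known = {frozenset(pair) for pair in positives}
    # Assumes every positive pair is made of drugs from `drugs`.
    available = len(drugs) * (len(drugs) - 1) // 2 - len(known)
    if k > available:
        raise ValueError("not enough undocumented pairs to sample from")
    negatives = set()
    while len(negatives) < k:
        a, b = rng.sample(drugs, 2)  # two distinct drugs, uniformly at random
        pair = frozenset((a, b))
        if pair not in known:  # reject documented interactions
            negatives.add(pair)
    return negatives


toy_negatives = sample_negatives(["A", "B", "C", "D"], [("A", "B")], k=3)
```

Nothing in this procedure can tell an undocumented true interaction from a true negative, which is the source of the label noise described above.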

Licence compliance

  • DrugBank: CC BY-NC 4.0 (academic / non-commercial use only).
  • PubMedBERT (pritamdeka/S-PubMedBert-MS-MARCO): Apache 2.0.
  • Groq API (llama-3.3-70b-versatile): usage subject to Groq ToS (free tier).
  • No proprietary data is included or distributed.

RM4: Robustness / Distribution Shift

Script: pipeline/responsible_ml.py --section robustness · Output: data/evaluation/responsible_ml_robust.json · Live: /responsible page, RM4 section

Test battery (20 cases)

The resolve_drug() function (pipeline/ddi_query.py) is tested against realistic input variations.

| Category | Cases | Result |
|---|---|---|
| Case variants (lower, upper, mixed) | 4 | All PASS |
| Common non-brand synonyms (aspirin, adrenaline) | 2 | PASS |
| DrugBank ID input | 1 | PASS |
| Whitespace variants (trailing, leading) | 2 | PASS |
| 1-character typo (warrfarin) | 1 | PASS (correctly rejected) |
| Nonsense / empty / numeric | 3 | PASS (correctly rejected) |
| Brand names (Tylenol, Advil, Prozac, Lipitor) | 4 | PASS (via products.csv) |
| Drug class as input (anticoagulant) | 1 | PASS (correctly rejected) |
| Hydrochloride suffix | 1 | PASS (via products.csv) |

Pass rate: 100 % (20 of 20 cases pass; single-character typos and empty strings correctly rejected by design).

Key findings

Correctly handled:

  • Case-insensitive matching for all canonical names
  • Common non-brand synonyms (aspirin → Acetylsalicylic acid, adrenaline → Epinephrine)
  • DrugBank IDs as primary keys
  • Whitespace stripping
  • Brand names (Tylenol, Advil, Prozac, Lipitor), resolved via data/step3_approved/products.csv (473,660 product entries across 4,109 drugs). The synonym map now covers INN names, DrugBank synonyms, and commercial brand names in a single lookup.

Not handled, by design:

  • Single-character typos are rejected (conservative design choice). In a clinical safety context a false positive ("banana" → some drug) is more dangerous than a false negative.
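The matching behaviour described in these findings can be sketched as a normalise-then-exact-match lookup. This is not the code of resolve_drug() in pipeline/ddi_query.py; the NAME_INDEX table and the resolve_drug_sketch name are hypothetical toy stand-ins for the real synonym map.

```python
from typing import Optional

# Hypothetical, tiny synonym map: lower-cased name or ID -> DrugBank ID.
NAME_INDEX = {
    "warfarin": "DB00682",
    "acetylsalicylic acid": "DB00945",
    "aspirin": "DB00945",  # non-brand synonym
    "db00682": "DB00682",  # IDs are accepted as input too
    "db00945": "DB00945",
}


def resolve_drug_sketch(text) -> Optional[str]:
    """Resolve user input to a DrugBank ID, or None if there is no exact match."""
    if not isinstance(text, str):
        return None
    key = text.strip().lower()  # whitespace- and case-insensitive
    if not key:
        return None
    # Exact match only: no fuzzy matching, so typos are rejected.
    return NAME_INDEX.get(key)
```

Keeping the lookup exact is what makes "warrfarin" fail instead of silently resolving to a nearby name.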

Distribution shift (data currency): DrugBank v5.1 contains 4,795 approved drugs. Any drug approved after this snapshot will not be found by resolve_drug(). This is an inherent data currency limitation, not a model failure. The GNN path (source: "gnn_predicted") does not cover drugs missing from the snapshot; it partially mitigates the related gap of undocumented interactions between drugs that are already in the graph.


Generated by pipeline/responsible_ml.py. Live pages: /responsible (all RM sections) · /results (model performance).