Responsible ML Analysis

DDI Checker — DrugBank interaction graph (4,795 drugs · 824,249 pairs)

We address all 4 required Responsible ML topics (RM1–RM4). RM2 and RM4 are fully automated; results are regenerated by pipeline/responsible_ml.py.

RM1 — Explainability

Dict Lookup (primary path)

The DrugBank dictionary lookup is fully interpretable by design. When a pair is found, the system returns the exact natural-language description stored in DrugBank:

"Acetylsalicylic acid may increase the anticoagulant activities of Warfarin resulting in
an increased risk of bleeding and haemorrhage."

No black-box reasoning is involved. The user can trace the result directly to its primary source.
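A minimal sketch of such a lookup, assuming the dictionary is keyed on unordered drug pairs so that (A, B) and (B, A) return the same entry. The names `DDI_TABLE` and `lookup_interaction` are hypothetical and not taken from the project code.

```python
from typing import Optional

# Hypothetical in-memory table; the real one would be built from DrugBank.
# Keys are unordered pairs, so the argument order does not matter.
DDI_TABLE = {
    frozenset({"Acetylsalicylic acid", "Warfarin"}): (
        "Acetylsalicylic acid may increase the anticoagulant activities of "
        "Warfarin resulting in an increased risk of bleeding and haemorrhage."
    ),
}


def lookup_interaction(drug_a: str, drug_b: str) -> Optional[str]:
    """Return the stored description verbatim, or None if the pair is absent."""
    return DDI_TABLE.get(frozenset({drug_a, drug_b}))
```

Because the description is returned unmodified, the output can be checked word for word against the source record.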

Logistic Regression (non-graph baseline — TM10G)

Logistic Regression produces coefficient-level feature attribution. The top features (by |coefficient|) are available in data/evaluation/baselines_results.json and displayed on the /results page. Key findings:

Feature prefix	Meaning	Top example
`prod_cat_*`	Both drugs share a pharmacological category	`prod_cat_central_nervous_system_depre` → +1.06
`diff_*`	Absolute difference in a continuous property	`diff_monoisotopic_mass` → +2.17
`prod_*` (continuous)	Element-wise product of continuous features	`prod_monoisotopic_mass` → −1.10

The prod_cat_* features are directly clinically meaningful: if both drugs are "CYP3A4 substrates", they compete for the same metabolic pathway, which can produce a pharmacokinetic interaction.
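The three feature families in the table can be sketched as follows. This is an illustration under assumptions: the drug record layout, the function names, and the property values are made up, and any scaling applied in the real pipeline is not shown. Only the three coefficients are taken from the table above.

```python
def pair_features(a: dict, b: dict, continuous: list, categories: list) -> dict:
    """Build symmetric pair features from two per-drug records."""
    feats = {}
    for name in continuous:
        feats[f"diff_{name}"] = abs(a[name] - b[name])   # absolute difference
        feats[f"prod_{name}"] = a[name] * b[name]        # element-wise product
    for cat in categories:
        # 1.0 only when both drugs carry the category flag
        both = cat in a["categories"] and cat in b["categories"]
        feats[f"prod_cat_{cat}"] = 1.0 if both else 0.0
    return feats


def rank_by_magnitude(coefficients: dict) -> list:
    """Sort (feature, coefficient) pairs by |coefficient|, largest first."""
    return sorted(coefficients.items(), key=lambda kv: abs(kv[1]), reverse=True)


# Coefficients quoted in the table above.
REPORTED = {
    "prod_cat_central_nervous_system_depre": 1.06,
    "diff_monoisotopic_mass": 2.17,
    "prod_monoisotopic_mass": -1.10,
}
```

All three families are symmetric in the two drugs, so swapping the order of a pair leaves the feature vector unchanged.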

GNN — feature ablation

SAGEConv is non-linear, so LR-style coefficient attribution does not apply directly. We use feature-group ablation and decoder component ablation to quantify what drives each prediction.

Feature ablation (warm AUROC drop when feature group is removed):

Group removed	AUROC	Drop
Nothing (full 980-dim)	0.9738	—
Structural features [0:212]	0.8993	−0.0745
PubMedBERT embeddings [212:980]	0.9661	−0.0077

Structural features dominate: physicochemical and pharmacological dimensions carry the strongest signal; BERT embeddings add a small but consistent improvement.
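A rough sketch of the ablation, assuming "removed" means the group's columns are zeroed at evaluation time (the source does not say whether columns are zeroed or dropped). The names `FEATURE_GROUPS`, `ablate`, and `auroc_drops` are hypothetical; the slice bounds and AUROC values come from the table above.

```python
# Column ranges of the 980-dim node feature vector, as given in the table.
FEATURE_GROUPS = {"structural": (0, 212), "pubmedbert": (212, 980)}

# AUROC values reported in the table.
REPORTED_AUROC = {"full": 0.9738, "no_structural": 0.8993, "no_pubmedbert": 0.9661}


def ablate(features: list, group: str) -> list:
    """Return a copy of the feature rows with one group's columns set to 0.0."""
    start, stop = FEATURE_GROUPS[group]
    return [row[:start] + [0.0] * (stop - start) + row[stop:] for row in features]


def auroc_drops(results: dict, baseline: str = "full") -> dict:
    """Difference between each ablated AUROC and the baseline AUROC."""
    return {
        name: round(value - results[baseline], 4)
        for name, value in results.items()
        if name != baseline
    }
```

Applied to the reported values, `auroc_drops(REPORTED_AUROC)` reproduces the Drop column: −0.0745 and −0.0077.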

CN pooling ablation (which decoder components contribute):

Decoder variant	AUROC	Drop
Full NCN pooling	0.9738	—
Remove shared DDI neighbours	0.9719	−0.0019
Remove shared protein targets	0.9717	−0.0021
Remove both (plain MLP)	0.9724	−0.0014

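All three ablated variants stay within 0.0021 AUROC of the full decoder, so on this split the pooling components contribute far less than the node features do. Full tables are in docs/model_architecture.md.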

Source transparency label

Every prediction carries a source field: "documented" (DrugBank verbatim hit), "gnn_predicted" (GNN score above threshold 0.43), or "not_found". This allows users to calibrate trust — a documented DDI includes a literature-backed description; a GNN-predicted one does not.
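The labelling rule can be sketched as below. The function name and signature are hypothetical, and the strict comparison is an assumption based on the wording "above threshold"; the source does not state how a score of exactly 0.43 is treated.

```python
from typing import Optional

GNN_THRESHOLD = 0.43  # threshold quoted in the text


def label_source(description: Optional[str], gnn_score: Optional[float]) -> str:
    """Pick the source label for one drug pair."""
    if description is not None:
        return "documented"        # verbatim DrugBank hit takes priority
    if gnn_score is not None and gnn_score > GNN_THRESHOLD:
        return "gnn_predicted"     # no documented entry, but the GNN flags it
    return "not_found"
```

Checking the documented path first means a GNN score can never override or relabel a DrugBank entry.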

"Why this was flagged" box (GNN predictions)

For every gnn_predicted result, the checker UI renders a human-readable evidence summary via gnn_predictor.explain(). The box surfaces:

Shared protein targets — e.g. "Shares 3 common protein targets: CYP2C9, CYP1A2, VKORC1"
Shared DDI neighbours — e.g. "12 common DDI neighbours — strong graph connectivity"
Structural overlap — shared CYP substrate classifications, molecular weight proximity

These reasons derive from the same graph signals that drive the NCN decoder (shared-neighbour pooling + node features), closing the explainability loop from model internals to user-visible output. Users never see a bare probability — they always see why.
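One way such an evidence summary could be assembled is with set intersections over per-drug target and neighbour sets. This is a sketch, not the actual gnn_predictor.explain() code: the function name `explain_pair`, the input dictionaries, and the drug identifiers are hypothetical.

```python
def explain_pair(drug_a, drug_b, targets, neighbours, max_listed=5):
    """Return human-readable reasons derived from shared graph signals."""
    reasons = []

    # Protein targets that both drugs act on.
    shared_targets = sorted(targets.get(drug_a, set()) & targets.get(drug_b, set()))
    if shared_targets:
        listed = ", ".join(shared_targets[:max_listed])
        reasons.append(f"Shares {len(shared_targets)} common protein targets: {listed}")

    # Drugs that interact with both members of the pair (the pair itself excluded).
    shared_nbrs = neighbours.get(drug_a, set()) & neighbours.get(drug_b, set())
    shared_nbrs -= {drug_a, drug_b}
    if shared_nbrs:
        reasons.append(f"{len(shared_nbrs)} common DDI neighbours")

    return reasons
```

An empty list signals that neither graph signal is present, in which case the UI would have to fall back to feature-level evidence such as the structural overlap item.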

RM2 — Bias / Fairness

Script: pipeline/responsible_ml.py --section bias
Output: data/evaluation/responsible_ml_bias.json
Live: /responsible page, RM2 section

Finding

DrugBank interaction data is heavily skewed toward certain drug classes:

Category	Mean Degree	Isolated %
NERVOUS SYSTEM (best)	864	3.7 %
CARDIOVASCULAR SYSTEM	725	4.1 %
DERMATOLOGICALS	274	29.5 %
VARIOUS (worst)	213	43.2 %

Coverage ratio: 4.1× in mean degree (864 for the best category vs. 213 for the worst).
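The per-category statistics can be computed along these lines. The sketch assumes each category maps to a list of node degrees and that isolated drugs (degree 0) count toward the mean; the function names and the input layout are hypothetical.

```python
def category_stats(degrees_by_category: dict) -> dict:
    """Mean degree and share of isolated (degree 0) drugs per category."""
    stats = {}
    for category, degrees in degrees_by_category.items():
        isolated = sum(1 for d in degrees if d == 0)
        stats[category] = {
            "mean_degree": sum(degrees) / len(degrees),
            "isolated_pct": 100.0 * isolated / len(degrees),
        }
    return stats


def coverage_ratio(stats: dict) -> float:
    """Highest mean degree divided by lowest mean degree."""
    means = [s["mean_degree"] for s in stats.values()]
    return max(means) / min(means)
```

With the reported means, 864 / 213 ≈ 4.06, which rounds to the 4.1× quoted above.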

Hub drugs

Ten drugs have > 1,700 documented interaction partners (e.g. Clozapine: 1,907). These hub drugs are largely CNS agents and CYP3A4 substrates / inhibitors. Their pharmacological profile is heavily over-represented in the training set.

Consequences for the GNN

Well-studied drugs (nervous system, cardiovascular): GNN link predictor has dense neighbourhood signals — high expected AUC.
Sparse categories (VARIOUS, DERMATOLOGICALS): many drugs have degree 0 or low degree. The GNN falls back to node features (similar to LR), not graph topology.
Cold-start drugs (novel drugs with no training edges): heuristics score 0; LR and GNN must rely entirely on node features.

Per-category GNN AUC (TM6 error analysis)

Script: pipeline/responsible_ml.py --section gnn_auc
Output: data/evaluation/responsible_ml_gnn_auc.json
Live: /responsible page, RM2 section

run_per_category_gnn_auc() loads the warm test split (edge_split.npz), scores every test pair via gnn_predictor.predict(), groups results by the drug's top-level ATC category (l4_name), and computes AUC-ROC + average precision per category. Δ vs overall AUC directly exposes which drug classes the GNN underperforms on — confirming the training-data bias.
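The grouping and Δ computation can be sketched in plain Python. This is not the code of run_per_category_gnn_auc(): it assumes each scored test pair has already been assigned to a category, computes AUROC with a simple pairwise (Mann-Whitney) count instead of a library call, and omits average precision. All names and the toy records are hypothetical.

```python
from collections import defaultdict


def auroc(labels, scores):
    """Probability that a random positive outscores a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None  # undefined when one class is missing
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def per_category_auroc(records):
    """records: iterable of (category, label, score). Returns (overall, per-category)."""
    records = list(records)
    overall = auroc([y for _, y, _ in records], [s for _, _, s in records])
    groups = defaultdict(list)
    for category, label, score in records:
        groups[category].append((label, score))
    result = {}
    for category, rows in groups.items():
        value = auroc([y for y, _ in rows], [s for _, s in rows])
        delta = None if value is None else value - overall
        result[category] = {"auroc": value, "delta": delta}
    return overall, result
```

A negative delta marks a category where ranking quality is below the overall figure. The pairwise count is quadratic in the number of pairs, so a rank-based implementation would be needed at the scale of the real test split.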

Other mitigations

Degree-stratified evaluation (low / medium / high degree buckets) is a natural next step.
For production, track data freshness: interactions discovered after the DrugBank v5.1 snapshot are not captured.

RM3 — Privacy / Data Leakage

Data Sources

All data originates from DrugBank Full Database v5.1 (CC BY-NC 4.0 licence). DrugBank is a publicly available curated database of approved drugs and their interactions.

Data element	Source	Identifiability
Drug structures, properties	DrugBank XML	Public
DDI descriptions	DrugBank XML	Public
ATC codes, pathways, MeSH categories	DrugBank XML	Public
Text embeddings (PubMedBERT)	Computed from above	Derived — no new personal data
GNN model weights	Trained on DrugBank graph	Derived — no new personal data

No patient data

The system contains zero patient records, zero clinical trial participant data, and zero electronic health records. There is no re-identification risk.

Data leakage in evaluation

The train/test split (data/evaluation/edge_split.npz) is performed on edges (DDI pairs), not on drugs. Node features are available for all drugs in both train and test — this is the standard transductive graph learning setting and is not leakage.

Negative pairs for evaluation are sampled uniformly at random from pairs not in DrugBank. Because DrugBank is incomplete (absence ≠ no interaction), some "negatives" may be undocumented true interactions. This is acknowledged as open-world assumption noise and is standard in the DDI literature.
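Uniform negative sampling over non-edges can be sketched with rejection sampling. The function name, the integer node indexing, and the seed handling are assumptions made for illustration; the project's own sampler may differ.

```python
import random


def sample_negatives(n_drugs, positive_pairs, k, seed=0):
    """Draw k distinct unordered pairs that are not documented interactions."""
    rng = random.Random(seed)
    known = {tuple(sorted(p)) for p in positive_pairs}
    max_negatives = n_drugs * (n_drugs - 1) // 2 - len(known)
    if k > max_negatives:
        raise ValueError("not enough undocumented pairs to sample from")

    negatives = set()
    while len(negatives) < k:
        a, b = rng.sample(range(n_drugs), 2)   # two distinct drug indices
        pair = (min(a, b), max(a, b))
        if pair not in known:                  # reject documented pairs
            negatives.add(pair)
    return sorted(negatives)
```

Note that the sampler only excludes documented pairs. It has no way to exclude true but undocumented interactions, which is exactly the label noise described above.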

Licence compliance

DrugBank: CC BY-NC 4.0 — academic / non-commercial use only.
PubMedBERT (pritamdeka/S-PubMedBert-MS-MARCO): Apache 2.0.
Groq API (llama-3.3-70b-versatile): usage subject to Groq ToS — free tier.
No proprietary data is included or distributed.

RM4 — Robustness / Distribution Shift

Script: pipeline/responsible_ml.py --section robustness
Output: data/evaluation/responsible_ml_robust.json
Live: /responsible page, RM4 section

Test battery (20 cases)

The resolve_drug() function (pipeline/ddi_query.py) is tested against realistic input variations.

Category	Cases	Result
Case variants (lower, upper, mixed)	4	All PASS
Common non-brand synonyms (aspirin, adrenaline)	2	PASS
DrugBank ID input	1	PASS
Whitespace variants (trailing, leading)	2	PASS
1-character typo (warrfarin)	1	PASS (correctly rejected)
Nonsense / empty / numeric	3	PASS (correctly rejected)
Brand names (Tylenol, Advil, Prozac, Lipitor)	4	PASS (via `products.csv`)
Drug class as input (anticoagulant)	1	PASS (correctly rejected)
Hydrochloride suffix	1	PASS (via `products.csv`)

Pass rate: 100 % (20 of 20 cases pass; single-character typos and empty strings correctly rejected by design).

Key findings

Correctly handled:

Case-insensitive matching for all canonical names
Common non-brand synonyms (aspirin → Acetylsalicylic acid, adrenaline → Epinephrine)
DrugBank IDs as primary keys
Whitespace stripping

Rejected by design:

Single-character typos are rejected (conservative design choice). In a clinical safety context, a false positive ("banana" → some drug) is more dangerous than a false negative.

Handled via the product table:

Brand names (Tylenol, Advil, Prozac, Lipitor) are resolved via data/step3_approved/products.csv (473,660 product entries across 4,109 drugs). The synonym map now covers INN names, DrugBank synonyms, and commercial brand names in a single lookup.
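The behaviour described in these findings amounts to normalising the input and then doing an exact-match lookup with no fuzzy matching. The sketch below illustrates that idea only; it is not the resolve_drug() implementation, and the tables are tiny hypothetical stand-ins for the real synonym map.

```python
from typing import Optional

# Hypothetical stand-ins for the real lookup tables (keys are lowercase).
SYNONYMS = {
    "warfarin": "Warfarin",
    "acetylsalicylic acid": "Acetylsalicylic acid",
    "aspirin": "Acetylsalicylic acid",
    "epinephrine": "Epinephrine",
    "adrenaline": "Epinephrine",
}
DRUGBANK_IDS = {"DB00682": "Warfarin"}


def resolve_drug_sketch(text) -> Optional[str]:
    """Resolve user input to a canonical name, or None if there is no exact match."""
    if not isinstance(text, str):
        return None
    cleaned = " ".join(text.split())       # trim and collapse whitespace
    if not cleaned:
        return None
    if cleaned.upper() in DRUGBANK_IDS:    # DrugBank ID input
        return DRUGBANK_IDS[cleaned.upper()]
    return SYNONYMS.get(cleaned.lower())   # case-insensitive exact match only
```

Because there is no edit-distance step, "warrfarin" and "banana" both return None, which matches the conservative rejection behaviour in the test battery.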

Distribution shift (data currency): DrugBank v5.1 contains 4,795 approved drugs. Any drug approved after this snapshot will not be found by resolve_drug(). This is an inherent data currency limitation, not a model failure. The GNN path (source: "gnn_predicted") addresses a related gap, undocumented interactions between drugs already in the graph, but it cannot score a drug that is missing from the snapshot.

Generated by pipeline/responsible_ml.py. Live pages: /responsible (all RM sections) · /results (model performance).