	# MuLGIT-Perturb: Perturbation Layer Extension
	## From Static Biomarker Ranking to Counterfactual Molecular State Prediction

	Status: Design Specification — Study Plan
	Repository: https://huggingface.co/vedatonuryilmaz/MuLGIT
	Date: 2026-05-09

	---

	## 1. Motivation: What Static Models Miss

	MuLGIT currently identifies candidate causal features through their correlation with survival/longevity and ranks drugs by static properties (MOA, clinical status, drug-likeness). Correlation is not causation, though: a drug ranked highly by MOA relevance may actually worsen the molecular state of a given patient. What is missing is the ability to answer:

	> "What will this perturbation do to this patient's molecular state?"

	Moving beyond static ranking to counterfactual prediction transforms MuLGIT from a descriptive tool into a prescriptive one — useful for drug repurposing, mechanism inference, target validation, and personalized treatment-response prioritization.

	---

	## 2. Capability Specification

	### Input
	1. Baseline omics state — the patient/cell's molecular profile before perturbation:
	   - DNA: methylation β-values, CNV copy numbers
	   - RNA: mRNA expression, miRNA expression
	   - (Future) Chromatin: WGBS, ATAC-seq GeneActivity
	2. Perturbation descriptor — what is being applied:
	   - Drug: SMILES string → molecular fingerprint or ChemBERTa embedding
	   - Genetic perturbation: gene ID + perturbation type (CRISPR KO, CRISPRi, OE, shRNA)
	   - Drug combination: set of drug fingerprints + dose ratios

	### Output
	1. Predicted molecular state shift (Δ) — per-gene or per-pathway expression change:
	   - `ŷ_post = f(baseline, perturbation)` → predicted post-perturbation expression
	   - `Δ = ŷ_post - baseline` → direction and magnitude of change per gene
	2. Uncertainty quantification — per-gene prediction intervals:
	   - Epistemic uncertainty via deep ensembles (3-5 models with different seeds)
	   - Aleatoric uncertainty via predicted variance (heteroscedastic output head)
	3. Mechanistic attribution — why this perturbation causes this shift:
	   - Gene set activation scores (MSigDB pathways, Reactome, KEGG)
	   - Top differentially expressed genes (DEGs) with effect sizes
	   - Perturbation-conditioned subgraph of activated molecular interactions (from biological knowledge graphs)
	   - Counterfactual decomposition: "If we blocked pathway X, the effect on gene Y would change by Z"

	---

	## 3. Architecture: MuLGIT-Perturb

	### 3.1 Design Principles

	1. Reuse MuLGIT encoders — the CentralDogmaFusion SNN encoders for methylation, CNV, mRNA, miRNA are already trained on TCGA survival prediction. These become frozen feature extractors for the perturbation layer, providing biologically-grounded latent representations.

	2. Perturbation-conditioned residual prediction — the model predicts the difference between pre- and post-perturbation molecular states, not the absolute state. This is the standard in the field (CPA, GEARS, Lingshu-Cell all use delta prediction).

	3. Central dogma as structural prior — the perturbation flows through the same biological layers: drug → chromatin state → DNA methylation → mRNA expression → miRNA → predicted phenotype shift.

	4. Uncertainty via deep ensembles: train 3-5 identical-architecture models with different random seeds. The variance of their predictions across ensemble members estimates epistemic uncertainty. Deep ensembles are a simple and widely validated approach (Lakshminarayanan et al., NeurIPS 2017).

	5. Chemical structure encoding — use MolFormer or ChemBERTa-2 (pretrained, frozen) to encode drug SMILES into fixed-size embeddings, replacing the simple 1D-CNN in the current `drug_target.py`.

	### 3.2 Architecture Diagram

	```
	                   ┌────────────────────────────────────┐
	                   │        PERTURBATION ENCODER        │
	                   │  ┌──────────┐     ┌─────────────┐  │
	Drug SMILES ──────►│  │ MolFormer│     │ Geneformer  │  │◄── Genetic Pert.
	                   │  │ (frozen) │     │ (frozen)    │  │
	                   │  └────┬─────┘     └──────┬──────┘  │
	                   │       │                  │         │
	                   │  ┌────▼──────────────────▼──────┐  │
	                   │  │ Fusion Gate (learned α)      │  │
	                   │  │ z_pert = α·z_drug +          │  │
	                   │  │          (1-α)·z_gene        │  │
	                   │  └───────────────┬──────────────┘  │
	                   └──────────────────┼─────────────────┘
	                                      │ z_pert ∈ R^768
	                                      │
	┌─────────────────────────────────────┼─────────────────────────────┐
	│                 MULGIT (frozen feature extractor)                 │
	│                                                                   │
	│  BASELINE STATE                                                   │
	│  Methylation ─► SNN Encoder ─► z_meth                             │
	│  CNV         ─► SNN Encoder ─► z_cnv                              │
	│  mRNA        ─► SNN Encoder ─► z_mrna                             │
	│  miRNA       ─► SNN Encoder ─► z_mirna                            │
	│                                                                   │
	│  CentralDogmaFusion                                               │
	│    z_dna = [z_meth, z_cnv]                                        │
	│    z_dna_rna = DNA→RNA(z_dna, z_mrna)                             │
	│    z_fused = FullFusion(z_dna_rna, z_mrna, z_mirna)               │
	│                    │                                              │
	└────────────────────┼──────────────────────────────────────────────┘
	                     │ z_fused ∈ R^48 (baseline latent state)
	                     │
	┌────────────────────▼──────────────────────────────────────────────┐
	│             PERTURBATION RESPONSE PREDICTOR (trained)             │
	│                                                                   │
	│  ┌─────────────────────────────────────────────────────────────┐  │
	│  │ Condition Fusion                                            │  │
	│  │ z_cond = FiLM(z_fused, z_pert)                              │  │
	│  │ (Feature-wise Linear Modulation: scale + shift from         │  │
	│  │  perturbation embedding applied to baseline state)          │  │
	│  └───────────────────────────┬─────────────────────────────────┘  │
	│                              │                                    │
	│  ┌───────────────────────────▼─────────────────────────────────┐  │
	│  │ Delta Decoder (SNNStack)                                    │  │
	│  │ z_cond ─► [1024] ─► [512] ─► [256] ─► [128] ─► Δ            │  │
	│  │                                         └─► σ² (UQ)         │  │
	│  │ Δ  ∈ R^G (per-gene logFC prediction)                        │  │
	│  │ σ² ∈ R^G (per-gene predicted variance)                      │  │
	│  └───────────────────────────┬─────────────────────────────────┘  │
	│                              │                                    │
	│                      ŷ_post = y_baseline + Δ                      │
	│                                                                   │
	└──────────────────────────────┬────────────────────────────────────┘
	                               │
	┌──────────────────────────────▼────────────────────────────────────┐
	│                           OUTPUT LAYER                            │
	│                                                                   │
	│  ┌──────────────┐  ┌──────────────┐  ┌─────────────────────────┐  │
	│  │ Δ per gene   │  │ σ² per gene  │  │ Pathway Enrichment      │  │
	│  │ (logFC)      │  │ (variance)   │  │ (GSEA on Δ)             │  │
	│  └──────────────┘  └──────────────┘  └─────────────────────────┘  │
	│                                                                   │
	│  ┌─────────────────────────────────────────────────────────────┐  │
	│  │ Mechanistic Attribution                                     │  │
	│  │ ┌─────────────────┐  ┌──────────────────────────────────┐   │  │
	│  │ │ Integrated      │  │ Perturbation-Conditioned         │   │  │
	│  │ │ Gradients on Δ  │  │ Knowledge Subgraph               │   │  │
	│  │ │ (gene→pathway   │  │ (AdaPert-style: GNN over         │   │  │
	│  │ │  attribution)   │  │  GO/Reactome/STRING + drug)      │   │  │
	│  │ └─────────────────┘  └──────────────────────────────────┘   │  │
	│  └─────────────────────────────────────────────────────────────┘  │
	└───────────────────────────────────────────────────────────────────┘
	```

	### 3.3 Key Components

	#### 3.3.1 PerturbationEncoder

	Encodes both drug and genetic perturbations into a unified embedding:
	- Drug: SMILES → MolFormer (frozen, 768-dim) or ChemBERTa-2
	- Genetic: Gene symbol → Geneformer gene embedding (frozen)
	- Fusion: Learned convex combination `z_pert = α·z_drug + (1-α)·z_gene` with `α ∈ [0,1]`
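
The fusion gate can be illustrated with a minimal, dependency-free sketch. The spec does not say how `α` is parameterized or how the two embeddings are brought to the same width, so the sigmoid gate, the `project` helper, and the toy weight matrix `W_gene` below are assumptions for illustration only; a real implementation would use learned tensors.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def project(vec, weight):
    """Linear map; `weight` holds one row per output dimension (hypothetical helper)."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]

def fuse(z_drug, z_gene, gate_logit):
    """Convex combination z_pert = a*z_drug + (1-a)*z_gene with a = sigmoid(gate_logit)."""
    if len(z_drug) != len(z_gene):
        raise ValueError("project both embeddings to a shared width first")
    alpha = sigmoid(gate_logit)  # the sigmoid keeps alpha inside (0, 1)
    return [alpha * d + (1.0 - alpha) * g for d, g in zip(z_drug, z_gene)]

# Toy example: a 2-dim "gene" embedding projected to the 3-dim "drug" width.
z_drug = [1.0, 0.0, 2.0]
z_gene_raw = [0.5, -1.0]
W_gene = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # assumed projection weights
z_gene = project(z_gene_raw, W_gene)
z_pert = fuse(z_drug, z_gene, gate_logit=0.0)  # gate_logit=0 gives alpha=0.5
```

A convex combination only makes sense when both embeddings live in the same space, which is why the sketch refuses mismatched widths instead of silently truncating.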

	#### 3.3.2 FiLM Conditioning Layer

	Feature-wise Linear Modulation applies perturbation as scale + shift to baseline state:
	```
	z_cond = γ(z_pert) ⊙ z_fused + β(z_pert)
	```
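	FiLM lets the perturbation interact multiplicatively with each baseline feature, which concatenation followed by a single linear layer cannot do, so it is well suited to conditioning a state representation.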
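
As a rough illustration of the formula above, the following sketch computes `γ` and `β` as linear functions of `z_pert` and applies them elementwise to `z_fused`. The linear parameterization, the identity initialization, and all weights are assumptions; the spec only fixes the form `γ(z_pert) ⊙ z_fused + β(z_pert)`.

```python
def linear(vec, weight, bias):
    """Affine map with one weight row and one bias entry per output dimension."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weight, bias)]

def film(z_fused, z_pert, w_gamma, b_gamma, w_beta, b_beta):
    """z_cond = gamma(z_pert) * z_fused + beta(z_pert), applied per feature."""
    gamma = linear(z_pert, w_gamma, b_gamma)
    beta = linear(z_pert, w_beta, b_beta)
    return [g * z + b for g, z, b in zip(gamma, z_fused, beta)]

z_fused = [0.2, -1.0, 3.0]   # toy baseline latent state
z_pert = [1.0, 2.0]          # toy perturbation embedding
zeros = [[0.0, 0.0] for _ in z_fused]

# With zero weights, gamma bias 1 and beta bias 0, FiLM is the identity,
# so an untrained layer would pass the baseline state through unchanged.
z_identity = film(z_fused, z_pert, zeros, [1.0] * 3, zeros, [0.0] * 3)

# Non-zero gamma weights let the perturbation rescale individual features.
w_gamma = [[0.5, 0.0], [0.0, 0.5], [0.0, 0.0]]
z_cond = film(z_fused, z_pert, w_gamma, [1.0] * 3, zeros, [0.1] * 3)
```

Starting from the identity is one common way to keep a frozen backbone's representation intact at the beginning of training; whether MuLGIT-Perturb does so is not stated in the plan.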

	#### 3.3.3 DeltaDecoderWithUncertainty

	Predicts per-gene logFC (Δ) and per-gene prediction variance (σ²):
	- Shared SNNStack: z_cond → [1024] → [512] → [256] → [128]
	- Δ head: Linear(128, n_genes) → predicted expression change
	- σ² head: Linear(128, n_genes) → Softplus → predicted variance (>0)

	#### 3.3.4 MuLGITPerturbEnsemble

	K=3-5 identical models trained with different seeds:
	- Epistemic uncertainty = variance of the predicted Δ across ensemble members
	- Aleatoric uncertainty = mean of the predicted σ² across ensemble members
	- Total uncertainty = epistemic + aleatoric (per gene)
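
The decomposition above can be sketched per gene as follows. The function name and the nested-list layout (`deltas[k][g]` for member `k` and gene `g`) are assumptions; the population variance is used so that the sum matches the law of total variance for an equally weighted ensemble.

```python
from statistics import fmean, pvariance

def decompose_uncertainty(deltas, variances):
    """Per-gene ensemble summary from K member predictions.

    deltas[k][g]    : predicted logFC of member k for gene g
    variances[k][g] : predicted variance (sigma^2) of member k for gene g
    """
    n_genes = len(deltas[0])
    summary = []
    for g in range(n_genes):
        member_deltas = [d[g] for d in deltas]
        epistemic = pvariance(member_deltas)           # disagreement between members
        aleatoric = fmean([v[g] for v in variances])   # average predicted noise
        summary.append({
            "delta": fmean(member_deltas),
            "epistemic": epistemic,
            "aleatoric": aleatoric,
            "total": epistemic + aleatoric,
        })
    return summary

# Toy ensemble with K=3 members and G=2 genes.
deltas = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]
variances = [[0.1, 0.2], [0.2, 0.2], [0.3, 0.2]]
summary = decompose_uncertainty(deltas, variances)
```

In the toy data the members disagree on gene 0 but agree on gene 1, so only gene 0 has non-zero epistemic uncertainty while both genes carry aleatoric uncertainty.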

	---

	## 4. Data Strategy

	### 4.1 Primary: Tahoe-100M

	Status: ✅ Ready on HF Hub
	Path: `tahoebio/Tahoe-100M`

	| Property | Value |
	|----------|-------|
	| Rows | 95.6M drug-gene perturbation observations |
	| Drugs | ~1,100 (with canonical SMILES, PubChem CID, MOA) |
	| Cell lines | 50 cancer cell lines |
	| Genes | 45,134 (full transcriptome) |
	| Matched pre/post | Yes (vehicle controls per cell line) |

	### 4.2 Supplementary Datasets

	| Dataset | Perturbation Type | Samples | HF Hub | Use |
	|---------|-------------------|---------|--------|-----|
	| GDSC | Drug (251) | 251K IC50 | ❌ | IC50 validation |
	| DepMap | Genetic (CRISPR KO) | 1,000+ cell lines | ❌ | Genetic pert. validation |
	| LINCS L1000 | Drug + genetic | 1.3M profiles | ❌ | Cross-dataset benchmark |
	| Norman19 | Genetic combos | 91K cells | Partial | Combinatorial benchmark |

	### 4.3 Data Loading for Training

	```python
	from datasets import load_dataset

	# Drug metadata
	drug_meta = load_dataset("tahoebio/Tahoe-100M", "drug_metadata", split="train")

	# Expression data (streaming for 95.6M rows)
	expr_ds = load_dataset("tahoebio/Tahoe-100M", "expression_data",
	                       split="train", streaming=True)

	# Each row: genes (int64[]), expressions (float32[]), drug, cell_line_id,
	# canonical_smiles, pubchem_cid, moa-fine, sample
	```
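
To train on Δ, each treated profile has to be compared against the vehicle controls of the same cell line. The sketch below shows one possible way to turn the parallel `genes` / `expressions` arrays of a row into a dense vector and compute a per-gene shift. It runs on toy dictionaries rather than the real dataset, and it rests on several assumptions that are not in the plan: gene IDs are used directly as dense indices (real rows would need a mapping through the gene metadata), values are treated as counts, and library-size normalization followed by `log1p` is used as the expression scale.

```python
import math

def to_dense(gene_ids, values, n_genes):
    """Scatter parallel (gene_ids, values) arrays into a dense vector of length n_genes."""
    dense = [0.0] * n_genes
    for gid, val in zip(gene_ids, values):
        dense[gid] = float(val)
    return dense

def log_normalize(counts, scale=1e4):
    """Library-size normalize, then log1p (an assumed, commonly used convention)."""
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    return [math.log1p(c / total * scale) for c in counts]

def delta_vs_control(treated_row, control_rows, n_genes):
    """Per-gene shift of one treated profile against the mean control profile."""
    def profile(row):
        return log_normalize(to_dense(row["genes"], row["expressions"], n_genes))

    treated = profile(treated_row)
    controls = [profile(r) for r in control_rows]
    baseline = [sum(col) / len(controls) for col in zip(*controls)]
    return [t - b for t, b in zip(treated, baseline)]

# Toy rows with 4 genes; the control rows stand in for vehicle-treated cells.
control_rows = [{"genes": [0, 1], "expressions": [5, 5]},
                {"genes": [0, 1], "expressions": [5, 5]}]
treated_row = {"genes": [0, 2], "expressions": [5, 5]}
delta = delta_vs_control(treated_row, control_rows, n_genes=4)
```

In the toy example gene 1 is lost and gene 2 is gained under treatment, so their shifts have opposite signs while genes 0 and 3 stay at zero.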

	---

	## 5. Training Objective

	### 5.1 Primary Loss: Negative Log-Likelihood with Uncertainty

	```
	L_NLL = 0.5 · Σ_g [ log(σ²_g) + (Δ_true,g − Δ_pred,g)² / σ²_g ]
	```
	Heteroscedastic regression loss (Kendall & Gal, NeurIPS 2017). When σ²_g is large, the model is uncertain about gene g and its squared-error term is down-weighted, while the log(σ²_g) term penalizes inflating the variance everywhere.
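
A small numeric sketch makes the trade-off in this loss concrete. It evaluates the formula above on plain Python lists; the `eps` floor and the numerically stable `softplus` helper (mirroring the σ² head of Section 3.3.3) are assumptions added for safety, not part of the spec.

```python
import math

def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x)); maps a raw head output to a positive variance."""
    return math.log1p(math.exp(-abs(x))) + max(x, 0.0)

def gaussian_nll(delta_true, delta_pred, var_pred, eps=1e-6):
    """0.5 * sum_g [ log(var_g) + (true_g - pred_g)^2 / var_g ]."""
    total = 0.0
    for t, p, v in zip(delta_true, delta_pred, var_pred):
        v = max(v, eps)  # assumed floor to avoid log(0) and division by zero
        total += math.log(v) + (t - p) ** 2 / v
    return 0.5 * total

# One gene with a squared error of 1.0: the loss is smallest when the
# predicted variance matches the squared error.
nll_matched = gaussian_nll([1.0], [0.0], [1.0])
nll_overconfident = gaussian_nll([1.0], [0.0], [0.25])
nll_underconfident = gaussian_nll([1.0], [0.0], [4.0])
```

Both claiming too little variance (0.25) and too much (4.0) cost more than the matched value, which is what keeps the model from inflating σ² to hide its errors.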

	### 5.2 Auxiliary Losses

	1. Pathway consistency: `L_path = MSE(GSEA(Δ_pred), GSEA(Δ_true))`, so that pathways shift coherently
	2. Sparsity: `L_sparse = ||Δ_pred||₁ / (B·G)` (B = batch size, G = number of genes), so that only genes that actually respond shift
	3. Risk shift (optional): `L_risk = MSE(risk_post − risk_baseline, risk_shift_target)`, which ties the prediction back to survival
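
The first two auxiliary terms can be sketched as below. One caveat drives the design: the rank-based GSEA statistic is not differentiable, so a trainable version of `L_path` would need a smooth stand-in. The mean Δ per gene set used here is such a stand-in and is an assumption, as are the function names and the toy gene sets.

```python
def sparsity_loss(delta_pred_batch):
    """Mean absolute predicted shift: ||delta_pred||_1 / (B * G)."""
    b = len(delta_pred_batch)
    g = len(delta_pred_batch[0])
    return sum(abs(x) for row in delta_pred_batch for x in row) / (b * g)

def pathway_scores(delta, gene_sets):
    """Mean shift per gene set (assumed differentiable surrogate for a GSEA score)."""
    return {name: sum(delta[i] for i in idx) / len(idx) for name, idx in gene_sets.items()}

def pathway_consistency_loss(delta_pred, delta_true, gene_sets):
    """MSE between predicted and observed pathway scores."""
    pred = pathway_scores(delta_pred, gene_sets)
    true = pathway_scores(delta_true, gene_sets)
    return sum((pred[k] - true[k]) ** 2 for k in gene_sets) / len(gene_sets)

# Toy example: 4 genes grouped into two hypothetical gene sets.
gene_sets = {"autophagy": [0, 1], "mtor_signaling": [2, 3]}
delta_true = [1.0, 1.0, -1.0, -1.0]
delta_pred = [1.0, 0.0, -1.0, -1.0]

l_path = pathway_consistency_loss(delta_pred, delta_true, gene_sets)
l_sparse = sparsity_loss([delta_pred])
```

The prediction misses one autophagy gene, so only that pathway contributes to `l_path`; the relative weights of these terms against the NLL are not given in the plan and would have to be tuned.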

	### 5.3 Hyperparameters

	| Parameter | Value | Source |
	|-----------|-------|--------|
	| Optimizer | AdamW (β₁=0.9, β₂=0.95, wd=0.1) | Lingshu-Cell |
	| LR schedule | 2e-4 → 2e-5, cosine | Lingshu-Cell |
	| Batch size | 256 | Lingshu-Cell |
	| Mixed precision | bf16 | Lingshu-Cell |
	| MuLGIT backbone | Frozen | Transfer from TCGA |
	| Ensemble size | K=3 models | Lakshminarayanan 2017 |
	| Epochs | 50-100 (early stopping on validation NLL) | Standard |
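
The learning-rate row can be read as a cosine decay from 2e-4 to 2e-5. A minimal sketch of that schedule follows; the function name is hypothetical, and no warmup is included because the table does not mention one.

```python
import math

def cosine_lr(step: int, total_steps: int, lr_max: float = 2e-4, lr_min: float = 2e-5) -> float:
    """Cosine decay from lr_max at step 0 to lr_min at total_steps."""
    progress = min(max(step / total_steps, 0.0), 1.0)  # clamp to [0, 1]
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

lr_start = cosine_lr(0, 1000)
lr_mid = cosine_lr(500, 1000)
lr_end = cosine_lr(1000, 1000)
```

Steps past `total_steps` are clamped, so a run extended beyond the planned length would simply stay at the minimum rate.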

	---

	## 6. Benchmark Tasks & Leakage Controls

	### 6.1 Task Suite

	| Task | Data | Split | Key Metric |
	|------|------|-------|------------|
	| Drug response prediction | Tahoe-100M | Leave-drug-out (20%) | DES@50, Pearson-Δ |
	| Genetic perturbation | DepMap/Norman19 | Leave-gene-out (20%) | DES@50, Direction-match |
	| Combinatorial pert. | Norman19 | Leave-combo-out (70%) | PDS, compositionality |
	| Drug sensitivity (IC50) | GDSC | Leave-cell-line-out | Pearson r, Spearman ρ |
	| Pathway rescue | Tahoe-100M (aging) | Leave-pathway-out | AUC of rescue |
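
All of these splits hold out whole groups (drugs, genes, combinations, cell lines) rather than individual rows, which is what prevents a held-out drug from leaking into training through another cell line. One way to get a reproducible group split without storing an index is to hash the group key, as sketched below. The salt string, the `drug` field name, and the function names are assumptions, and hashing yields only approximately the requested fraction.

```python
import hashlib

def is_held_out(key: str, holdout_fraction: float = 0.2, salt: str = "mulgit-perturb") -> bool:
    """Deterministically map a group key to the held-out set via a hash bucket in [0, 1)."""
    digest = hashlib.sha256(f"{salt}:{key}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < holdout_fraction

def split_rows(rows, key_field="drug", holdout_fraction=0.2):
    """Send every row of a held-out group to test, all other rows to train."""
    train, test = [], []
    for row in rows:
        target = test if is_held_out(row[key_field], holdout_fraction) else train
        target.append(row)
    return train, test

# Toy rows: 50 hypothetical drugs, each profiled in 3 cell lines.
rows = [{"drug": f"drug_{d}", "cell_line_id": f"CL{c}"} for d in range(50) for c in range(3)]
train, test = split_rows(rows)
train_drugs = {r["drug"] for r in train}
test_drugs = {r["drug"] for r in test}
```

Because the decision depends only on the key, the same function works on a streamed dataset. Splitting on a drug name can still leak if one compound appears under several names, so keying on a canonical structure identifier may be the stricter choice.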

	### 6.2 Metrics

	| Metric | What It Measures | Range | Target |
	|--------|-----------------|-------|--------|
	| DES@K | Fraction of true DEGs in top-K predicted | [0,1] | Higher |
	| Pearson-Δ | Correlation of predicted vs true Δ | [-1,1] | Higher |
	| Direction-match | Fraction of genes with correct sign | [0,1] | Higher |
	| PDS | Discrimination between perturbations | [0,1] | Higher |
	| MMD | Distributional fidelity in PCA space | [0,∞) | Lower |
	| RMSE | Raw expression reconstruction error | [0,∞) | Lower |
	---

	## 7. Comparison to Existing Baselines

	| Method | Gene Coverage | Drug Input | UQ | Attribution | HF-Ready |
	|--------|--------------|------------|-----|-------------|----------|
	| CPA | HVG (1-4K) | One-hot ID | Latent sampling | ❌ | ❌ |
	| GEARS | All genes | One-hot ID | ❌ | Gene graph edges | ❌ |
	| Lingshu-Cell | All 18K | One-hot ID | ❌ | ❌ | ❌ |
	| AdaPert | HVG only | Semantic embed | ❌ | ✅ KG subgraph | ❌ |
	| MuLGIT-Perturb | All 45K+ | SMILES→MolFormer | ✅ Ensemble | ✅ IG + KG | ✅ |

	---

	## 8. Implementation Plan

	### 8.1 New Files

	```
	mulgit/
	├── perturb/
	│   ├── __init__.py
	│   ├── encoder.py       # PerturbationEncoder (drug + gene)
	│   ├── conditioning.py  # FiLMConditioning
	│   ├── decoder.py       # DeltaDecoderWithUncertainty
	│   ├── model.py         # MuLGITPerturb
	│   ├── ensemble.py      # MuLGITPerturbEnsemble
	│   ├── attribution.py   # Integrated Gradients + KG subgraph
	│   ├── losses.py        # NLL + pathway consistency + sparsity
	│   ├── data.py          # Tahoe-100M perturbation data loader
	│   ├── trainer.py       # Training loop with Trackio
	│   ├── evaluate.py      # DES@K, PDS, Pearson-Δ, etc.
	│   └── config.py        # MuLGITPerturbConfig

	### 8.2 Five-Phase Execution

	| Phase | Duration | Deliverable |
	|-------|----------|-------------|
	| 1. Tahoe-100M drug response | 2 weeks | Trained model + all benchmarks on leave-drug-out |
	| 2. Cross-dataset generalization | 1 week | Zero-shot evaluation on LINCS, fine-tune if needed |
	| 3. Genetic perturbations | 1 week | DepMap KO prediction, compare to Geneformer |
	| 4. Combinatorial perturbation | 1 week | Norman19 combos, PDS evaluation |
	| 5. Longevity-specific | 1 week | Aging pathway rescue, drug ranking vs static screener |

	Total: 6 weeks from start to full validation.

	---

	## 9. Example Output

	```json
	{
	  "perturbation": "Rapamycin",
	  "cell_line": "A549",
	  "baseline_risk": 0.342,
	  "predicted_post_risk": -0.187,
	  "risk_shift": -0.529,

	  "top_degs": [
	    {"gene": "MTOR", "logFC": -0.23, "ci_95": [-0.45, -0.01]},
	    {"gene": "FOXO3", "logFC": 0.87, "ci_95": [0.52, 1.22]},
	    {"gene": "ATG5", "logFC": 1.34, "ci_95": [0.98, 1.70]}
	  ],

	  "enriched_pathways": [
	    {"pathway": "Autophagy", "NES": 2.1, "pval": 0.0003},
	    {"pathway": "mTOR signaling", "NES": -1.8, "pval": 0.001}
	  ],

	  "aging_signature_reversal": 0.72,
	  "epistemic_uncertainty": 0.08,
	  "aleatoric_uncertainty": 0.15
	}
	```
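
The plan does not state how the `ci_95` bounds are derived. One plausible construction, sketched here under a Gaussian assumption, is `logFC ± 1.96·sqrt(epistemic + aleatoric)` using per-gene variances. The variance values below are hypothetical and were chosen only so that the interval has roughly the width shown for FOXO3; the last line checks that `risk_shift` equals `predicted_post_risk − baseline_risk` in the example.

```python
import math

def ci_95(log_fc: float, epistemic_var: float, aleatoric_var: float):
    """Gaussian 95% interval from the total per-gene predictive variance (assumed form)."""
    half_width = 1.96 * math.sqrt(epistemic_var + aleatoric_var)
    return (log_fc - half_width, log_fc + half_width)

# Hypothetical per-gene variances for FOXO3 (not taken from the plan).
lo, hi = ci_95(0.87, epistemic_var=0.012, aleatoric_var=0.020)

# Consistency of the risk fields in the example output.
risk_shift = round(-0.187 - 0.342, 3)
```

If the reported uncertainties are standard deviations rather than variances, they would need to be squared before summing; the example output does not say which convention applies.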

	---

	## 10. Risk Assessment

	| Risk | Mitigation |
	|------|------------|
	| Tahoe-100M streaming slow (95.6M rows) | Local cache after first pass; pre-filter to 10M most informative rows |
	| MolFormer fails for some SMILES | Fallback to Morgan fingerprints (RDKit already in deps) |
	| MuLGIT backbone overfits to TCGA | Include tissue type as conditioning; evaluate zero-shot on LINCS |
	| GPU memory for 45K-gene output | Start with L1000 landmarks (978); scale after validation |
	| Combinatorial too hard | Treat Phase 4 as a stretch goal; single-perturbation alone is impactful |

	---

	## References

	1. Lotfollahi et al., "CPA," arXiv:2011.03086 (2021)
	2. Roohani et al., "GEARS," Nature Biotechnology (2023), arXiv:2205.13986
	3. Chen et al., "Lingshu-Cell," arXiv:2603.25240 (2025)
	4. Li et al., "AdaPert," arXiv:2602.18885 (2025)
	5. He et al., "PerturBench," arXiv:2408.10609 (2024)
	6. El Nahhas et al., "SeNMo," arXiv:2405.08226 (2024)
	7. Ross et al., "MolFormer," Nature Machine Intelligence (2022)
	8. Theodoris et al., "Geneformer," Nature (2023)
	9. Perez et al., "FiLM," AAAI (2018)
	10. Kendall & Gal, "Bayesian UQ," NeurIPS (2017)
	11. Lakshminarayanan et al., "Deep Ensembles," NeurIPS (2017)
	12. Mendez-Lucio et al., "Tahoe-100M," bioRxiv (2024)