MuLGIT-Perturb: Perturbation Layer Extension

From Static Biomarker Ranking to Counterfactual Molecular State Prediction

Status: Design Specification — Study Plan
Repository: https://huggingface.co/vedatonuryilmaz/MuLGIT
Date: 2026-05-09

1. Motivation: What Static Models Miss

MuLGIT currently identifies candidate causal features that correlate with survival/longevity and ranks drugs by their static properties (MOA, clinical status, drug-likeness). But correlation is not causation: a drug ranked highly by MOA relevance may still worsen the molecular state of a given patient. What is missing is the ability to answer:

"What will this perturbation do to this patient's molecular state?"

Moving beyond static ranking to counterfactual prediction transforms MuLGIT from a descriptive tool into a prescriptive one — useful for drug repurposing, mechanism inference, target validation, and personalized treatment-response prioritization.

2. Capability Specification

Input

Baseline omics state — the patient/cell's molecular profile before perturbation:
- DNA: methylation β-values, CNV copy numbers
- RNA: mRNA expression, miRNA expression
- (Future) Chromatin: WGBS, ATAC-seq GeneActivity
Perturbation descriptor — what is being applied:
- Drug: SMILES string → molecular fingerprint or ChemBERTa embedding
- Genetic perturbation: gene ID + perturbation type (CRISPR KO, CRISPRi, OE, shRNA)
- Drug combination: set of drug fingerprints + dose ratios

Output

Predicted molecular state shift (Δ) — per-gene or per-pathway expression change:
- ŷ_post = f(baseline, perturbation) → predicted post-perturbation expression
- Δ = ŷ_post - baseline → direction and magnitude of change per gene
Uncertainty quantification — per-gene prediction intervals:
- Epistemic uncertainty via deep ensembles (3-5 models with different seeds)
- Aleatoric uncertainty via predicted variance (heteroscedastic output head)
Mechanistic attribution — why this perturbation causes this shift:
- Gene set activation scores (MSigDB pathways, Reactome, KEGG)
- Top differentially expressed genes (DEGs) with effect sizes
- Perturbation-conditioned subgraph of activated molecular interactions (from biological knowledge graphs)
- Counterfactual decomposition: "If we blocked pathway X, the effect on gene Y would change by Z"

3. Architecture: MuLGIT-Perturb

3.1 Design Principles

Reuse MuLGIT encoders: the CentralDogmaFusion SNN encoders for methylation, CNV, mRNA, and miRNA are already trained on TCGA survival prediction. They become frozen feature extractors for the perturbation layer, providing biologically grounded latent representations.
Perturbation-conditioned residual prediction: the model predicts the difference between pre- and post-perturbation molecular states, not the absolute state. Delta prediction is a common formulation in the field (e.g., CPA, GEARS, Lingshu-Cell).
Central dogma as structural prior: the perturbation flows through the same biological layers: drug → chromatin state → DNA methylation → mRNA expression → miRNA → predicted phenotype shift.
Uncertainty via deep ensembles: train 3-5 identical-architecture models with different random seeds. The variance of the predictions across ensemble members estimates epistemic uncertainty. Deep ensembles are a simple and widely used approach to predictive uncertainty (Lakshminarayanan et al., NeurIPS 2017).
Chemical structure encoding: use MolFormer or ChemBERTa-2 (pretrained, frozen) to encode drug SMILES into fixed-size embeddings, replacing the simple 1D-CNN in the current drug_target.py.

3.2 Architecture Diagram

                        ┌──────────────────────────────────┐
                        │   PERTURBATION ENCODER            │
                        │   ┌──────────┐   ┌─────────────┐ │
     Drug SMILES ──────►│   │ MolFormer│   │ Geneformer  │◄── Genetic Pert.
                        │   │  (frozen)│   │  (frozen)   │ │
                        │   └────┬─────┘   └──────┬──────┘ │
                        │        │                 │        │
                        │   ┌────▼─────────────────▼──────┐ │
                        │   │   Fusion Gate (learned α)    │ │
                        │   │   z_pert = α·z_drug +        │ │
                        │   │           (1-α)·z_gene       │ │
                        │   └─────────────┬────────────────┘ │
                        └─────────────────┼──────────────────┘
                                          │ z_pert ∈ R^768
                                          │
    ┌─────────────────────────────────────┼─────────────────────────────┐
    │                    MULGIT (frozen feature extractor)               │
    │                                                                   │
    │  BASELINE STATE                                                   │
    │  Methylation ─► SNN Encoder ─► z_meth                             │
    │  CNV         ─► SNN Encoder ─► z_cnv                              │
    │  mRNA        ─► SNN Encoder ─► z_mrna                             │
    │  miRNA       ─► SNN Encoder ─► z_mirna                            │
    │                                                                   │
    │  CentralDogmaFusion                                               │
    │  z_dna = [z_meth, z_cnv]                                          │
    │  z_dna_rna = DNA→RNA(z_dna, z_mrna)                               │
    │  z_fused = FullFusion(z_dna_rna, z_mrna, z_mirna)                 │
    │                    │                                              │
    └────────────────────┼──────────────────────────────────────────────┘
                         │ z_fused ∈ R^48 (baseline latent state)
                         │
    ┌────────────────────▼──────────────────────────────────────────────┐
    │              PERTURBATION RESPONSE PREDICTOR (trained)             │
    │                                                                    │
    │  ┌─────────────────────────────────────────────────────────┐      │
    │  │  Condition Fusion                                       │      │
    │  │  z_cond = FiLM(z_fused, z_pert)                        │      │
    │  │  (Feature-wise Linear Modulation: scale + shift from    │      │
    │  │   perturbation embedding applied to baseline state)     │      │
    │  └────────────────────────────┬───────────────────────────┘      │
    │                               │                                    │
    │  ┌────────────────────────────▼───────────────────────────┐      │
    │  │  Delta Decoder (SNNStack)                               │      │
    │  │  z_cond ─► [1024] ─► [512] ─► [256] ─► [128] ─► Δ     │      │
    │  │                                          └─► σ² (UQ)   │      │
    │  │  Δ ∈ R^G (per-gene logFC prediction)                   │      │
    │  │  σ² ∈ R^G (per-gene predicted variance)                │      │
    │  └────────────────────────────┬───────────────────────────┘      │
    │                               │                                    │
    │  ŷ_post = y_baseline + Δ                                         │
    │                                                                    │
    └───────────────────────────────────────────────────────────────────┘
                                   │
    ┌──────────────────────────────▼────────────────────────────────────┐
    │                     OUTPUT LAYER                                   │
    │                                                                    │
    │  ┌──────────────┐  ┌──────────────┐  ┌──────────────────────────┐│
    │  │   Δ per gene │  │   σ² per gene│  │  Pathway Enrichment      ││
    │  │  (logFC)      │  │  (variance)  │  │  (GSEA on Δ)            ││
    │  └──────────────┘  └──────────────┘  └──────────────────────────┘│
    │                                                                    │
    │  ┌──────────────────────────────────────────────────────────────┐ │
    │  │  Mechanistic Attribution                                     │ │
    │  │  ┌─────────────────┐  ┌──────────────────────────────────┐   │ │
    │  │  │ Integrated       │  │ Perturbation-Conditioned        │   │ │
    │  │  │ Gradients on Δ   │  │ Knowledge Subgraph              │   │ │
    │  │  │ (gene→pathway    │  │ (AdaPert-style: GNN over        │   │ │
    │  │  │  attribution)    │  │  GO/Reactome/STRING + drug)     │   │ │
    │  │  └─────────────────┘  └──────────────────────────────────┘   │ │
    │  └──────────────────────────────────────────────────────────────┘ │
    └───────────────────────────────────────────────────────────────────┘

3.3 Key Components

3.3.1 PerturbationEncoder

Encodes both drug and genetic perturbations into a unified embedding:

Drug: SMILES → MolFormer (frozen, 768-dim) or ChemBERTa-2
Genetic: Gene symbol → Geneformer gene embedding (frozen)
Fusion: Learned convex combination z_pert = α·z_drug + (1-α)·z_gene with α ∈ [0,1]
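
The gated fusion above can be sketched in PyTorch as follows. This is a minimal illustration, not the repository's implementation: the class name `PerturbationFusionGate`, the 256-dim gene embedding width, the linear projections to a shared 768-dim space, and the scalar sigmoid gate are all assumptions. The projections are included because a convex combination is only defined when both embeddings share a width.

```python
import torch
import torch.nn as nn


class PerturbationFusionGate(nn.Module):
    """Hypothetical sketch: z_pert = alpha * z_drug + (1 - alpha) * z_gene."""

    def __init__(self, drug_dim: int = 768, gene_dim: int = 256, out_dim: int = 768):
        super().__init__()
        # Project both (frozen, precomputed) embeddings to a shared width.
        self.drug_proj = nn.Linear(drug_dim, out_dim)
        self.gene_proj = nn.Linear(gene_dim, out_dim)
        # Unconstrained scalar; the sigmoid keeps alpha strictly inside (0, 1).
        self.alpha_logit = nn.Parameter(torch.zeros(1))

    def forward(self, z_drug: torch.Tensor, z_gene: torch.Tensor) -> torch.Tensor:
        alpha = torch.sigmoid(self.alpha_logit)
        return alpha * self.drug_proj(z_drug) + (1.0 - alpha) * self.gene_proj(z_gene)


gate = PerturbationFusionGate()
z_drug = torch.randn(4, 768)  # stand-in for frozen drug embeddings
z_gene = torch.randn(4, 256)  # stand-in for frozen gene embeddings (width assumed)
z_pert = gate(z_drug, z_gene)
```

A per-sample gate, with α computed from the inputs, would be an alternative; the scalar version is simply the smallest form consistent with α ∈ [0,1].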

3.3.2 FiLM Conditioning Layer

Feature-wise Linear Modulation applies perturbation as scale + shift to baseline state:

z_cond = γ(z_pert) ⊙ z_fused + β(z_pert)

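Unlike concatenation, which only appends the perturbation as extra input features, FiLM lets the perturbation rescale (γ) and shift (β) every dimension of the baseline state, which suits conditioning a fixed state representation.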
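
A minimal PyTorch sketch of this modulation, assuming z_fused ∈ R^48 and z_pert ∈ R^768 as in the architecture diagram. The single linear layer that produces both γ and β, and the identity initialisation (γ = 1, β = 0), are illustrative choices that the plan does not specify.

```python
import torch
import torch.nn as nn


class FiLMConditioning(nn.Module):
    """Sketch of z_cond = gamma(z_pert) * z_fused + beta(z_pert)."""

    def __init__(self, state_dim: int = 48, pert_dim: int = 768):
        super().__init__()
        # One linear map emits gamma (first half) and beta (second half).
        self.to_gamma_beta = nn.Linear(pert_dim, 2 * state_dim)
        # Assumed initialisation: gamma = 1 and beta = 0, so the layer
        # starts as the identity on the baseline state.
        nn.init.zeros_(self.to_gamma_beta.weight)
        with torch.no_grad():
            self.to_gamma_beta.bias.zero_()
            self.to_gamma_beta.bias[:state_dim] = 1.0

    def forward(self, z_fused: torch.Tensor, z_pert: torch.Tensor) -> torch.Tensor:
        gamma, beta = self.to_gamma_beta(z_pert).chunk(2, dim=-1)
        return gamma * z_fused + beta


film = FiLMConditioning()
z_fused = torch.randn(4, 48)
z_pert = torch.randn(4, 768)
z_cond = film(z_fused, z_pert)
```

With this initialisation an untrained layer passes the baseline state through unchanged, so training starts from a "no perturbation effect" prediction.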

3.3.3 DeltaDecoderWithUncertainty

Predicts per-gene logFC (Δ) and per-gene prediction variance (σ²):

Shared SNNStack: z_cond → [1024] → [512] → [256] → [128]
Δ head: Linear(128, n_genes) → predicted expression change
σ² head: Linear(128, n_genes) → Softplus → predicted variance (>0)
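
The decoder described above might look like the following PyTorch sketch. It assumes an SNN layer is Linear + SELU + AlphaDropout, a 48-dim conditioned input, and 978 output genes (the L1000 landmark count mentioned in the risk assessment); the dropout rate and the small variance floor are assumptions for numerical stability.

```python
import torch
import torch.nn as nn


class DeltaDecoderWithUncertainty(nn.Module):
    """Sketch: shared SNN trunk with a delta head and a variance head."""

    def __init__(self, in_dim=48, n_genes=978, hidden=(1024, 512, 256, 128), dropout=0.1):
        super().__init__()
        layers, prev = [], in_dim
        for width in hidden:
            # Self-normalizing block: Linear -> SELU -> AlphaDropout.
            layers += [nn.Linear(prev, width), nn.SELU(), nn.AlphaDropout(dropout)]
            prev = width
        self.trunk = nn.Sequential(*layers)
        self.delta_head = nn.Linear(prev, n_genes)
        # Softplus keeps the predicted variance positive.
        self.var_head = nn.Sequential(nn.Linear(prev, n_genes), nn.Softplus())

    def forward(self, z_cond: torch.Tensor):
        h = self.trunk(z_cond)
        delta = self.delta_head(h)
        var = self.var_head(h) + 1e-6  # assumed floor to avoid log(0) in the NLL
        return delta, var


decoder = DeltaDecoderWithUncertainty()
delta, var = decoder(torch.randn(4, 48))
```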

3.3.4 MuLGITPerturbEnsemble

K=3-5 identical models trained with different seeds:

Epistemic uncertainty = variance of Δ across ensemble
Aleatoric uncertainty = mean σ² from each model
Total uncertainty = epistemic + aleatoric
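
The decomposition above can be written directly in NumPy. This sketch assumes each ensemble member returns a per-gene Δ and σ² for the same input; the function name and the Gaussian 95% interval (mean ± 1.96·√total) are illustrative assumptions.

```python
import numpy as np


def ensemble_uncertainty(deltas, variances):
    """Combine K member predictions, each of shape (G,), into per-gene uncertainty."""
    deltas = np.asarray(deltas, dtype=float)        # shape (K, G)
    variances = np.asarray(variances, dtype=float)  # shape (K, G)
    mean_delta = deltas.mean(axis=0)
    epistemic = deltas.var(axis=0)       # disagreement between members
    aleatoric = variances.mean(axis=0)   # average predicted noise
    total = epistemic + aleatoric
    half_width = 1.96 * np.sqrt(total)   # assumes an approximately Gaussian predictive
    return {
        "delta": mean_delta,
        "epistemic": epistemic,
        "aleatoric": aleatoric,
        "total": total,
        "ci_95": np.stack([mean_delta - half_width, mean_delta + half_width], axis=-1),
    }


# Toy example: K = 3 members, G = 2 genes.
out = ensemble_uncertainty(
    deltas=[[0.9, -0.2], [1.1, -0.2], [1.0, -0.2]],
    variances=[[0.04, 0.01], [0.05, 0.01], [0.06, 0.01]],
)
```

In the toy example the members disagree only on the first gene, so only that gene receives non-zero epistemic uncertainty.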

4. Data Strategy

4.1 Primary: Tahoe-100M

Status: ✅ Ready on HF Hub
Path: tahoebio/Tahoe-100M

Property	Value
Rows	95.6M drug-gene perturbation observations
Drugs	~1,100 (with canonical SMILES, PubChem CID, MOA)
Cell lines	50 cancer cell lines
Genes	45,134 (full transcriptome)
Matched pre/post	Yes (vehicle controls per cell line)

4.2 Supplementary Datasets

Dataset	Perturbation Type	Samples	HF Hub	Use
GDSC	Drug (251)	251K IC50	❌	IC50 validation
DepMap	Genetic (CRISPR KO)	1,000+ cell lines	❌	Genetic pert. validation
LINCS L1000	Drug + genetic	1.3M profiles	❌	Cross-dataset benchmark
Norman19	Genetic combos	91K cells	Partial	Combinatorial benchmark

4.3 Data Loading for Training

from datasets import load_dataset

# Drug metadata
drug_meta = load_dataset("tahoebio/Tahoe-100M", "drug_metadata", split="train")

# Expression data (streaming for 95.6M rows)
expr_ds = load_dataset("tahoebio/Tahoe-100M", "expression_data", 
                        split="train", streaming=True)

# Each row: genes (int64[]), expressions (float32[]), drug, cell_line_id, 
#           canonical_smiles, pubchem_cid, moa-fine, sample
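
Because each row stores expression sparsely as parallel `genes` and `expressions` arrays, a training target Δ has to be built by densifying rows and comparing treated profiles with vehicle controls. The sketch below shows one way to do that on toy data. It assumes gene IDs are 0-based column indices smaller than `n_genes` and uses a log2 fold change with a pseudocount of 1; both are assumptions to check against the dataset card, and the helper names are hypothetical.

```python
import numpy as np


def densify(genes, expressions, n_genes):
    """Scatter a sparse (genes, expressions) row into a dense vector."""
    dense = np.zeros(n_genes, dtype=np.float32)
    dense[np.asarray(genes, dtype=np.int64)] = np.asarray(expressions, dtype=np.float32)
    return dense


def log_fold_change(treated, control, pseudocount=1.0):
    """Per-gene log2 fold change of treated versus control expression."""
    return np.log2(treated + pseudocount) - np.log2(control + pseudocount)


# Toy rows standing in for one treated profile and one vehicle-control profile.
treated = densify(genes=[0, 1, 2], expressions=[1.0, 1.0, 1.0], n_genes=4)
control = densify(genes=[1, 2], expressions=[3.0, 1.0], n_genes=4)
delta_true = log_fold_change(treated, control)
```

In practice the control vector would more plausibly be an average over the vehicle-control profiles of the same cell line than a single row.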

5. Training Objective

5.1 Primary Loss: Negative Log-Likelihood with Uncertainty

L_NLL = 0.5 · Σ_g [ log(σ²_g) + (Δ_true,g − Δ_pred,g)² / σ²_g ]

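This is the heteroscedastic regression loss of Kendall & Gal (NeurIPS 2017). When σ²_g is large, the model is signalling uncertainty and the squared-error term for gene g is down-weighted; the log(σ²_g) term penalizes large variances, so the model cannot lower the loss simply by inflating σ² everywhere.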
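
A direct PyTorch transcription of L_NLL, as a sketch. Summing over genes and averaging over the batch is an assumed reduction, and the `eps` clamp is added for numerical stability; neither is specified in the plan.

```python
import torch


def heteroscedastic_nll(delta_pred, var_pred, delta_true, eps=1e-6):
    """0.5 * sum_g [log(var_g) + (true_g - pred_g)^2 / var_g], averaged over the batch."""
    var = var_pred.clamp_min(eps)  # guard against log(0) and division by zero
    per_gene = 0.5 * (torch.log(var) + (delta_true - delta_pred) ** 2 / var)
    return per_gene.sum(dim=-1).mean()


# Toy check: unit variance and unit error on 2 genes gives 0.5 * (0 + 1) per gene.
loss = heteroscedastic_nll(
    delta_pred=torch.zeros(1, 2),
    var_pred=torch.ones(1, 2),
    delta_true=torch.ones(1, 2),
)
```

PyTorch's built-in `torch.nn.GaussianNLLLoss` computes the same per-element quantity (with `full=False`), so it could be used instead if its reduction options fit.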

5.2 Auxiliary Losses

Pathway consistency: L_path = MSE(GSEA(Δ_pred), GSEA(Δ_true)) — pathways should shift coherently
Sparsity: L_sparse = ||Δ_pred||₁ / (B·G) — only genes that actually respond should shift
Risk shift (optional): L_risk = MSE(risk_post − risk_baseline, risk_shift_target) — ties back to survival
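
The sparsity and pathway terms can be sketched as below. Note the substitution: GSEA itself is a rank-based statistic and not directly differentiable, so this sketch swaps in the mean Δ over each gene set as a differentiable stand-in. That proxy, the binary membership matrix, and the function names are assumptions, not the plan's definition of L_path.

```python
import torch


def sparsity_loss(delta_pred):
    """L_sparse = ||delta_pred||_1 / (B * G)."""
    return delta_pred.abs().sum() / delta_pred.numel()


def pathway_scores(delta, membership):
    """Mean delta per pathway; membership is a (P, G) 0/1 float matrix."""
    sizes = membership.sum(dim=1).clamp_min(1.0)
    return delta @ membership.T / sizes  # shape (B, P)


def pathway_consistency_loss(delta_pred, delta_true, membership):
    """MSE between predicted and true pathway-level scores (proxy for GSEA)."""
    diff = pathway_scores(delta_pred, membership) - pathway_scores(delta_true, membership)
    return torch.mean(diff ** 2)


# Toy example: 1 sample, 4 genes, 2 pathways of 2 genes each.
membership = torch.tensor([[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]])
delta_pred = torch.tensor([[1.0, -1.0, 0.0, 0.0]])
delta_true = torch.tensor([[1.0, 1.0, 2.0, 0.0]])
l_sparse = sparsity_loss(delta_pred)
l_path = pathway_consistency_loss(delta_pred, delta_true, membership)
```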

5.3 Hyperparameters

Parameter	Value	Source
Optimizer	AdamW (β₁=0.9, β₂=0.95, wd=0.1)	Lingshu-Cell
LR schedule	2e-4 → 2e-5, cosine	Lingshu-Cell
Batch size	256	Lingshu-Cell
Mixed precision	bf16	Lingshu-Cell
MuLGIT backbone	Frozen	Transfer from TCGA
Ensemble size	K=3 models	Lakshminarayanan 2017
Epochs	50-100 (ES on val NLL)	Standard

6. Benchmark Tasks & Leakage Controls

6.1 Task Suite

Task	Data	Split	Key Metric
Drug response prediction	Tahoe-100M	Leave-drug-out (20%)	DES@50, Pearson-Δ
Genetic perturbation	DepMap/Norman19	Leave-gene-out (20%)	DES@50, Direction-match
Combinatorial pert.	Norman19	Leave-combo-out (70%)	PDS, compositionality
Drug sensitivity (IC50)	GDSC	Leave-cell-line-out	Pearson r, Spearman ρ
Pathway rescue	Tahoe-100M (aging)	Leave-pathway-out	AUC of rescue
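
The leave-drug-out, leave-gene-out, and leave-cell-line-out splits share one leakage rule: every observation of a held-out group goes to the test set. A minimal NumPy sketch, assuming one group label per observation; the function name and seed handling are illustrative.

```python
import numpy as np


def leave_group_out_split(groups, test_fraction=0.2, seed=0):
    """Hold out whole groups (e.g. drugs) so none appears in both train and test."""
    groups = np.asarray(groups)
    unique = np.unique(groups)
    rng = np.random.default_rng(seed)
    n_test = max(1, int(round(test_fraction * len(unique))))
    held_out = rng.choice(unique, size=n_test, replace=False)
    test_mask = np.isin(groups, held_out)
    return ~test_mask, test_mask, set(held_out.tolist())


# Toy example: 50 observations spread evenly over 10 drugs.
drugs = np.array([f"drug_{i % 10}" for i in range(50)])
train_mask, test_mask, held_out = leave_group_out_split(drugs, test_fraction=0.2)
```

Splitting on the group label rather than on rows is what prevents the same drug from contributing observations to both sides of the split.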

6.2 Metrics

Metric	What It Measures	Range	Target
DES@K	Fraction of true DEGs in top-K predicted	[0,1]	Higher
Pearson-Δ	Correlation of predicted vs true Δ	[-1,1]	Higher
Direction-match	Fraction of genes with correct sign	[0,1]	Higher
PDS	Discrimination between perturbations	[0,1]	Higher
MMD	Distributional fidelity in PCA space	[0,∞)	Lower
RMSE	Raw expression reconstruction error	[0,∞)	Lower
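
Three of these metrics are simple enough to sketch in NumPy. The definitions here are assumptions where the table is silent: the "true DEGs" for DES@K are taken to be the top-K genes by |Δ_true|, and direction-match is computed only over genes whose true Δ is non-zero.

```python
import numpy as np


def des_at_k(delta_true, delta_pred, k):
    """Overlap between the top-k genes by |delta| in truth and prediction."""
    true_top = set(np.argsort(-np.abs(delta_true))[:k].tolist())
    pred_top = set(np.argsort(-np.abs(delta_pred))[:k].tolist())
    return len(true_top & pred_top) / k


def pearson_delta(delta_true, delta_pred):
    """Pearson correlation between true and predicted per-gene delta."""
    return float(np.corrcoef(delta_true, delta_pred)[0, 1])


def direction_match(delta_true, delta_pred):
    """Fraction of responding genes whose predicted sign matches the true sign."""
    responding = delta_true != 0
    return float(np.mean(np.sign(delta_true[responding]) == np.sign(delta_pred[responding])))


# Toy example with 5 genes.
delta_true = np.array([2.0, -1.5, 0.1, 0.0, -0.2])
delta_pred = np.array([1.8, -0.1, 0.2, 0.05, -1.0])
des = des_at_k(delta_true, delta_pred, k=2)
r = pearson_delta(delta_true, delta_pred)
dm = direction_match(delta_true, delta_pred)
```

In the toy example the prediction recovers one of the two strongest true responders, so DES@2 is 0.5, while every responding gene has the correct sign.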

7. Comparison to Existing Baselines

Method	Gene Coverage	Drug Input	UQ	Attribution	HF-Ready
CPA	HVG (1-4K)	One-hot ID	Latent sampling	❌	❌
GEARS	All genes	One-hot ID	❌	Gene graph edges	❌
Lingshu-Cell	All 18K	One-hot ID	❌	❌	❌
AdaPert	HVG only	Semantic embed	❌	✅ KG subgraph	❌
MuLGIT-Perturb	All 45K+	SMILES→MolFormer	✅ Ensemble	✅ IG + KG	✅

8. Implementation Plan

8.1 New Files

mulgit/
├── perturb/
│   ├── __init__.py
│   ├── encoder.py          # PerturbationEncoder (drug + gene)
│   ├── conditioning.py     # FiLMConditioning
│   ├── decoder.py          # DeltaDecoderWithUncertainty
│   ├── model.py            # MuLGITPerturb
│   ├── ensemble.py         # MuLGITPerturbEnsemble
│   ├── attribution.py      # Integrated Gradients + KG subgraph
│   ├── losses.py           # NLL + pathway consistency + sparsity
│   ├── data.py             # Tahoe-100M perturbation data loader
│   ├── trainer.py          # Training loop with Trackio
│   ├── evaluate.py         # DES@K, PDS, Pearson-Δ, etc.
│   └── config.py           # MuLGITPerturbConfig

8.2 Five-Phase Execution

Phase	Duration	Deliverable
1. Tahoe-100M drug response	2 weeks	Trained model + all benchmarks on leave-drug-out
2. Cross-dataset generalization	1 week	Zero-shot evaluation on LINCS, fine-tune if needed
3. Genetic perturbations	1 week	DepMap KO prediction, compare to Geneformer
4. Combinatorial perturbation	1 week	Norman19 combos, PDS evaluation
5. Longevity-specific	1 week	Aging pathway rescue, drug ranking vs static screener

Total: 6 weeks from start to full validation.

9. Example Output

{
  "perturbation": "Rapamycin",
  "cell_line": "A549",
  "baseline_risk": 0.342,
  "predicted_post_risk": -0.187,
  "risk_shift": -0.529,
  
  "top_degs": [
    {"gene": "MTOR", "logFC": -0.23, "ci_95": [-0.45, -0.01]},
    {"gene": "FOXO3", "logFC": 0.87, "ci_95": [0.52, 1.22]},
    {"gene": "ATG5", "logFC": 1.34, "ci_95": [0.98, 1.70]}
  ],
  
  "enriched_pathways": [
    {"pathway": "Autophagy", "NES": 2.1, "pval": 0.0003},
    {"pathway": "mTOR signaling", "NES": -1.8, "pval": 0.001}
  ],
  
  "aging_signature_reversal": 0.72,
  "epistemic_uncertainty": 0.08,
  "aleatoric_uncertainty": 0.15
}

10. Risk Assessment

Risk	Mitigation
Tahoe-100M streaming slow (95.6M rows)	Local cache after first pass; pre-filter to 10M most informative rows
MolFormer fails for some SMILES	Fallback to Morgan fingerprints (RDKit already in deps)
MuLGIT backbone overfits to TCGA	Include tissue type as conditioning; evaluate zero-shot on LINCS
GPU memory for 45K-gene output	Start with L1000 landmarks (978); scale after validation
Combinatorial prediction too hard	Treat Phase 4 as a stretch goal; single-perturbation prediction alone is impactful

References

Lotfollahi et al., "CPA," arxiv:2011.03086 (2021)
Roohani et al., "GEARS," Nature Biotech 2023. arxiv:2205.13986
Chen et al., "Lingshu-Cell," arxiv:2603.25240 (2026)
Li et al., "AdaPert," arxiv:2602.18885 (2026)
Wu et al., "PerturBench," arxiv:2408.10609 (2024)
Waqas et al., "SeNMo," arxiv:2405.08226 (2024)
Ross et al., "MolFormer," Nature Machine Intelligence 2022
Theodoris et al., "Geneformer," Nature 2023
Perez et al., "FiLM," AAAI 2018
Kendall & Gal, "Bayesian UQ," NeurIPS 2017
Lakshminarayanan et al., "Deep Ensembles," NeurIPS 2017
Zhang et al., "Tahoe-100M," bioRxiv 2025