# MuLGIT-Perturb: Perturbation Layer Extension
## From Static Biomarker Ranking to Counterfactual Molecular State Prediction

**Status:** Design Specification — Study Plan  
**Repository:** https://huggingface.co/vedatonuryilmaz/MuLGIT  
**Date:** 2026-05-09  

---

## 1. Motivation: What Static Models Miss

MuLGIT currently identifies **candidate causal features** associated with survival/longevity and **ranks drugs** by their static properties (MOA, clinical status, drug-likeness). A static ranking cannot say how a specific patient will respond: a drug ranked highly by MOA relevance may actually **worsen** the molecular state of a given patient. What's missing is the ability to answer:

> **"What will this perturbation do to this patient's molecular state?"**

Moving beyond static ranking to counterfactual prediction transforms MuLGIT from a descriptive tool into a prescriptive one — useful for drug repurposing, mechanism inference, target validation, and personalized treatment-response prioritization.

---

## 2. Capability Specification

### Input
1. **Baseline omics state** — the patient/cell's molecular profile before perturbation:
   - DNA: methylation β-values, CNV copy numbers
   - RNA: mRNA expression, miRNA expression
   - (Future) Chromatin: WGBS, ATAC-seq GeneActivity
2. **Perturbation descriptor** — what is being applied:
   - Drug: SMILES string → molecular fingerprint or ChemBERTa embedding
   - Genetic perturbation: gene ID + perturbation type (CRISPR KO, CRISPRi, OE, shRNA)
   - Drug combination: set of drug fingerprints + dose ratios

### Output
1. **Predicted molecular state shift (Δ)** — per-gene or per-pathway expression change:
   - `ŷ_post = f(baseline, perturbation)` → predicted post-perturbation expression
   - `Δ = ŷ_post - baseline` → direction and magnitude of change per gene
2. **Uncertainty quantification** — per-gene prediction intervals:
   - Epistemic uncertainty via deep ensembles (3-5 models with different seeds)
   - Aleatoric uncertainty via predicted variance (heteroscedastic output head)
3. **Mechanistic attribution** — why this perturbation causes this shift:
   - Gene set activation scores (MSigDB pathways, Reactome, KEGG)
   - Top differentially expressed genes (DEGs) with effect sizes
   - Perturbation-conditioned subgraph of activated molecular interactions (from biological knowledge graphs)
   - Counterfactual decomposition: "If we blocked pathway X, the effect on gene Y would change by Z"

---

## 3. Architecture: MuLGIT-Perturb

### 3.1 Design Principles

1. **Reuse MuLGIT encoders** — the CentralDogmaFusion SNN encoders for methylation, CNV, mRNA, miRNA are already trained on TCGA survival prediction. These become **frozen feature extractors** for the perturbation layer, providing biologically-grounded latent representations.

2. **Perturbation-conditioned residual prediction** — the model predicts the **difference** between pre- and post-perturbation molecular states, not the absolute state. This is the standard in the field (CPA, GEARS, Lingshu-Cell all use delta prediction).

3. **Central dogma as structural prior** — the perturbation flows through the same biological layers: drug → chromatin state → DNA methylation → mRNA expression → miRNA → predicted phenotype shift.

4. **Uncertainty via deep ensembles**: train 3-5 identical-architecture models with different random seeds. The variance of the predicted Δ across ensemble members estimates epistemic uncertainty. Deep ensembles are a simple, widely used baseline for predictive uncertainty (Lakshminarayanan et al., NeurIPS 2017).

5. **Chemical structure encoding** — use MolFormer or ChemBERTa-2 (pretrained, frozen) to encode drug SMILES into fixed-size embeddings, replacing the simple 1D-CNN in the current `drug_target.py`.

### 3.2 Architecture Diagram

```
                        ┌──────────────────────────────────┐
                        │   PERTURBATION ENCODER            │
                        │   ┌──────────┐   ┌─────────────┐ │
     Drug SMILES ──────►│   │ MolFormer│   │ Geneformer  │◄── Genetic Pert.
                        │   │  (frozen)│   │  (frozen)   │ │
                        │   └────┬─────┘   └──────┬──────┘ │
                        │        │                 │        │
                        │   ┌────▼─────────────────▼──────┐ │
                        │   │   Fusion Gate (learned α)    │ │
                        │   │   z_pert = α·z_drug +        │ │
                        │   │           (1-α)·z_gene       │ │
                        │   └─────────────┬────────────────┘ │
                        └─────────────────┼──────────────────┘
                                          │ z_pert ∈ R^768
                                          │
    ┌─────────────────────────────────────┼─────────────────────────────┐
    │                    MULGIT (frozen feature extractor)               │
    │                                                                   │
    │  BASELINE STATE                                                   │
    │  Methylation ─► SNN Encoder ─► z_meth                             │
    │  CNV         ─► SNN Encoder ─► z_cnv                              │
    │  mRNA        ─► SNN Encoder ─► z_mrna                             │
    │  miRNA       ─► SNN Encoder ─► z_mirna                            │
    │                                                                   │
    │  CentralDogmaFusion                                               │
    │  z_dna = [z_meth, z_cnv]                                          │
    │  z_dna_rna = DNA→RNA(z_dna, z_mrna)                               │
    │  z_fused = FullFusion(z_dna_rna, z_mrna, z_mirna)                 │
    │                    │                                              │
    └────────────────────┼──────────────────────────────────────────────┘
                         │ z_fused ∈ R^48 (baseline latent state)
                         │
    ┌────────────────────▼──────────────────────────────────────────────┐
    │              PERTURBATION RESPONSE PREDICTOR (trained)             │
    │                                                                    │
    │  ┌─────────────────────────────────────────────────────────┐      │
    │  │  Condition Fusion                                       │      │
    │  │  z_cond = FiLM(z_fused, z_pert)                        │      │
    │  │  (Feature-wise Linear Modulation: scale + shift from    │      │
    │  │   perturbation embedding applied to baseline state)     │      │
    │  └────────────────────────────┬───────────────────────────┘      │
    │                               │                                    │
    │  ┌────────────────────────────▼───────────────────────────┐      │
    │  │  Delta Decoder (SNNStack)                               │      │
    │  │  z_cond ─► [1024] ─► [512] ─► [256] ─► [128] ─► Δ     │      │
    │  │                                          └─► σ² (UQ)   │      │
    │  │  Δ ∈ R^G (per-gene logFC prediction)                   │      │
    │  │  σ² ∈ R^G (per-gene predicted variance)                │      │
    │  └────────────────────────────┬───────────────────────────┘      │
    │                               │                                    │
    │  ŷ_post = y_baseline + Δ                                         │
    │                                                                    │
    └───────────────────────────────────────────────────────────────────┘
                                   │
    ┌──────────────────────────────▼────────────────────────────────────┐
    │                     OUTPUT LAYER                                   │
    │                                                                    │
    │  ┌──────────────┐  ┌──────────────┐  ┌──────────────────────────┐│
    │  │   Δ per gene │  │   σ² per gene│  │  Pathway Enrichment      ││
    │  │  (logFC)      │  │  (variance)  │  │  (GSEA on Δ)            ││
    │  └──────────────┘  └──────────────┘  └──────────────────────────┘│
    │                                                                    │
    │  ┌──────────────────────────────────────────────────────────────┐ │
    │  │  Mechanistic Attribution                                     │ │
    │  │  ┌─────────────────┐  ┌──────────────────────────────────┐   │ │
    │  │  │ Integrated       │  │ Perturbation-Conditioned        │   │ │
    │  │  │ Gradients on Δ   │  │ Knowledge Subgraph              │   │ │
    │  │  │ (gene→pathway    │  │ (AdaPert-style: GNN over        │   │ │
    │  │  │  attribution)    │  │  GO/Reactome/STRING + drug)     │   │ │
    │  │  └─────────────────┘  └──────────────────────────────────┘   │ │
    │  └──────────────────────────────────────────────────────────────┘ │
    └───────────────────────────────────────────────────────────────────┘
```
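The Integrated Gradients box in the diagram can be illustrated with a minimal NumPy sketch. It assumes the gradient of one predicted output with respect to the input is available as a callable (`f_grad`, a hypothetical name); the toy function `f` stands in for a single gene's predicted Δ and is not the MuLGIT-Perturb model.

```python
import numpy as np

def integrated_gradients(f_grad, x, baseline, steps=64):
    """Midpoint Riemann approximation of Integrated Gradients.

    f_grad: callable returning df/dx at a point (same shape as x).
    """
    alphas = (np.arange(steps) + 0.5) / steps            # midpoints in (0, 1)
    path = baseline + alphas[:, None] * (x - baseline)   # (steps, d)
    grads = np.stack([f_grad(p) for p in path])          # (steps, d)
    return (x - baseline) * grads.mean(axis=0)

# Toy scalar output standing in for one gene's predicted delta (illustrative only).
w = np.array([0.5, -1.0, 2.0])

def f(z):
    return np.tanh(z @ w)

def f_grad(z):
    return (1.0 - np.tanh(z @ w) ** 2) * w

x = np.array([1.0, 0.5, -0.2])
x0 = np.zeros(3)                      # reference input (all-zero baseline)
attr = integrated_gradients(f_grad, x, x0)

# Completeness: attributions sum (approximately) to f(x) - f(x0).
print(attr, attr.sum(), f(x) - f(x0))
```

The completeness check at the end is a useful sanity test when wiring attribution into a real model: if the attributions do not sum to the output difference, the step count is too low or the gradient is wrong.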

### 3.3 Key Components

#### 3.3.1 PerturbationEncoder

Encodes both drug and genetic perturbations into a unified embedding:
- Drug: SMILES → MolFormer (frozen, 768-dim) or ChemBERTa-2
- Genetic: Gene symbol → Geneformer gene embedding (frozen)
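- Fusion: Learned convex combination `z_pert = α·z_drug + (1-α)·z_gene` with `α ∈ [0,1]`; both embeddings must first be projected to a shared dimension (768 in the diagram) for the sum to be defined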
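A minimal NumPy sketch of the fusion gate follows. It assumes linear projections (`W_drug`, `W_gene`, hypothetical names) that map both embeddings to the 768-dimensional `z_pert` space, and it parameterizes `α` as the sigmoid of a learnable scalar. The gene embedding size `D_GENE = 512` is an assumption, not a value from this specification.

```python
import numpy as np

rng = np.random.default_rng(0)
D_DRUG, D_GENE, D_PERT = 768, 512, 768   # D_GENE is an assumed size

# Projections into the shared perturbation space (randomly initialized here).
W_drug = rng.normal(0.0, 0.02, (D_DRUG, D_PERT))
W_gene = rng.normal(0.0, 0.02, (D_GENE, D_PERT))
alpha_logit = 0.0                        # learnable scalar in a real model

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fuse(z_drug, z_gene, alpha_logit):
    """Convex combination of projected drug and gene embeddings."""
    alpha = sigmoid(alpha_logit)         # strictly inside (0, 1)
    return alpha * (z_drug @ W_drug) + (1.0 - alpha) * (z_gene @ W_gene)

z_pert = fuse(rng.normal(size=(4, D_DRUG)), rng.normal(size=(4, D_GENE)), alpha_logit)
print(z_pert.shape)
```

With a sigmoid gate, `α` never reaches exactly 0 or 1, so a drug-only or gene-only perturbation would need either a masked input or a per-sample gate. Which option MuLGIT-Perturb uses is left open here.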

#### 3.3.2 FiLM Conditioning Layer

Feature-wise Linear Modulation applies perturbation as scale + shift to baseline state:
```
z_cond = γ(z_pert) ⊙ z_fused + β(z_pert)
```
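Concatenation followed by a linear layer can only shift features additively; FiLM additionally rescales each latent feature, so the perturbation can amplify or suppress individual components of the baseline state.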
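As a sketch of the FiLM formula, the following NumPy code computes `γ` and `β` as linear functions of `z_pert`, using the dimensions from the architecture diagram (768 and 48). The weight names and the identity-style initialization (`γ = 1`, `β = 0` when `z_pert` is zero) are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
D_PERT, D_FUSED = 768, 48

# gamma starts at 1 and beta at 0, so a zero perturbation leaves z_fused unchanged.
W_gamma = rng.normal(0.0, 0.02, (D_PERT, D_FUSED))
b_gamma = np.ones(D_FUSED)
W_beta = rng.normal(0.0, 0.02, (D_PERT, D_FUSED))
b_beta = np.zeros(D_FUSED)

def film(z_fused, z_pert):
    """z_cond = gamma(z_pert) * z_fused + beta(z_pert), elementwise."""
    gamma = z_pert @ W_gamma + b_gamma
    beta = z_pert @ W_beta + b_beta
    return gamma * z_fused + beta

z_fused = rng.normal(size=(4, D_FUSED))
z_cond = film(z_fused, rng.normal(size=(4, D_PERT)))
print(z_cond.shape)
```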

#### 3.3.3 DeltaDecoderWithUncertainty

Predicts per-gene logFC (Δ) and per-gene prediction variance (σ²):
- Shared SNNStack: z_cond → [1024] → [512] → [256] → [128]
- Δ head: Linear(128, n_genes) → predicted expression change
- σ² head: Linear(128, n_genes) → Softplus → predicted variance (>0)
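The decoder's forward pass can be sketched in NumPy as below. This assumes that "SNN" means a self-normalizing network with SELU activations and LeCun-normal initialization, and it uses 978 output genes (the L1000 landmark count mentioned in the risk assessment) to keep the example small. Function names are hypothetical.

```python
import numpy as np

SELU_ALPHA, SELU_SCALE = 1.6732632423543772, 1.0507009873554805

def selu(x):
    # expm1 is only evaluated on the non-positive part to avoid overflow.
    return SELU_SCALE * np.where(x > 0, x, SELU_ALPHA * np.expm1(np.minimum(x, 0.0)))

def softplus(x):
    return np.logaddexp(0.0, x)          # numerically stable log(1 + exp(x))

def init_decoder(rng, d_in=48, hidden=(1024, 512, 256, 128), n_genes=978):
    dims = (d_in, *hidden)
    layers = [(rng.normal(0.0, np.sqrt(1.0 / a), (a, b)), np.zeros(b))
              for a, b in zip(dims[:-1], dims[1:])]
    head_delta = (rng.normal(0.0, 0.01, (hidden[-1], n_genes)), np.zeros(n_genes))
    head_var = (rng.normal(0.0, 0.01, (hidden[-1], n_genes)), np.zeros(n_genes))
    return layers, head_delta, head_var

def decode(z_cond, params):
    layers, (W_d, b_d), (W_v, b_v) = params
    h = z_cond
    for W, b in layers:                  # shared SELU stack
        h = selu(h @ W + b)
    delta = h @ W_d + b_d                # per-gene logFC
    var = softplus(h @ W_v + b_v)        # per-gene variance, strictly positive
    return delta, var

rng = np.random.default_rng(0)
delta, var = decode(rng.normal(size=(4, 48)), init_decoder(rng))
print(delta.shape, var.shape, bool(var.min() > 0))
```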

#### 3.3.4 MuLGITPerturbEnsemble

K=3-5 identical models trained with different seeds:
- Epistemic uncertainty = variance of Δ across ensemble
- Aleatoric uncertainty = mean of the predicted σ² across ensemble members
- Total uncertainty = epistemic + aleatoric
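The decomposition above can be written out as a short NumPy sketch. The function name is hypothetical, the input numbers are made up for illustration, and the 95% interval at the end assumes an approximately Gaussian predictive distribution.

```python
import numpy as np

def ensemble_uncertainty(deltas, variances):
    """deltas, variances: shape (K, n_genes), one row per ensemble member."""
    deltas = np.asarray(deltas, dtype=float)
    variances = np.asarray(variances, dtype=float)
    mean_delta = deltas.mean(axis=0)
    epistemic = deltas.var(axis=0)       # spread of member predictions (ddof=0)
    aleatoric = variances.mean(axis=0)   # average predicted noise variance
    return mean_delta, epistemic, aleatoric, epistemic + aleatoric

# Illustrative values for K=3 members and 2 genes.
deltas = [[0.8, -0.2], [1.0, -0.3], [0.9, -0.1]]
variances = [[0.04, 0.01], [0.05, 0.01], [0.03, 0.01]]
mean_delta, epistemic, aleatoric, total = ensemble_uncertainty(deltas, variances)

# Approximate 95% prediction interval under a Gaussian assumption.
half_width = 1.96 * np.sqrt(total)
ci_low, ci_high = mean_delta - half_width, mean_delta + half_width
print(mean_delta, total, ci_low, ci_high)
```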

---

## 4. Data Strategy

### 4.1 Primary: Tahoe-100M

**Status:** ✅ Ready on HF Hub  
**Path:** `tahoebio/Tahoe-100M`

| Property | Value |
|----------|-------|
| Rows | 95.6M single-cell expression profiles (one row per cell) |
| Drugs | ~1,100 (with canonical SMILES, PubChem CID, MOA) |
| Cell lines | 50 cancer cell lines |
| Genes | 45,134 (full transcriptome) |
| Control reference | Vehicle controls per cell line (unpaired: pre- and post-perturbation profiles come from different cells) |

### 4.2 Supplementary Datasets

| Dataset | Perturbation Type | Samples | HF Hub | Use |
|---------|-------------------|---------|--------|-----|
| GDSC | Drug (251) | 251K IC50 | ❌ | IC50 validation |
| DepMap | Genetic (CRISPR KO) | 1,000+ cell lines | ❌ | Genetic pert. validation |
| LINCS L1000 | Drug + genetic | 1.3M profiles | ❌ | Cross-dataset benchmark |
| Norman19 | Genetic combos | 91K cells | Partial | Combinatorial benchmark |

### 4.3 Data Loading for Training

```python
from datasets import load_dataset

# Drug metadata
drug_meta = load_dataset("tahoebio/Tahoe-100M", "drug_metadata", split="train")

# Expression data (streaming for 95.6M rows)
expr_ds = load_dataset("tahoebio/Tahoe-100M", "expression_data", 
                        split="train", streaming=True)

# Each row: genes (int64[]), expressions (float32[]), drug, cell_line_id, 
#           canonical_smiles, pubchem_cid, moa-fine, sample
```
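Because treated and control profiles are not measured on the same cells, a training target `Δ_true` has to be built from group-level statistics. One possible construction is sketched below: a per-gene difference of mean log1p expression between treated and vehicle-control rows. The sketch assumes each row stores parallel `genes` and `expressions` arrays as in the comment above; the toy rows, the function name, and the omission of library-size normalization are simplifications for illustration.

```python
import numpy as np

def pseudobulk_mean(rows, n_genes):
    """Mean log1p expression per gene over sparse rows (absent genes count as 0)."""
    total = np.zeros(n_genes)
    n_rows = 0
    for row in rows:
        genes = np.asarray(row["genes"], dtype=int)
        expr = np.asarray(row["expressions"], dtype=float)
        np.add.at(total, genes, np.log1p(expr))   # scatter-add into gene slots
        n_rows += 1
    return total / max(n_rows, 1)

# Toy rows with 3 genes; real rows would come from the streaming dataset.
control = [{"genes": [0, 2], "expressions": [3.0, 1.0]},
           {"genes": [0], "expressions": [1.0]}]
treated = [{"genes": [0, 1], "expressions": [0.0, 7.0]}]

delta_true = pseudobulk_mean(treated, 3) - pseudobulk_mean(control, 3)
print(delta_true)
```

In practice the control group would be restricted to the same cell line (and ideally the same plate) as the treated group before taking the difference.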

---

## 5. Training Objective

### 5.1 Primary Loss: Negative Log-Likelihood with Uncertainty

```
L_NLL = 0.5 · Σ_g [ log(σ²_g) + (Δ_true,g − Δ_pred,g)² / σ²_g ]
```
Heteroscedastic regression loss (Kendall & Gal, NeurIPS 2017). When σ²_g is large, the model is signaling uncertainty and the squared-error term for that gene is down-weighted; the log(σ²_g) term penalizes inflating the variance everywhere.
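A small NumPy sketch of this loss is shown here, summing over genes and averaging over the batch. The variance floor `eps` is an added safeguard, not part of the formula above, and the function name is hypothetical.

```python
import numpy as np

def gaussian_nll(delta_true, delta_pred, var, eps=1e-6):
    """0.5 * sum_g [log(var_g) + (true_g - pred_g)^2 / var_g], batch-averaged."""
    var = np.maximum(var, eps)           # guard against log(0) and division by 0
    per_gene = 0.5 * (np.log(var) + (delta_true - delta_pred) ** 2 / var)
    return per_gene.sum(axis=-1).mean()

true = np.array([[2.0, 0.0]])
pred = np.array([[0.0, 0.0]])

loss_unit = gaussian_nll(true, pred, np.array([[1.0, 1.0]]))
loss_wide = gaussian_nll(true, pred, np.array([[4.0, 1.0]]))
print(loss_unit, loss_wide)
```

Raising the predicted variance on the badly predicted gene lowers the loss in this example, which is the down-weighting effect described above.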

### 5.2 Auxiliary Losses

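1. **Pathway consistency:** `L_path = MSE(GSEA(Δ_pred), GSEA(Δ_true))` (pathways should shift coherently). GSEA's rank-based enrichment score is not differentiable, so training needs a differentiable pathway score in its place, for example the mean Δ over each gene set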
2. **Sparsity:** `L_sparse = ||Δ_pred||₁ / (B·G)` — only genes that actually respond should shift
3. **Risk shift (optional):** `L_risk = MSE(risk_post − risk_baseline, risk_shift_target)` — ties back to survival
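The first two auxiliary losses can be sketched in NumPy as follows. Note the substitution: instead of GSEA, the pathway score here is the mean Δ over each gene set, chosen as an assumption because it is differentiable. The binary `membership` matrix (pathways × genes) and all function names are hypothetical.

```python
import numpy as np

def pathway_scores(delta, membership):
    """Mean delta per gene set. delta: (B, G); membership: (P, G) binary."""
    weights = membership / membership.sum(axis=1, keepdims=True)
    return delta @ weights.T             # (B, P)

def pathway_loss(delta_pred, delta_true, membership):
    diff = pathway_scores(delta_pred, membership) - pathway_scores(delta_true, membership)
    return float(np.mean(diff ** 2))

def sparsity_loss(delta_pred):
    return float(np.abs(delta_pred).sum() / delta_pred.size)   # ||delta||_1 / (B*G)

membership = np.array([[1, 1, 0, 0],     # pathway A: genes 0 and 1
                       [0, 0, 1, 1]])    # pathway B: genes 2 and 3
delta_true = np.array([[1.0, 1.0, 0.0, 0.0]])
delta_pred = np.array([[1.0, 1.0, 0.0, 2.0]])

l_path = pathway_loss(delta_pred, delta_true, membership)
l_sparse = sparsity_loss(delta_pred)
print(l_path, l_sparse)
```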

### 5.3 Hyperparameters

| Parameter | Value | Source |
|-----------|-------|--------|
| Optimizer | AdamW (β₁=0.9, β₂=0.95, wd=0.1) | Lingshu-Cell |
| LR schedule | 2e-4 → 2e-5, cosine | Lingshu-Cell |
| Batch size | 256 | Lingshu-Cell |
| Mixed precision | bf16 | Lingshu-Cell |
| MuLGIT backbone | Frozen | Transfer from TCGA |
| Ensemble size | K=3 models | Lakshminarayanan 2017 |
| Epochs | 50-100 (early stopping on validation NLL) | Standard |
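The learning-rate row can be read as a cosine decay from 2e-4 to 2e-5. A minimal sketch using only the standard library follows; it assumes no warmup phase, since none is specified in the table, and the function name is hypothetical.

```python
import math

def cosine_lr(step, total_steps, lr_max=2e-4, lr_min=2e-5):
    """Cosine decay from lr_max at step 0 to lr_min at total_steps."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

for step in (0, 50, 100):
    print(step, cosine_lr(step, total_steps=100))
```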

---

## 6. Benchmark Tasks & Leakage Controls

### 6.1 Task Suite

| Task | Data | Split | Key Metric |
|------|------|-------|------------|
| Drug response prediction | Tahoe-100M | Leave-drug-out (20%) | DES@50, Pearson-Δ |
| Genetic perturbation | DepMap/Norman19 | Leave-gene-out (20%) | DES@50, Direction-match |
| Combinatorial pert. | Norman19 | Leave-combo-out (70%) | PDS, compositionality |
| Drug sensitivity (IC50) | GDSC | Leave-cell-line-out | Pearson r, Spearman ρ |
| Pathway rescue | Tahoe-100M (aging) | Leave-pathway-out | AUC of rescue |
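The leave-drug-out split is the main leakage control in this table: every row of a held-out drug must go to the test set, so the split is made over unique drug names rather than over rows. A standard-library sketch under that reading is given below; the function name and drug list are illustrative.

```python
import random

def leave_drug_out_split(drugs, holdout_frac=0.2, seed=0):
    """Split row indices so that held-out drugs never appear in training."""
    unique = sorted(set(drugs))          # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(unique)
    n_test = max(1, round(holdout_frac * len(unique)))
    held_out = set(unique[:n_test])
    train_idx = [i for i, d in enumerate(drugs) if d not in held_out]
    test_idx = [i for i, d in enumerate(drugs) if d in held_out]
    return train_idx, test_idx, held_out

# One drug label per row (illustrative).
rows = ["rapamycin", "metformin", "rapamycin", "dasatinib",
        "imatinib", "metformin", "erlotinib"]
train_idx, test_idx, held_out = leave_drug_out_split(rows)
print(held_out, train_idx, test_idx)
```

The same pattern applies to the leave-gene-out and leave-cell-line-out rows by swapping the grouping key.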

### 6.2 Metrics

| Metric | What It Measures | Range | Target |
|--------|-----------------|-------|--------|
| **DES@K** | Fraction of true DEGs recovered among the top-K predicted genes | [0,1] | Higher |
| **Pearson-Δ** | Correlation of predicted vs true Δ | [-1,1] | Higher |
| **Direction-match** | Fraction of genes with correct sign | [0,1] | Higher |
| **PDS** | Discrimination between perturbations | [0,1] | Higher |
| **MMD** | Distributional fidelity in PCA space | [0,∞) | Lower |
| **RMSE** | Raw expression reconstruction error | [0,∞) | Lower |
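Three of these metrics are simple enough to sketch directly in NumPy. The DES@K variant below makes an assumption that the table does not state: "true DEGs" are taken to be the top-K genes by absolute true Δ, rather than genes passing a statistical test. Function names and the example vectors are illustrative.

```python
import numpy as np

def des_at_k(delta_pred, delta_true, k):
    """Overlap between top-k genes by |predicted delta| and by |true delta|."""
    top_pred = set(np.argsort(-np.abs(delta_pred))[:k].tolist())
    top_true = set(np.argsort(-np.abs(delta_true))[:k].tolist())
    return len(top_pred & top_true) / k

def pearson_delta(delta_pred, delta_true):
    return float(np.corrcoef(delta_pred, delta_true)[0, 1])

def direction_match(delta_pred, delta_true):
    return float(np.mean(np.sign(delta_pred) == np.sign(delta_true)))

true = np.array([2.0, -1.5, 0.1, -0.05, 0.8])
pred = np.array([1.8, -0.2, 0.3, 0.10, 1.1])

des2 = des_at_k(pred, true, k=2)
r = pearson_delta(pred, true)
dm = direction_match(pred, true)
print(des2, r, dm)
```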

---

## 7. Comparison to Existing Baselines

| Method | Gene Coverage | Drug Input | UQ | Attribution | HF-Ready |
|--------|--------------|------------|-----|-------------|----------|
| CPA | HVG (1-4K) | One-hot ID | Latent sampling | ❌ | ❌ |
| GEARS | All genes | One-hot ID | ❌ | Gene graph edges | ❌ |
| Lingshu-Cell | All 18K | One-hot ID | ❌ | ❌ | ❌ |
| AdaPert | HVG only | Semantic embed | ❌ | ✅ KG subgraph | ❌ |
| **MuLGIT-Perturb** | **All 45K+** | **SMILES→MolFormer** | **✅ Ensemble** | **✅ IG + KG** | **✅** |

---

## 8. Implementation Plan

### 8.1 New Files

```
mulgit/
├── perturb/
│   ├── __init__.py
│   ├── encoder.py          # PerturbationEncoder (drug + gene)
│   ├── conditioning.py     # FiLMConditioning
│   ├── decoder.py          # DeltaDecoderWithUncertainty
│   ├── model.py            # MuLGITPerturb
│   ├── ensemble.py         # MuLGITPerturbEnsemble
│   ├── attribution.py      # Integrated Gradients + KG subgraph
│   ├── losses.py           # NLL + pathway consistency + sparsity
│   ├── data.py             # Tahoe-100M perturbation data loader
│   ├── trainer.py          # Training loop with Trackio
│   ├── evaluate.py         # DES@K, PDS, Pearson-Δ, etc.
│   └── config.py           # MuLGITPerturbConfig
```

### 8.2 Five-Phase Execution

| Phase | Duration | Deliverable |
|-------|----------|-------------|
| 1. Tahoe-100M drug response | 2 weeks | Trained model + all benchmarks on leave-drug-out |
| 2. Cross-dataset generalization | 1 week | Zero-shot evaluation on LINCS, fine-tune if needed |
| 3. Genetic perturbations | 1 week | DepMap KO prediction, compare to Geneformer |
| 4. Combinatorial perturbation | 1 week | Norman19 combos, PDS evaluation |
| 5. Longevity-specific | 1 week | Aging pathway rescue, drug ranking vs static screener |

**Total: 6 weeks from start to full validation.**

---

## 9. Example Output

```json
{
  "perturbation": "Rapamycin",
  "cell_line": "A549",
  "baseline_risk": 0.342,
  "predicted_post_risk": -0.187,
  "risk_shift": -0.529,
  
  "top_degs": [
    {"gene": "MTOR", "logFC": -0.23, "ci_95": [-0.45, -0.01]},
    {"gene": "FOXO3", "logFC": 0.87, "ci_95": [0.52, 1.22]},
    {"gene": "ATG5", "logFC": 1.34, "ci_95": [0.98, 1.70]}
  ],
  
  "enriched_pathways": [
    {"pathway": "Autophagy", "NES": 2.1, "pval": 0.0003},
    {"pathway": "mTOR signaling", "NES": -1.8, "pval": 0.001}
  ],
  
  "aging_signature_reversal": 0.72,
  "epistemic_uncertainty": 0.08,
  "aleatoric_uncertainty": 0.15
}
```

---

## 10. Risk Assessment

| Risk | Mitigation |
|------|------------|
| Tahoe-100M streaming slow (95.6M rows) | Local cache after first pass; pre-filter to 10M most informative rows |
| MolFormer fails for some SMILES | Fallback to Morgan fingerprints (RDKit already in deps) |
| MuLGIT backbone overfits to TCGA | Include tissue type as conditioning; evaluate zero-shot on LINCS |
| GPU memory for 45K-gene output | Start with L1000 landmarks (978); scale after validation |
| Combinatorial prediction proves too hard | Treat Phase 4 as a stretch goal; single-perturbation prediction alone is impactful |

---

## References

1. Lotfollahi et al., "CPA," arxiv:2011.03086 (2021)
2. Roohani et al., "GEARS," Nature Biotech 2023. arxiv:2205.13986
3. Chen et al., "Lingshu-Cell," arxiv:2603.25240 (2026)
4. Li et al., "AdaPert," arxiv:2602.18885 (2026)
5. He et al., "PerturBench," arxiv:2408.10609 (2024)
6. El Nahhas et al., "SeNMo," arxiv:2405.08226 (2024)
7. Ross et al., "MolFormer," Nature Machine Intelligence 2022
8. Theodoris et al., "Geneformer," Nature 2023
9. Perez et al., "FiLM," AAAI 2018
10. Kendall & Gal, "Bayesian UQ," NeurIPS 2017
11. Lakshminarayanan et al., "Deep Ensembles," NeurIPS 2017
12. Mendez-Lucio et al., "Tahoe-100M," bioRxiv 2024