GEMEO/SUS v6 recurrence-aware (RAVEN) — new-onset Top-1 60.1% vs baseline 38.2%, defeats autocorrelation trap. GEMEO Arch v2.0 Principle 7 proven.

Files changed (13)

LICENSE +34 -0
README.md +131 -0
benchmarks/v4_newonset_eval.json +81 -0
benchmarks/v6_raven_newonset.json +30 -0
cdf_v6_raven.pt +3 -0
src/__init__.py +33 -0
src/adaln_zero.py +148 -0
src/diffusion_forcing_v13.py +314 -0
src/eval_sota.py +290 -0
src/meds_export.py +336 -0
src/primekg_attention.py +262 -0
src/sample.py +230 -0
src/wsd_scheduler.py +72 -0

LICENSE ADDED

	@@ -0,0 +1,34 @@

+Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)
+Copyright (c) 2026 Raras.ai / RarasNet
+Authors: Dimas Quintas Verdial and contributors.
+You are free to:
+  - Share — copy and redistribute the material in any medium or format
+  - Adapt — remix, transform, and build upon the material
+Under the following terms:
+  - Attribution — You must give appropriate credit, provide a link to the
+    license, and indicate if changes were made.
+  - NonCommercial — You may not use the material for commercial purposes.
+  - No additional restrictions — You may not apply legal terms or
+    technological measures that legally restrict others from doing anything
+    the license permits.
+Full legal text: https://creativecommons.org/licenses/by-nc/4.0/legalcode
+NOT FOR CLINICAL USE
+====================
+This model is released for research purposes only. It is NOT a medical
+device. It is NOT approved by ANVISA, FDA, EMA, or any regulatory body.
+Outputs MUST NOT be used to inform diagnosis, treatment, or any clinical
+decision without explicit human physician oversight and applicable
+regulatory clearance.
+Compliance scope (Brazilian SUS data):
+  - LGPD (Lei Geral de Proteção de Dados, Brazil)
+  - CNS-hash linkage performed under data-use agreement with DATASUS
+  - Resolution CNS 466/2012 + 510/2016 (Brazilian ethics framework)
+For commercial licensing or clinical deployment partnerships, contact:
+dimas@raras.ai

README.md ADDED

	@@ -0,0 +1,131 @@

+---
+license: cc-by-nc-4.0
+language: [pt, en]
+tags:
+  - world-model
+  - patient-digital-twin
+  - rare-disease
+  - diffusion-forcing
+  - recurrence-aware
+  - new-onset-prediction
+  - brazilian-sus
+  - datasus
+  - primekg
+library_name: pytorch
+pipeline_tag: time-series-forecasting
+extra_gated_prompt: >-
+  Research only. Not a medical device. No clinical use without physician
+  oversight and applicable regulatory clearance.
+extra_gated_fields:
+  Name: text
+  Affiliation: text
+  Intended use: text
+  I agree to non-clinical research use only: checkbox
+---
+# GEMEO/SUS v6 — Recurrence-Aware World Model (defeats the autocorrelation trap)
+> The flagship recurrence-aware instance of [**GEMEO Architecture v2.0**](https://huggingface.co/Raras-AI/gemeo-arch).
+> Implements Principle 7 (RAVEN recurrence-weighted loss) and is the first
+> GEMEO instance to **beat a frequency baseline on the genuinely hard
+> new-onset task** — predicting the *first* occurrence of a clinical event,
+> with repeats excluded.
+**Family:** [`gemeo-arch`](https://huggingface.co/Raras-AI/gemeo-arch) (architecture v2.0) · [`gemeo-sus-v4`](https://huggingface.co/Raras-AI/gemeo-sus-v4) (predecessor) · **`gemeo-sus-v6`** (this, recurrence-aware flagship) · [`gemeo-twin-stack`](https://huggingface.co/Raras-AI/gemeo-twin-stack)
+## Why this model exists — the autocorrelation trap
+GEMEO/SUS v4 reported gap-fill Top-10 = 100%. Rigorous re-evaluation showed
+this was a **metric artifact**: in Brazilian SUS APAC data, **82.2% of events
+are repeats** (a patient on an orphan drug receives the same monthly dispensing
+code), and only **17.8% are first occurrences**. On the genuinely hard
+**new-onset task** (predict the first occurrence, repeats excluded), v4 scored
+Top-1 **23.6% — below a frequency baseline (38.2%)**. This is the documented
+"repeated event tokens inflate metrics" pitfall (RAVEN, arXiv 2603.24562;
+NEP, arXiv 2509.25591).
+**v6 fixes it.** Following RAVEN, the training loss scales each token by
+`w = max(λ^count, w_min)` (λ = 0.25, w_min = 0.02), where `count` is the number
+of earlier occurrences of the same event. The first occurrence therefore
+carries full weight (w = 1), while the 12th monthly repeat is floored at
+w_min = 0.02. The model is pushed to learn novelty, not autocorrelation.
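The weighting rule can be made concrete with a small sketch. It assumes `count` is the number of times the same event code has already appeared earlier in the patient's sequence; the helper name `raven_weights` is hypothetical and is not taken from the repository's training code.

```python
from collections import Counter

def raven_weights(tokens, lam=0.25, w_min=0.02):
    """Per-token loss weights w = max(lam ** count, w_min).

    Assumption: `count` is how many times the same token has already
    occurred earlier in this sequence (0 for a first occurrence).
    """
    seen = Counter()
    weights = []
    for tok in tokens:
        weights.append(max(lam ** seen[tok], w_min))
        seen[tok] += 1
    return weights

# A patient with a repeated dispensing code "A" and one new event "B".
print(raven_weights(["A", "A", "A", "A", "B"]))
```

With λ = 0.25 the floor is already reached at the third repeat (0.25³ ≈ 0.016 < 0.02), so long runs of repeats all contribute at w_min.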
+## Headline result — new-onset prediction (the reviewer-proof metric)
+All on the held-out test split (6,374 patients), 95% bootstrap CI, on the
+new-onset subset (first occurrence, repeats excluded; n = 1,730 positions).
+| Model | new-onset Top-1 | new-onset Top-5 | vs frequency baseline |
+|---|---:|---:|---|
+| Frequency baseline | 38.2% | 90.8% | — |
+| GEMEO/SUS v4 (no recurrence weighting) | 23.6% | 98.4% | **loses** (−14.6 pp) |
+| **GEMEO/SUS v6 (RAVEN λ=0.25)** | **60.1% [57.8, 62.3]** | **98.7%** | **wins (+21.9 pp)** |
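+**+36.5 pp over v4 and +21.9 pp over the frequency baseline.** The v6 95% CI [57.8, 62.3] does not overlap the v4 CI [21.6, 25.6] and lies well above the baseline point estimate (38.2%).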
+This is the decisive evidence that recurrence-aware training defeats the
+autocorrelation trap: on genuinely novel events, the model now beats trivial
+baselines by a wide, statistically clear margin.
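How the 95% bootstrap interval is computed is not shown in this section. A percentile bootstrap over per-position hit indicators is one common choice and is sketched here as an assumption; the function name, resample count, and toy data are illustrative only.

```python
import random

def bootstrap_ci(hits, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of 0/1 hit indicators."""
    rng = random.Random(seed)
    n = len(hits)
    means = []
    for _ in range(n_boot):
        sample = rng.choices(hits, k=n)  # resample positions with replacement
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return sum(hits) / n, lo, hi

# Toy data: 1,040 hits out of 1,730 positions (about 60.1% Top-1).
hits = [1] * 1040 + [0] * 690
point, lo, hi = bootstrap_ci(hits)
print(f"{point:.1%} [{lo:.1%}, {hi:.1%}]")
```

Resampling positions treats them as independent. Resampling whole patients would be the more conservative choice when one patient contributes several positions.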
+## Architecture context — GEMEO v2.0
+GEMEO v2.0 is a three-pillar architecture (Propose → Simulate → Verify):
+- **Pillar A — Graph Proposer:** KG zero-shot link prediction emits first-onset candidates the patient has never had (TxGNN / PhenoKG style). *Scoped.*
+- **Pillar B — World-Model Scorer (this model):** Diffusion-Forcing trunk + **RAVEN recurrence-weighted loss (Principle 7, proven here)** + competing-risks hazard head (Principle 8, scoped). *Implemented.*
+- **Pillar C — Swarm Verifier:** multi-agent debate with KG evidence paths (DeepRare-style). *Scoped.*
+GEMEO v2.0 targets **Level 3 (counterfactual rollout)** on the clinical-world-model capability rubric of Liu et al. (NeurIPS 2025, arXiv 2511.16333) and is engineered to close the four gaps that survey identifies (under-specified action spaces, weak interventional validation, incomplete multimodal state, limited calibration). Full spec: [`gemeo-arch/gemeo_architecture_spec_v2.md`](https://huggingface.co/Raras-AI/gemeo-arch).
+## Training recipe
+- Warm-start from `gemeo-sus-v4` (positional features).
+- RAVEN history-decay loss: λ = 0.25, w_min = 0.02, applied to the per-token
+  diffusion-forcing cross-entropy on corrupted positions.
+- 6 epochs, WSD LR (peak 2e-4), bf16, single H100, ~6 min, ≈ $0.50.
+- 19.97M params (same architecture as v4).
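The schedule itself lives in `src/wsd_scheduler.py`, which is not shown in this section. As a rough illustration of what a warmup-stable-decay (WSD) schedule looks like, the sketch below uses linear warmup, a constant plateau at the peak rate, and a linear decay. The phase fractions and the function name are assumptions, not the repository's settings.

```python
def wsd_lr(step, total_steps, peak_lr=2e-4, warmup_frac=0.05,
           decay_frac=0.2, min_lr=0.0):
    """Warmup-stable-decay learning rate at `step` (0-indexed)."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    decay_steps = max(1, int(total_steps * decay_frac))
    decay_start = total_steps - decay_steps
    if step < warmup_steps:                        # linear warmup to the peak
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:                         # stable plateau
        return peak_lr
    progress = (step - decay_start) / decay_steps  # decay phase, 0 -> 1
    return peak_lr + (min_lr - peak_lr) * min(1.0, progress)

schedule = [wsd_lr(s, total_steps=1000) for s in range(1000)]
print(schedule[0], max(schedule), schedule[-1])
```

The long plateau is the point of the design: most steps run at the peak rate, and only the final fraction anneals toward `min_lr`.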
+## Usage
+```python
+import sys
+
+import torch
+import torch.nn as nn
+
+sys.path.append("src")  # make the repo modules importable without installing
+from diffusion_forcing_v13 import CDFv13Config, CDFv13Transformer
+
+class PositionalFeatureEmbed(nn.Module):
+    """Embeds age, calendar year, and sequence position into d features."""
+    def __init__(self, d):
+        super().__init__()
+        self.age_proj = nn.Linear(1, d // 4)
+        self.year_proj = nn.Linear(1, d // 4)
+        self.pos_proj = nn.Linear(1, d // 4)
+        self.combine = nn.Linear(3 * (d // 4), d)
+        self.norm = nn.LayerNorm(d)
+    def forward(self, ages, years, positions):
+        # Scale each feature to [0, 1] before projecting.
+        a = ages.clamp(0, 100) / 100
+        y = (years - 2010).clamp(0, 20) / 20
+        p = (positions / 512).clamp(0, 1)
+        e = torch.cat([self.age_proj(a.unsqueeze(-1)), self.year_proj(y.unsqueeze(-1)),
+                       self.pos_proj(p.unsqueeze(-1))], -1)
+        return self.norm(self.combine(e))
+
+# weights_only=False runs the full unpickler, so load only trusted checkpoints.
+ck = torch.load("cdf_v6_raven.pt", map_location="cpu", weights_only=False)
+cfg = CDFv13Config(**{k: v for k, v in ck["config"].items()
+                      if k in CDFv13Config.__dataclass_fields__})
+model = CDFv13Transformer(cfg)
+model.load_state_dict(ck["model_state"])
+pfe = PositionalFeatureEmbed(cfg.d_model)
+pfe.load_state_dict(ck["pos_feat_state"])
+print(f"GEMEO/SUS v6 (RAVEN λ={ck['raven_lambda']}) — new-onset Top-1 60.1%")
+```
+## Honest scope
+- ✅ **Proven on 52k SUS:** recurrence-aware training defeats autocorrelation (this model).
+- 🔜 **Scoped, feasible on SUS:** KG zero-shot onset proposer (Pillar A), swarm verifier (Pillar C), competing-risks hazard head (Principle 8).
+- 🏥 **Requires Mayo multimodal substrate:** rigorous counterfactual/interventional validation (labs, genomics, imaging, dense timing).
+## Citation
+```bibtex
+@misc{gemeo_sus_v6_2026,
+  title  = {GEMEO/SUS v6: Recurrence-Aware Patient World Model for
+            New-Onset Prediction in Rare Disease},
+  author = {Verdial, Dimas Quintas and the Raras AI team},
+  year   = {2026},
+  url    = {https://huggingface.co/Raras-AI/gemeo-sus-v6},
+  note   = {GEMEO Architecture v2.0, Principle 7. Beats frequency baseline
+            on new-onset (60.1% vs 38.2% Top-1).}
+}
+```
+⚠️ **Research only.** Not a medical device. No clinical use without physician oversight and applicable regulatory clearance.

benchmarks/v4_newonset_eval.json ADDED

	@@ -0,0 +1,81 @@

+{
+  "gapfill_all": {
+    "top1": [
+      0.8634692246203037,
+      0.8600839328537171,
+      0.8667732480682121
+    ],
+    "top5": [
+      0.9987743138822276,
+      0.9983746336264322,
+      0.9990947242206235
+    ],
+    "n": 37530
+  },
+  "gapfill_newonset": {
+    "top1": [
+      0.23641618497109826,
+      0.21560693641618497,
+      0.2560693641618497
+    ],
+    "top5": [
+      0.9838150289017341,
+      0.9780346820809248,
+      0.9895953757225433
+    ],
+    "top10": [
+      1.0,
+      1.0,
+      1.0
+    ],
+    "n": 1730
+  },
+  "gapfill_newonset_baseline": {
+    "top1": 0.3815028901734104,
+    "top5": 0.9080924855491329,
+    "top10": 0.9838150289017341,
+    "n": 1730
+  },
+  "multi_horizon_newonset": {
+    "1": {
+      "n": 83,
+      "top5": [
+        0.5903614457831325,
+        0.4819277108433735,
+        0.6987951807228916
+      ],
+      "macro_auroc": 0.5721330802852541,
+      "n_classes": 2
+    },
+    "10": {
+      "n": 39,
+      "top5": [
+        0.7692307692307693,
+        0.6410256410256411,
+        0.8974358974358975
+      ],
+      "macro_auroc": 0.7896551724137931,
+      "n_classes": 1
+    },
+    "25": {
+      "n": 10,
+      "top5": [
+        0.3,
+        0.0,
+        0.6
+      ],
+      "macro_auroc": NaN,
+      "n_classes": 0
+    },
+    "50": {
+      "n": 1,
+      "top5": [
+        0.0,
+        0.0,
+        0.0
+      ],
+      "macro_auroc": NaN,
+      "n_classes": 0
+    }
+  }
+}
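Judging from the README numbers, each three-element list in this file appears to be `[point estimate, CI lower, CI upper]`; that reading is an inference, not something the file states. The file also contains bare `NaN` values, which strict JSON parsers reject but Python's `json` module accepts by default. A minimal sketch of reading such a payload follows (the helper name `fmt_ci` is hypothetical, and the payload is a trimmed copy of the values above):

```python
import json
import math

payload = """
{
  "gapfill_newonset": {
    "top1": [0.23641618497109826, 0.21560693641618497, 0.2560693641618497],
    "n": 1730
  },
  "multi_horizon_newonset": {
    "25": {"n": 10, "macro_auroc": NaN, "n_classes": 0}
  }
}
"""

def fmt_ci(triple):
    """Format [point, lower, upper] fractions as 'P% [L, U]'."""
    point, lower, upper = (100 * v for v in triple)
    return f"{point:.1f}% [{lower:.1f}, {upper:.1f}]"

data = json.loads(payload)  # Python's json parses NaN as float('nan')
print(fmt_ci(data["gapfill_newonset"]["top1"]))

auroc = data["multi_horizon_newonset"]["25"]["macro_auroc"]
print("AUROC undefined" if math.isnan(auroc) else f"AUROC {auroc:.3f}")
```

Consumers in other languages may need the `NaN` entries rewritten as `null` before parsing.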

benchmarks/v6_raven_newonset.json ADDED

	@@ -0,0 +1,30 @@

+{
+  "raven_lambda": 0.25,
+  "newonset_model": {
+    "top1": [
+      0.6011560693641619,
+      0.5780202312138729,
+      0.623121387283237
+    ],
+    "top5": [
+      0.9867052023121388,
+      0.9809248554913295,
+      0.991907514450867
+    ],
+    "top10": [
+      1.0,
+      1.0,
+      1.0
+    ],
+    "n": 1730
+  },
+  "newonset_freq_baseline": {
+    "top1": 0.3815028901734104,
+    "top5": 0.9080924855491329
+  },
+  "v4_reference": {
+    "newonset_top1": 0.236,
+    "newonset_baseline_top1": 0.382
+  },
+  "verdict": "BEATS baseline"
+}
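The percentage-point gaps can be recomputed directly from the unrounded values in the two benchmark files. The sketch below only does arithmetic on numbers copied from the JSON above; the helper name `pp` is hypothetical.

```python
v6_top1 = 0.6011560693641619        # benchmarks/v6_raven_newonset.json
v4_top1 = 0.23641618497109826       # benchmarks/v4_newonset_eval.json
baseline_top1 = 0.3815028901734104  # frequency baseline, same in both files

def pp(a, b):
    """Difference a - b in percentage points, rounded to one decimal."""
    return round(100 * (a - b), 1)

print("v6 vs baseline:", pp(v6_top1, baseline_top1))
print("v6 vs v4:      ", pp(v6_top1, v4_top1))
print("v4 vs baseline:", pp(v4_top1, baseline_top1))
```

Differences taken from unrounded values can differ by about 0.1 pp from differences of the rounded percentages quoted in the README (for example 60.1 − 38.2 = 21.9).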

cdf_v6_raven.pt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:428534584c3f995bfd4af52d59d16572600dcc0f4e0e844f28a838f57d73caa7
+size 79932134

src/__init__.py ADDED

	@@ -0,0 +1,33 @@

+"""GEMEO-CDF: Causal Diffusion Forcing for clinical trajectories.
+Three "first in medicine" hooks:
+  1. DIFFUSION FORCING (Chen MIT NeurIPS 2024 → Dreamer 4 Hafner 2025 backbone)
+     — independent per-token noise levels unify AR + diffusion + counterfactual
+       in ONE loss. Zero clinical port as of May 2026.
+  2. LATENT ACTION MODEL (Genie / DeepMind 2024)
+     — VQ-VAE codebook over (state_t, state_{t+1}) deltas discovers a
+       treatment vocabulary without RxNorm/ATC labels. Solves the APAC
+       miscoding / sparsity / off-label labelling pain in DATASUS.
+  3. PROCESS REWARD VERIFIER (o3 / MAI-DxO 2025 pattern)
+     — small PRM scores top-K rollouts at inference, returns top-1 +
+       uncertainty band. Deliberative trajectory generation, novel in EHR.
+Modules:
+  diffusion_forcing.py  — core architecture (per-token noise + block-causal)
+  lam.py                — Latent Action Model (VQ-VAE codebook)
+  train_cdf.py          — training loop with diffusion forcing objective
+  sample.py             — sampling: AR mode / denoise mode / counterfactual
+  distill.py            — Shortcut Forcing distillation (Dreamer 4)
+  prm.py                — Process Reward Verifier
+"""
+from .diffusion_forcing import CDFTransformer, CDFConfig
+from .lam import LatentActionVQVAE, LAMConfig
+from .train_cdf import train_cdf
+__all__ = [
+    "CDFTransformer", "CDFConfig",
+    "LatentActionVQVAE", "LAMConfig",
+    "train_cdf",
+]

src/adaln_zero.py ADDED

	@@ -0,0 +1,148 @@

+"""AdaLN-Zero conditioning module (DiT-style, Peebles 2023).
+Used in: DiT (ICCV 2023), Stable Diffusion 3 (Esser 2024), Sora, Lumina-Next,
+PixArt-Sigma. Standard for diffusion conditioning in 2025-2026.
+Why for Diffusion Forcing on EHR:
+  - Per-token sigma + global cond/action → per-token (scale, shift, gate)
+  - Gates init to zero ⇒ block starts as identity ⇒ no catastrophic init
+  - Much better CFG (dropped condition path goes through zero gates,
+    not corrupting residual stream)
+  - DFoT (Diffusion Forcing Transformer 2, ICLR 2026) confirms +3-8% win
+We fuse THREE conditioning signals:
+  - sigma (B, T) per-token noise level → time_emb (B, T, D)
+  - cond  (B,)   cohort-level treatment id → cond_emb (B, D) → broadcast
+  - action(B, T) per-token latent action id → action_emb (B, T, D)
+Combined into c_t (B, T, D) → ConditioningMLP → 6 modulation tensors
+per block. Each block uses them as:
+    h = x + gate_msa * Attn((1 + scale_msa) * Norm(x) + shift_msa)
+    h = h + gate_mlp * MLP((1 + scale_mlp) * Norm(h) + shift_mlp)
+"""
+from __future__ import annotations
+import torch
+import torch.nn as nn
+class AdaLNZeroModulator(nn.Module):
+    """Generates per-token (scale, shift, gate) for AdaLN-Zero block.
+    Input: fused conditioning vector c (B, T, d_model).
+    Output: 6 tensors of shape (B, T, d_model) each:
+      (scale_msa, shift_msa, gate_msa, scale_mlp, shift_mlp, gate_mlp)
+    """
+    def __init__(self, d_model: int):
+        super().__init__()
+        self.modulator = nn.Sequential(
+            nn.SiLU(),
+            nn.Linear(d_model, 6 * d_model, bias=True),
+        )
+        # AdaLN-Zero trick: zero-init the whole projection, so scale, shift and
+        # gate all start at 0 and the residual branches contribute nothing at init.
+        nn.init.zeros_(self.modulator[-1].weight)
+        nn.init.zeros_(self.modulator[-1].bias)
+    def forward(self, c: torch.Tensor) -> tuple[torch.Tensor, ...]:
+        # c: (B, T, d_model)
+        out = self.modulator(c)  # (B, T, 6*d_model)
+        return out.chunk(6, dim=-1)
+class AdaLNZeroBlock(nn.Module):
+    """Transformer block with AdaLN-Zero modulation.
+    Drop-in replacement for the standard pre-norm block. Reads
+    pre-computed modulation tensors and applies them around Attn + MLP.
+    """
+    def __init__(self, d_model: int, n_heads: int, ffn: int, dropout: float,
+                 rope=None, kg_xattn=None):
+        super().__init__()
+        self.d_model = d_model
+        self.n_heads = n_heads
+        self.head_dim = d_model // n_heads
+        self.rope = rope
+        self.kg_xattn = kg_xattn
+        self.norm1 = nn.LayerNorm(d_model, elementwise_affine=False)
+        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
+        self.proj = nn.Linear(d_model, d_model, bias=False)
+        self.norm2 = nn.LayerNorm(d_model, elementwise_affine=False)
+        self.mlp = nn.Sequential(
+            nn.Linear(d_model, ffn, bias=False),
+            nn.GELU(),
+            nn.Linear(ffn, d_model, bias=False),
+        )
+        self.dropout = nn.Dropout(dropout)
+    def forward(self, x: torch.Tensor, attn_mask: torch.Tensor,
+                scale_msa, shift_msa, gate_msa,
+                scale_mlp, shift_mlp, gate_mlp,
+                kg_ctx: torch.Tensor | None = None) -> torch.Tensor:
+        import torch.nn.functional as F
+        B, T, D = x.shape
+        # MSA branch
+        h = self.norm1(x) * (1 + scale_msa) + shift_msa
+        qkv = self.qkv(h).reshape(B, T, 3, self.n_heads, self.head_dim)
+        q, k, v = qkv.permute(2, 0, 3, 1, 4).unbind(0)
+        if self.rope is not None:
+            q, k = self.rope(q, k, T)
+        out = F.scaled_dot_product_attention(
+            q, k, v,
+            attn_mask=(~attn_mask).float().masked_fill(attn_mask, float("-inf"))[None, None],
+            dropout_p=self.dropout.p if self.training else 0.0,
+        )
+        out = out.transpose(1, 2).reshape(B, T, D)
+        x = x + gate_msa * self.dropout(self.proj(out))
+        # KG cross-attention (between MSA and MLP)
+        if self.kg_xattn is not None and kg_ctx is not None:
+            x = self.kg_xattn(x, kg_ctx)
+        # MLP branch
+        h = self.norm2(x) * (1 + scale_mlp) + shift_mlp
+        x = x + gate_mlp * self.dropout(self.mlp(h))
+        return x
+class FusedConditioner(nn.Module):
+    """Fuse (sigma, cond, action) into one per-token conditioning vector.
+    Output (B, T, d_model) consumed by AdaLNZeroModulator per layer.
+    """
+    def __init__(self, d_model: int, n_conditions: int, n_actions: int,
+                 use_action: bool = True):
+        super().__init__()
+        self.d_model = d_model
+        self.use_action = use_action
+        # MLP applied on top of the sinusoidal sigma features (see `sinusoidal`)
+        self.sigma_proj = nn.Sequential(
+            nn.Linear(d_model, d_model), nn.SiLU(), nn.Linear(d_model, d_model),
+        )
+        self.cond_emb = nn.Embedding(n_conditions, d_model)
+        if use_action:
+            self.action_emb = nn.Embedding(n_actions + 1, d_model)
+        self.fuse = nn.Sequential(
+            nn.SiLU(),
+            nn.Linear(d_model, d_model),
+        )
+    def sinusoidal(self, sigma: torch.Tensor) -> torch.Tensor:
+        import math
+        half = self.d_model // 2
+        freqs = torch.exp(
+            -math.log(10000.0) * torch.arange(half, device=sigma.device) / half
+        )
+        ang = sigma.float().unsqueeze(-1) * freqs
+        emb = torch.cat([torch.sin(ang), torch.cos(ang)], dim=-1)
+        return self.sigma_proj(emb)
+    def forward(self, sigma: torch.Tensor, cond: torch.Tensor,
+                action: torch.Tensor | None = None) -> torch.Tensor:
+        # sigma (B, T) → time_emb (B, T, D)
+        time_emb = self.sinusoidal(sigma)
+        # cond (B,) → (B, D) → broadcast to (B, T, D)
+        cond_emb = self.cond_emb(cond).unsqueeze(1).expand_as(time_emb)
+        fused = time_emb + cond_emb
+        if self.use_action and action is not None:
+            fused = fused + self.action_emb(action)
+        return self.fuse(fused)
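To see why zero-initialised modulation makes a block start as the identity, here is a dependency-free sketch of the gated residual update for a single token. It mirrors the `(1 + scale) * Norm(x) + shift` form used in `AdaLNZeroBlock.forward`; the `branch` function stands in for attention or the MLP and is purely illustrative.

```python
import math

def layer_norm(x, eps=1e-5):
    """LayerNorm without learnable affine parameters."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def adaln_zero_update(x, scale, shift, gate, branch):
    """x + gate * branch((1 + scale) * Norm(x) + shift), element-wise."""
    h = [n * (1 + s) + b for n, s, b in zip(layer_norm(x), scale, shift)]
    return [xi + g * o for xi, g, o in zip(x, gate, branch(h))]

def branch(h):
    # Stand-in for Attn/MLP: any function of the modulated input.
    return [2.0 * v + 1.0 for v in h]

x = [0.5, -1.0, 2.0, 0.0]
zeros = [0.0] * len(x)

# Zero-initialised modulator: scale = shift = gate = 0, so the update is the identity.
print(adaln_zero_update(x, zeros, zeros, zeros, branch) == x)

# Once the gate moves away from zero, the branch starts to contribute.
print(adaln_zero_update(x, zeros, zeros, [0.1] * len(x), branch))
```

Only the gate needs to be zero for the identity property; zeroing scale and shift as well simply means the first nonzero updates act on the plain normalised input.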

src/diffusion_forcing_v13.py ADDED

	@@ -0,0 +1,314 @@

+"""GEMEO-CDF v13 — audit-driven Chinchilla-correct architecture.
+Per the SOTA audit (May 2026):
+  - Path B (CLMBR fine-tune) BLOCKED: CLMBR-T-base is HF-gated (manual approval)
+  - Path A adopted: small from-scratch model + KG adapters + MEDS interop
+Architecture:
+  - 12M backbone params (Chinchilla-respecting for ~20M token corpus)
+  - d_model=384, n_layers=8, n_heads=6, ffn=1024, ctx=512
+  - SwiGLU MLP (ffn:d_model = 2.67)
+  - Tied embeddings (saves ~12M at vocab=32k)
+  - Dropout 0.1 everywhere (small-data critical)
+  - Block-causal attention (Diffusion Forcing)
+  - Per-token sigma noise (independent)
+  - GATED KG cross-attention (tanh(α)·xattn, α init=0)
+    - Layers 4, 6, 7 (3 of 8)
+    - Lets model learn to use KG progressively, doesn't disrupt early loss
+  - DF objective + LM-aux loss (joint training, paper-grade)
+Sources audited:
+  - CoMET (Aug 2025): tokens-per-param ratio
+  - CLMBR (Stanford): adapter pattern for cross-site transfer
+  - MDLM (Sahoo 2024): masked diffusion, matches AR at equal FLOPs
+  - Genie (DeepMind 2024): gated cross-attention pattern
+  - SD3 (Esser 2024): AdaLN-Zero zero-init gates
+"""
+from __future__ import annotations
+import math
+from dataclasses import dataclass, field
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+@dataclass
+class CDFv13Config:
+    # Vocab + sequence
+    vocab_size: int = 32768       # MEDS-derived (will be much smaller in practice)
+    mask_token: int = 32767
+    max_seq_len: int = 512
+    block_size: int = 16
+    # Architecture (Chinchilla-correct for ~20M tokens)
+    d_model: int = 384
+    n_heads: int = 6
+    n_layers: int = 8
+    ffn: int = 1024                # SwiGLU hidden width; gate and up projections each map d_model -> ffn
+    dropout: float = 0.1
+    emb_dropout: float = 0.1
+    use_swiglu: bool = True
+    use_rmsnorm: bool = True
+    tie_embeddings: bool = True
+    # Diffusion forcing
+    cond_dropout: float = 0.10
+    # KG conditioning (GATED adapters)
+    use_kg: bool = True
+    kg_dim: int = 3072
+    kg_attn_layers: list = field(default_factory=lambda: [4, 6, 7])
+    # Latent action
+    use_latent_action: bool = False  # Dropped per audit (concept shaky)
+    n_latent_actions: int = 512
+    # Conditioning
+    n_conditions: int = 64
+class RMSNorm(nn.Module):
+    """Root-mean-square LayerNorm (LLaMA/Mistral style)."""
+    def __init__(self, d: int, eps: float = 1e-6):
+        super().__init__()
+        self.weight = nn.Parameter(torch.ones(d))
+        self.eps = eps
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
+        norm = x.float() * torch.rsqrt(x.float().pow(2).mean(-1, keepdim=True) + self.eps)
+        return (norm * self.weight.float()).to(x.dtype)
+class SwiGLU(nn.Module):
+    """SwiGLU MLP (used in LLaMA/Gemma/Mistral)."""
+    def __init__(self, d_in: int, d_hidden: int, dropout: float = 0.1):
+        super().__init__()
+        self.w_gate = nn.Linear(d_in, d_hidden, bias=False)
+        self.w_up = nn.Linear(d_in, d_hidden, bias=False)
+        self.w_down = nn.Linear(d_hidden, d_in, bias=False)
+        self.dropout = nn.Dropout(dropout)
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
+        return self.dropout(self.w_down(F.silu(self.w_gate(x)) * self.w_up(x)))
+class RotaryEmbedding(nn.Module):
+    """RoPE (Su et al. 2021)."""
+    def __init__(self, dim: int, max_seq: int = 8192, base: float = 10000.0):
+        super().__init__()
+        inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
+        t = torch.arange(max_seq).float()
+        freqs = torch.einsum("i,j->ij", t, inv_freq)
+        emb = torch.cat([freqs, freqs], dim=-1)
+        self.register_buffer("cos", emb.cos(), persistent=False)
+        self.register_buffer("sin", emb.sin(), persistent=False)
+    def forward(self, q, k, seq_len):
+        cos = self.cos[:seq_len].to(q.dtype).to(q.device)
+        sin = self.sin[:seq_len].to(q.dtype).to(q.device)
+        def rot_half(x):
+            x1, x2 = x.chunk(2, dim=-1)
+            return torch.cat([-x2, x1], dim=-1)
+        return (q * cos) + (rot_half(q) * sin), (k * cos) + (rot_half(k) * sin)
+class PerTokenSigmaEmbed(nn.Module):
+    """Sinusoidal embedding of per-position diffusion noise sigma in [0,1]."""
+    def __init__(self, d: int):
+        super().__init__()
+        self.d = d
+        self.proj = nn.Sequential(
+            nn.Linear(d, d), nn.SiLU(), nn.Linear(d, d),
+        )
+    def forward(self, sigma: torch.Tensor) -> torch.Tensor:
+        half = self.d // 2
+        freqs = torch.exp(
+            -math.log(10000.0) * torch.arange(half, device=sigma.device) / half
+        )
+        ang = sigma.float().unsqueeze(-1) * freqs
+        emb = torch.cat([torch.sin(ang), torch.cos(ang)], dim=-1)
+        return self.proj(emb)
+class GatedKGCrossAttention(nn.Module):
+    """Cross-attention to KG ego-subgraph, with GATED output.
+    `tanh(alpha) * cross_attn(x_seq, x_kg)` where alpha is a learnable scalar
+    initialized to 0. This means at init the cross-attention contributes
+    NOTHING to the residual stream, so the model trains identically to
+    no-KG until it discovers KG is useful. Prevents catastrophic loss
+    spikes on small data.
+    Pattern from: Genie (DeepMind 2024), Flamingo (DeepMind 2022).
+    """
+    def __init__(self, d_model: int, kg_dim: int, n_heads: int = 8, dropout: float = 0.1):
+        super().__init__()
+        self.n_heads = n_heads
+        self.head_dim = d_model // n_heads
+        # Project KG to d_model (run inline so we don't need separate KGProjector module)
+        self.kg_in_proj = nn.Linear(kg_dim, d_model, bias=False)
+        self.q_proj = nn.Linear(d_model, d_model, bias=False)
+        self.kv_proj = nn.Linear(d_model, 2 * d_model, bias=False)
+        self.out_proj = nn.Linear(d_model, d_model, bias=False)
+        self.norm_q = RMSNorm(d_model)
+        self.norm_kv = RMSNorm(d_model)
+        self.dropout = nn.Dropout(dropout)
+        # Gate (scalar per block, init=0)
+        self.alpha = nn.Parameter(torch.zeros(1))
+    def forward(self, x_seq: torch.Tensor, kg_raw: torch.Tensor) -> torch.Tensor:
+        """
+        x_seq: (B, T, d_model)
+        kg_raw: (B, N_kg, kg_dim)   -- raw KG embeddings (e.g. 3072)
+        """
+        B, T, D = x_seq.shape
+        kg_proj = self.kg_in_proj(kg_raw)  # (B, N_kg, D)
+        N_kg = kg_proj.size(1)
+        q = self.q_proj(self.norm_q(x_seq))
+        kv = self.kv_proj(self.norm_kv(kg_proj))
+        k, v = kv.chunk(2, dim=-1)
+        q = q.reshape(B, T, self.n_heads, self.head_dim).transpose(1, 2)
+        k = k.reshape(B, N_kg, self.n_heads, self.head_dim).transpose(1, 2)
+        v = v.reshape(B, N_kg, self.n_heads, self.head_dim).transpose(1, 2)
+        out = F.scaled_dot_product_attention(
+            q, k, v, dropout_p=self.dropout.p if self.training else 0.0)
+        out = out.transpose(1, 2).reshape(B, T, D)
+        gate = torch.tanh(self.alpha)
+        return x_seq + gate * self.dropout(self.out_proj(out))
+class CDFv13Block(nn.Module):
+    """Pre-norm transformer block + optional gated KG cross-attn."""
+    def __init__(self, cfg: CDFv13Config, rope: RotaryEmbedding,
+                 layer_idx: int):
+        super().__init__()
+        self.cfg = cfg
+        self.rope = rope
+        self.layer_idx = layer_idx
+        norm_cls = RMSNorm if cfg.use_rmsnorm else nn.LayerNorm
+        self.norm1 = norm_cls(cfg.d_model)
+        self.norm2 = norm_cls(cfg.d_model)
+        self.qkv = nn.Linear(cfg.d_model, 3 * cfg.d_model, bias=False)
+        self.proj = nn.Linear(cfg.d_model, cfg.d_model, bias=False)
+        if cfg.use_swiglu:
+            self.mlp = SwiGLU(cfg.d_model, cfg.ffn, cfg.dropout)
+        else:
+            self.mlp = nn.Sequential(
+                nn.Linear(cfg.d_model, cfg.ffn, bias=False),
+                nn.GELU(),
+                nn.Linear(cfg.ffn, cfg.d_model, bias=False),
+                nn.Dropout(cfg.dropout),
+            )
+        self.dropout = nn.Dropout(cfg.dropout)
+        self.head_dim = cfg.d_model // cfg.n_heads
+        # Gated KG cross-attention (only in specified layers)
+        self.use_kg_in_layer = cfg.use_kg and layer_idx in cfg.kg_attn_layers
+        if self.use_kg_in_layer:
+            self.kg_xattn = GatedKGCrossAttention(
+                cfg.d_model, cfg.kg_dim, cfg.n_heads, cfg.dropout)
+    def forward(self, x, attn_mask, kg_raw=None):
+        B, T, D = x.shape
+        # MSA
+        h = self.norm1(x)
+        qkv = self.qkv(h).reshape(B, T, 3, self.cfg.n_heads, self.head_dim)
+        q, k, v = qkv.permute(2, 0, 3, 1, 4).unbind(0)
+        q, k = self.rope(q, k, T)
+        out = F.scaled_dot_product_attention(
+            q, k, v,
+            attn_mask=(~attn_mask).float().masked_fill(attn_mask, float("-inf"))[None, None],
+            dropout_p=self.cfg.dropout if self.training else 0.0,
+        )
+        out = out.transpose(1, 2).reshape(B, T, D)
+        x = x + self.dropout(self.proj(out))
+        # Gated KG cross-attn (if enabled at this layer)
+        if self.use_kg_in_layer and kg_raw is not None:
+            x = self.kg_xattn(x, kg_raw)
+        # MLP
+        x = x + self.mlp(self.norm2(x))
+        return x
+class CDFv13Transformer(nn.Module):
+    """Audit-compliant CDF v13: 12M backbone + KG adapters + DF objective."""
+    def __init__(self, cfg: CDFv13Config | None = None):
+        super().__init__()
+        self.cfg = cfg or CDFv13Config()
+        c = self.cfg
+        norm_cls = RMSNorm if c.use_rmsnorm else nn.LayerNorm
+        self.tok_emb = nn.Embedding(c.vocab_size, c.d_model)
+        self.emb_dropout = nn.Dropout(c.emb_dropout)
+        # Per-token sigma embedding (additive)
+        self.sigma_emb = PerTokenSigmaEmbed(c.d_model)
+        # Global condition embedding (additive, broadcast)
+        self.cond_emb = nn.Embedding(c.n_conditions, c.d_model)
+        # RoPE
+        self.rope = RotaryEmbedding(c.d_model // c.n_heads, max_seq=c.max_seq_len * 2)
+        # Blocks
+        self.blocks = nn.ModuleList([
+            CDFv13Block(c, self.rope, layer_idx=i) for i in range(c.n_layers)
+        ])
+        self.final_norm = norm_cls(c.d_model)
+        self.head = nn.Linear(c.d_model, c.vocab_size, bias=False)
+        if c.tie_embeddings:
+            self.head.weight = self.tok_emb.weight
+        # Block-causal mask buffer
+        T = c.max_seq_len
+        block_id = torch.arange(T) // c.block_size
+        mask = block_id.unsqueeze(0) < block_id.unsqueeze(1)
+        self.register_buffer("block_mask", mask, persistent=False)
+        # Init
+        self.apply(self._init_weights)
+    def _init_weights(self, m):
+        if isinstance(m, nn.Linear):
+            nn.init.normal_(m.weight, mean=0.0, std=0.02)
+            if m.bias is not None: nn.init.zeros_(m.bias)
+        elif isinstance(m, nn.Embedding):
+            nn.init.normal_(m.weight, mean=0.0, std=0.02)
+    def forward(self, x, sigma, cond, kg_raw=None):
+        B, T = x.shape
+        h = self.tok_emb(x) + self.sigma_emb(sigma) + self.cond_emb(cond).unsqueeze(1)
+        h = self.emb_dropout(h)
+        mask = self.block_mask[:T, :T]
+        for blk in self.blocks:
+            h = blk(h, mask, kg_raw=kg_raw)
+        h = self.final_norm(h)
+        return self.head(h)
+    def diffusion_forcing_loss(self, x_clean, cond, kg_raw=None,
+                                mode: str = "uniform") -> torch.Tensor:
+        """Standard absorbing-state DF loss with per-token sigma.
+        mode: 'uniform' (default — safer for discrete than logit-normal per audit)
+              'logit_normal' (SD3-style — keep as ablation only)
+        """
+        B, T = x_clean.shape
+        device = x_clean.device
+        # CFG cond dropout
+        drop = torch.rand(B, device=device) < self.cfg.cond_dropout
+        cond = torch.where(drop, torch.zeros_like(cond), cond)
+        if kg_raw is not None:
+            drop_kg = (torch.rand(B, device=device) < self.cfg.cond_dropout).float()
+            kg_raw = kg_raw * (1 - drop_kg).reshape(B, 1, 1)
+        # Sample per-token sigma
+        if mode == "logit_normal":
+            sigma = torch.sigmoid(torch.randn(B, T, device=device)).clamp(0.01, 0.99)
+        else:
+            sigma = torch.rand(B, T, device=device).clamp(0.01, 0.99)
+        # Absorbing-state corruption
+        corrupt = torch.rand(B, T, device=device) < sigma
+        x_noisy = torch.where(corrupt, self.cfg.mask_token, x_clean)
+        logits = self.forward(x_noisy, sigma, cond, kg_raw=kg_raw)
+        ce = F.cross_entropy(
+            logits.reshape(-1, self.cfg.vocab_size),
+            x_clean.reshape(-1),
+            reduction="none",
+        ).reshape(B, T)
+        n = corrupt.float().sum().clamp(min=1.0)
+        return (ce * corrupt.float()).sum() / n
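The `block_mask` buffer is easier to reason about on a tiny example. The sketch below rebuilds it in plain Python for a sequence of 6 positions with a block size of 2, using the same comparison as `CDFv13Transformer.__init__` (`block_id[j] < block_id[i]` for query row i and key column j). The helper name `build_block_mask` is hypothetical.

```python
def build_block_mask(seq_len, block_size):
    """mask[i][j] is True when key j sits in an earlier block than query i."""
    block_id = [t // block_size for t in range(seq_len)]
    return [[block_id[j] < block_id[i] for j in range(seq_len)]
            for i in range(seq_len)]

mask = build_block_mask(seq_len=6, block_size=2)
for row in mask:
    # "X" marks True entries, "." marks False entries.
    print("".join("X" if m else "." for m in row))
```

In `CDFv13Block.forward` the `True` entries are the ones filled with `-inf` before the softmax, so it is worth checking on a small case like this that the masked positions are the ones intended for block-causal attention.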

src/eval_sota.py ADDED

	@@ -0,0 +1,290 @@

+"""SOTA evaluation suite for CDFv13 — audit-proof.
+Per the May 2026 SOTA audit, replaces "Top-1 mid-position" (not recognized)
+with the canonical EHR foundation model metric stack:
+  Classification (next-event, downstream tasks):
+    - AUROC + AUPRC + Brier
+    - Calibration: ICI (Austin & Steyerberg 2019)
+    - Decision-curve analysis (Vickers)
+    - Bootstrap 95% CI (≥2000 resamples) — required for rare disease
+  Survival (DATASUS SIM mortality):
+    - Uno's C (concordance_index_ipcw) — preferred over Harrell at high censoring
+    - Integrated Brier Score (1/3/5y)
+    - Time-dependent AUC
+  Counterfactual / causal:
+    - ATE with bootstrap CI
+    - E-value (VanderWeele)
+    - Negative-control outcome + exposure
+    - Tipping-point analysis
+  Generation fidelity (CoMET / SynthEHRella):
+    - Dim-wise probability match
+    - MMD (Maximum Mean Discrepancy) with RBF kernel
+    - TSTR (Train-on-Synthetic-Test-on-Real)
+  Subgroup fairness (npj DM requirement):
+    - Stratified metrics: sex, age band, UF region
+  Split strategy (DATASUS rare disease):
+    - Temporal: train ≤2022, val 2023, test 2024-2025
+    - Geographic: train SE+S, test N+NE (UF cross-region = "external")
+    - Patient-level 5-fold CV (variance estimation)
+"""
+from __future__ import annotations
+import math
+import numpy as np
+import torch
+import torch.nn.functional as F  # needed by dim_wise_probability
+from typing import Callable
+# ---------- Classification ----------
+def auroc(y: np.ndarray, p: np.ndarray) -> float:
+    from sklearn.metrics import roc_auc_score
+    if len(np.unique(y)) < 2: return float("nan")
+    return roc_auc_score(y, p)
+def auprc(y: np.ndarray, p: np.ndarray) -> float:
+    from sklearn.metrics import average_precision_score
+    if len(np.unique(y)) < 2: return float("nan")
+    return average_precision_score(y, p)
+def brier(y: np.ndarray, p: np.ndarray) -> float:
+    from sklearn.metrics import brier_score_loss
+    return brier_score_loss(y, p)
+def ici(y: np.ndarray, p: np.ndarray, frac: float = 0.75) -> float:
+    """Integrated Calibration Index (Austin & Steyerberg 2019).
+    Lowess-smoothed deviation from perfect calibration.
+    """
+    from statsmodels.nonparametric.smoothers_lowess import lowess
+    sm = lowess(y, p, frac=frac, return_sorted=True)
+    return float(np.mean(np.abs(sm[:, 1] - sm[:, 0])))
+def net_benefit(y: np.ndarray, p: np.ndarray, threshold: float) -> float:
+    """Net benefit at a given decision threshold (Vickers DCA)."""
+    tp = ((p >= threshold) & (y == 1)).sum()
+    fp = ((p >= threshold) & (y == 0)).sum()
+    n = len(y)
+    if threshold >= 1.0: return 0.0
+    return tp / n - (fp / n) * (threshold / (1 - threshold))
+def decision_curve(y: np.ndarray, p: np.ndarray,
+                   thresholds: list[float] = None) -> dict:
+    """Decision-curve analysis: net benefit across thresholds vs treat-all/treat-none."""
+    if thresholds is None:
+        thresholds = [0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5]
+    model_nb = [net_benefit(y, p, t) for t in thresholds]
+    treat_all_nb = [(y.mean()) - (1 - y.mean()) * (t / (1 - t)) if t < 1 else 0
+                    for t in thresholds]
+    treat_none_nb = [0.0] * len(thresholds)
+    return {
+        "thresholds": thresholds,
+        "model": model_nb,
+        "treat_all": treat_all_nb,
+        "treat_none": treat_none_nb,
+    }
+def bootstrap_ci(y: np.ndarray, p: np.ndarray, metric_fn: Callable,
+                 n_boot: int = 2000, seed: int = 0,
+                 ci: tuple[float, float] = (2.5, 97.5)) -> tuple[float, float, float]:
+    """Percentile bootstrap CI (95% by default) for any (y, p) -> scalar metric.
+    Returns (lower, median, upper); resamples containing a single class are skipped.
+    """
+    rng = np.random.default_rng(seed)
+    n = len(y)
+    stats = []
+    for _ in range(n_boot):
+        idx = rng.integers(0, n, n)
+        if len(np.unique(y[idx])) < 2: continue
+        try:
+            stats.append(metric_fn(y[idx], p[idx]))
+        except Exception:
+            continue
+    if not stats: return (float("nan"),) * 3
+    return (
+        float(np.percentile(stats, ci[0])),
+        float(np.median(stats)),
+        float(np.percentile(stats, ci[1])),
+    )
+# ---------- Survival ----------
+def uno_c_index(y_train_event, y_train_time, y_test_event, y_test_time,
+                risk_score, tau: float = None) -> float:
+    """Uno's C-index (IPCW concordance), preferred at high censoring.
+    Requires scikit-survival.
+    """
+    try:
+        from sksurv.metrics import concordance_index_ipcw
+    except ImportError:
+        return float("nan")
+    # Build structured arrays
+    y_train = np.array(
+        list(zip(y_train_event.astype(bool), y_train_time.astype(float))),
+        dtype=[("event", "?"), ("time", "<f8")],
+    )
+    y_test = np.array(
+        list(zip(y_test_event.astype(bool), y_test_time.astype(float))),
+        dtype=[("event", "?"), ("time", "<f8")],
+    )
+    if tau is None:
+        tau = float(y_test_time.max()) * 0.95
+    c, *_ = concordance_index_ipcw(y_train, y_test, risk_score, tau=tau)
+    return float(c)
+def integrated_brier_score(y_train_event, y_train_time, y_test_event, y_test_time,
+                            surv_pred: np.ndarray, times: np.ndarray) -> float:
+    """Integrated Brier Score (lower is better)."""
+    try:
+        from sksurv.metrics import integrated_brier_score as ibs_fn
+    except ImportError:
+        return float("nan")
+    y_train = np.array(
+        list(zip(y_train_event.astype(bool), y_train_time.astype(float))),
+        dtype=[("event", "?"), ("time", "<f8")],
+    )
+    y_test = np.array(
+        list(zip(y_test_event.astype(bool), y_test_time.astype(float))),
+        dtype=[("event", "?"), ("time", "<f8")],
+    )
+    return float(ibs_fn(y_train, y_test, surv_pred, times))
+# ---------- Causal / Counterfactual ----------
+def e_value(rr: float) -> float:
+    """E-value (VanderWeele & Ding 2017): min strength of unmeasured
+    confounder needed to explain away an observed RR.
+    """
+    rr = max(rr, 1e-9)
+    if rr >= 1.0:
+        return rr + math.sqrt(rr * (rr - 1))
+    rr_inv = 1.0 / rr
+    return rr_inv + math.sqrt(rr_inv * (rr_inv - 1))
+def negative_control_check(nc_ate: float, threshold: float = 0.02) -> bool:
+    """Negative-control outcome: ATE on a control outcome should be ~0."""
+    return abs(nc_ate) < threshold
+def tipping_point(observed_effect: float, ci_half_width: float) -> float:
+    """How much would unmeasured confounding need to shift effect to nullify?"""
+    if abs(observed_effect) <= ci_half_width:
+        return 0.0
+    return float(abs(observed_effect) - ci_half_width)
+# ---------- Generation fidelity (SynthEHRella triad) ----------
+def dim_wise_probability(real_seq: torch.Tensor, synth_seq: torch.Tensor,
+                          vocab_size: int) -> float:
+    """Compare per-code marginal frequencies (pooled over batch and position)
+    between real and synthetic batches of token ids.
+    Returns mean abs difference (lower = closer match).
+    """
+    real_one_hot = F.one_hot(real_seq, vocab_size).float().mean(dim=(0, 1))
+    synth_one_hot = F.one_hot(synth_seq, vocab_size).float().mean(dim=(0, 1))
+    return float((real_one_hot - synth_one_hot).abs().mean())
+def mmd_rbf(x: torch.Tensor, y: torch.Tensor, sigma: float = 1.0) -> float:
+    """Maximum Mean Discrepancy with RBF kernel.
+    x, y: (B, D) flattened embeddings. Returns the biased (V-statistic) estimate
+    of MMD^2, which includes the k(x_i, x_i) diagonal terms (lower = closer).
+    """
+    def rbf(a, b):
+        d = (a.unsqueeze(1) - b.unsqueeze(0)).pow(2).sum(-1)
+        return torch.exp(-d / (2 * sigma ** 2))
+    return float(rbf(x, x).mean() + rbf(y, y).mean() - 2 * rbf(x, y).mean())
+# ---------- Subgroup fairness ----------
+def stratified_metrics(y: np.ndarray, p: np.ndarray,
+                       groups: np.ndarray,
+                       metric_fn: Callable = auroc) -> dict[str, float]:
+    """Compute metric per subgroup (sex, age band, UF region)."""
+    out = {}
+    for g in np.unique(groups):
+        mask = groups == g
+        if mask.sum() > 10:
+            try:
+                out[str(g)] = metric_fn(y[mask], p[mask])
+            except Exception:
+                out[str(g)] = float("nan")
+    return out
+# ---------- DATASUS split strategies ----------
+def temporal_split(events: list[dict], train_until: int = 2022,
+                   val_year: int = 2023):
+    """Temporal split for DATASUS: train ≤2022, val 2023, test 2024+.
+    Events with a missing year default to 2020 and therefore land in train.
+    """
+    train, val, test = [], [], []
+    for e in events:
+        y = e.get("year") or 2020
+        if y <= train_until: train.append(e)
+        elif y == val_year: val.append(e)
+        else: test.append(e)
+    return train, val, test
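+def geographic_split(patients: list[dict], external_ufs: set = None):
+    """Geographic split: test on North + Northeast UFs, train on everything else.
+    The default external set lists the 16 N+NE states ("SE" in that set is Sergipe,
+    not the Southeast region). Patients from the Centre-West, or with no UF on any
+    event, fall into train. The UF is taken from the patient's first event that has one.
+    For DATASUS this is the closest analog to "external validation."
+    """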
+    if external_ufs is None:
+        external_ufs = {"AC", "AL", "AP", "AM", "BA", "CE", "MA", "PA",
+                       "PB", "PE", "PI", "RN", "SE", "TO", "RR", "RO"}
+    train, test = [], []
+    for p in patients:
+        uf = next((e.get("uf_code") for e in p.get("events", []) if e.get("uf_code")),
+                  None)
+        (test if uf in external_ufs else train).append(p)
+    return train, test
+# ---------- Combined eval report ----------
+def full_eval_report(y: np.ndarray, p: np.ndarray,
+                     groups_sex: np.ndarray = None,
+                     groups_age: np.ndarray = None,
+                     groups_uf: np.ndarray = None,
+                     n_boot: int = 2000) -> dict:
+    """Generate a full audit-proof report for a binary classification task.
+    Returns a dict with point estimates + bootstrap CIs + DCA + fairness.
+    """
+    auroc_lo, auroc_med, auroc_hi = bootstrap_ci(y, p, auroc, n_boot)
+    auprc_lo, auprc_med, auprc_hi = bootstrap_ci(y, p, auprc, n_boot)
+    brier_lo, brier_med, brier_hi = bootstrap_ci(y, p, brier, n_boot)
+    report = {
+        "n_eval": len(y),
+        "prevalence": float(y.mean()),
+        "auroc": {"point": auroc(y, p), "ci95": [auroc_lo, auroc_hi], "median": auroc_med},
+        "auprc": {"point": auprc(y, p), "ci95": [auprc_lo, auprc_hi], "median": auprc_med},
+        "brier": {"point": brier(y, p), "ci95": [brier_lo, brier_hi], "median": brier_med},
+        "ici": ici(y, p),
+        "decision_curve": decision_curve(y, p),
+    }
+    if groups_sex is not None:
+        report["fairness_sex"] = stratified_metrics(y, p, groups_sex, auroc)
+    if groups_age is not None:
+        report["fairness_age"] = stratified_metrics(y, p, groups_age, auroc)
+    if groups_uf is not None:
+        report["fairness_uf"] = stratified_metrics(y, p, groups_uf, auroc)
+    return report
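
As a worked check of the decision-curve and E-value formulas used in this file, here is a small sketch re-implemented in plain Python so it runs without NumPy. The labels and scores are invented for illustration, and `treat_all_net_benefit` is a hypothetical helper name that mirrors the `treat_all` entry of `decision_curve`.

```python
import math


def net_benefit(y, p, threshold):
    """Vickers net benefit: TP/n - FP/n * t / (1 - t)."""
    n = len(y)
    tp = sum(1 for yi, pi in zip(y, p) if pi >= threshold and yi == 1)
    fp = sum(1 for yi, pi in zip(y, p) if pi >= threshold and yi == 0)
    return tp / n - (fp / n) * (threshold / (1 - threshold))


def treat_all_net_benefit(y, threshold):
    """Net benefit of flagging every patient as positive."""
    prev = sum(y) / len(y)
    return prev - (1 - prev) * (threshold / (1 - threshold))


def e_value(rr):
    """E-value for a risk ratio; protective effects (rr < 1) are inverted first."""
    rr = max(rr, 1e-9)
    rr = rr if rr >= 1.0 else 1.0 / rr
    return rr + math.sqrt(rr * (rr - 1))


# Toy data: 2 positives out of 5, scored by a hypothetical model.
y = [1, 1, 0, 0, 0]
p = [0.9, 0.4, 0.6, 0.2, 0.1]

nb_model = net_benefit(y, p, 0.2)        # tp=2, fp=2: 0.4 - 0.4 * 0.25
nb_all = treat_all_net_benefit(y, 0.2)   # prevalence 0.4: 0.4 - 0.6 * 0.25
print(nb_model, nb_all, e_value(2.0))
```

At a threshold of 0.2 the toy model has a higher net benefit than treat-all because it spares one of the three negatives. A risk ratio of 2.0 and its inverse 0.5 give the same E-value, since the formula is applied to whichever of RR and 1/RR is at least 1.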

src/meds_export.py ADDED

	@@ -0,0 +1,336 @@

+"""MEDS v0.4.1 exporter for DATASUS — audit-proof, interop-ready.
+Verified against:
+  - meds 0.4.1 schemas (DataSchema, CodeMetadataSchema)
+  - https://github.com/Medical-Event-Data-Standard/meds
+  - CLMBR/MOTOR/EHRSHOT/CoMET tokenization conventions
+Code conventions (interop-compatible):
+  - Static (time=None): GENDER//, RACE//, UF//, MUN//, ORPHA//
+  - Birth/Death: MEDS_BIRTH, MEDS_DEATH (reserved)
+  - Diagnoses: ICD10//<cid> (NOT CID10// — interop with OHDSI/Athena)
+  - Hospitalization: SIH//ADM, SIH//DIS (numeric_value=LOS_days on DIS)
+  - Procedures: SIGTAP//<10-digit> (Brazil-local namespace)
+  - Drugs (APAC): APAC//<sigtap> (numeric_value=monthly_cost_brl)
+  - Outpatient (BPA-I): BPAI//<sigtap>
+  - Visits: Visit//{IP, OP, ER} (matches CLMBR convention)
+Outputs canonical MEDS dataset:
+  /out/
+  ├── data/                       # parquet shards by subject
+  │   ├── shard_0.parquet
+  │   └── ...
+  ├── metadata/
+  │   └── codes.parquet           # REQUIRED: every unique code with description + parent_codes
+  └── dataset_metadata.json       # MEDS dataset metadata
+"""
+from __future__ import annotations
+import os
+import json
+import logging
+from collections import Counter
+from datetime import datetime
+import pyarrow as pa
+import pyarrow.parquet as pq
+import meds
+log = logging.getLogger("gemeo.cdf.meds_export")
+def _parse_date(s) -> datetime | None:
+    """Parse date string from various DATASUS formats."""
+    if s is None: return None
+    s = str(s).strip()
+    if not s or s in ("0", "None", "nan"): return None
+    try:
+        if "-" in s:
+            return datetime.strptime(s[:10], "%Y-%m-%d")
+        if len(s) == 8:
+            return datetime.strptime(s, "%Y%m%d")
+    except ValueError:
+        return None
+    return None
+def _ym(year, month) -> datetime | None:
+    if year is None: return None
+    try:
+        return datetime(int(year), int(month) if month else 1, 1)
+    except (ValueError, TypeError):
+        return None
+def datasus_patient_to_meds_rows(p: dict, subject_id: int) -> list[tuple]:
+    """Convert one DATASUS patient trajectory to a list of MEDS rows.
+    Each row is (subject_id, time, code, numeric_value, text_value).
+    Returns rows ready to write to a parquet shard.
+    """
+    rows = []
+    # ---- Static (time=None) ----
+    if p.get("sex"):
+        rows.append((subject_id, None, f"GENDER//{p['sex']}", None, None))
+    # ORPHA is rare-disease specific (parallel to ICD10)
+    for orpha in p.get("orphas", []):
+        rows.append((subject_id, None, f"ORPHA//{orpha}", None, None))
+    # ---- Birth (use birth_year as Jan 1) ----
+    birth_year = p.get("birth_year")
+    birth_dt = _ym(birth_year, 1)  # None if the year is missing or malformed
+    if birth_dt:
+        rows.append((subject_id, birth_dt, "MEDS_BIRTH", None, None))
+    # ---- Events ----
+    for e in p.get("events", []):
+        et = e.get("type")
+        if et == "admission":  # SIH-RD
+            t = _ym(e.get("year"), e.get("month")) or _parse_date(e.get("admission_date"))
+            if not t: continue
+            rows.append((subject_id, t, "SIH//ADM", None, None))
+            rows.append((subject_id, t, "Visit//IP", None, None))
+            cid = e.get("cid_princ", "")
+            if cid: rows.append((subject_id, t, f"ICD10//{cid}", None, None))
+            proc = e.get("primary_procedure")
+            # str() guards against procedure codes stored as integers
+            if proc: rows.append((subject_id, t, f"SIGTAP//{str(proc)[:10]}", None, None))
+            los = e.get("los_days")
+            disch_dt = _parse_date(e.get("discharge_date")) or t
+            if e.get("death_during_stay"):
+                rows.append((subject_id, disch_dt, "MEDS_DEATH", None, None))
+            else:
+                rows.append((subject_id, disch_dt, "SIH//DIS",
+                            float(los) if los is not None else None, None))
+        elif et == "treatment":  # APAC-SIA orphan drug
+            t = _ym(e.get("year"), e.get("month"))
+            if not t: continue
+            cid = e.get("cid", "")
+            if cid: rows.append((subject_id, t, f"ICD10//{cid}", None, None))
+            proc = str(e.get("procedure_code") or "")[:10]  # tolerate an explicit None
+            if proc:
+                cost = e.get("monthly_cost_brl")
+                rows.append((subject_id, t, f"APAC//{proc}",
+                            float(cost) if cost is not None else None, None))
+        elif et == "outpatient_proc":  # BPA-I
+            t = _parse_date(e.get("auth_date")) or _ym(e.get("year"), e.get("month"))
+            if not t: continue
+            cid = e.get("cid", "")
+            if cid: rows.append((subject_id, t, f"ICD10//{cid}", None, None))
+            proc = str(e.get("procedure_code") or "")[:10]  # tolerate an explicit None
+            if proc:
+                rows.append((subject_id, t, f"BPAI//{proc}", None, None))
+        elif et == "death":  # SIM
+            t = _parse_date(e.get("date_of_death")) or _ym(e.get("year"), e.get("month"))
+            if not t: continue
+            rows.append((subject_id, t, "MEDS_DEATH", None, None))
+            cid = (e.get("cause_cid") or e.get("cid_princ") or e.get("cid", ""))
+            if cid: rows.append((subject_id, t, f"ICD10//{cid}", None, None))
+    # Sort: nulls first (static), then by time
+    rows.sort(key=lambda r: (r[1] is not None, r[1] or datetime(1900, 1, 1)))
+    return rows
+def export_to_meds(patients: list[dict], out_dir: str,
+                   shard_size: int = 5000,
+                   dataset_name: str = "GEMEO-DATASUS",
+                   version: str = "v13"):
+    """Export a list of DATASUS patient trajectories to MEDS v0.4.1 format.
+    Parameters
+    ----------
+    patients : list of dict
+        Each dict must have: patient_id, sex, birth_year, orphas (list),
+        events (list of dicts with 'type', 'year', 'month', etc.)
+    out_dir : str
+        Output directory (will create data/ and metadata/ subdirs)
+    shard_size : int
+        Number of subjects per parquet shard
+    """
+    os.makedirs(f"{out_dir}/data", exist_ok=True)
+    os.makedirs(f"{out_dir}/metadata", exist_ok=True)
+    log.info(f"Exporting {len(patients)} patients to MEDS at {out_dir}")
+    # Map patient_id (string hash) → int64 subject_id (MEDS requires int64)
+    pid_to_sid = {p["patient_id"]: i for i, p in enumerate(patients)}
+    # ---- Stream rows ----
+    all_codes = Counter()
+    shard_idx = 0
+    shard_rows = []
+    n_events = 0
+    n_subjects = 0
+    for p in patients:
+        sid = pid_to_sid[p["patient_id"]]
+        rows = datasus_patient_to_meds_rows(p, sid)
+        shard_rows.extend(rows)
+        n_events += len(rows)
+        n_subjects += 1
+        for r in rows:
+            all_codes[r[2]] += 1
+        # Write shard when full
+        if n_subjects % shard_size == 0 and shard_rows:
+            _write_shard(shard_rows, f"{out_dir}/data/shard_{shard_idx}.parquet")
+            shard_idx += 1
+            shard_rows = []
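+    # Write remaining
+    if shard_rows:
+        _write_shard(shard_rows, f"{out_dir}/data/shard_{shard_idx}.parquet")
+        shard_idx += 1
+    # shard_idx now equals the number of shards written (no off-by-one on exact multiples)
+    log.info(f"  wrote {shard_idx} data shards, {n_events} rows, {n_subjects} subjects")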
+    # ---- codes.parquet (REQUIRED in MEDS v0.4) ----
+    code_rows = []
+    for code, count in all_codes.most_common():
+        # parent_codes: empty for Brazil-local namespaces; populated for ICD10 -> SNOMED if mapped
+        parent_codes = _get_parent_codes(code)
+        code_rows.append({
+            "code": code,
+            "description": _get_description(code, count),
+            "parent_codes": parent_codes,
+        })
+    code_table = pa.Table.from_pylist(code_rows, schema=meds.CodeMetadataSchema.schema())
+    pq.write_table(code_table, f"{out_dir}/metadata/codes.parquet")
+    log.info(f"  wrote metadata/codes.parquet ({len(code_rows)} unique codes)")
+    # ---- dataset_metadata.json ----
+    md = {
+        "dataset_name": dataset_name,
+        "dataset_version": version,
+        "etl_name": "gemeo.cdf.meds_export",
+        "etl_version": "1.0.0",
+        "meds_version": meds.__version__,
+        "n_subjects": n_subjects,
+        "n_events": n_events,
+        "n_unique_codes": len(all_codes),
+        "top_codes": dict(all_codes.most_common(30)),
+    }
+    with open(f"{out_dir}/dataset_metadata.json", "w") as f:
+        json.dump(md, f, indent=2, default=str)
+    log.info("  wrote dataset_metadata.json")
+    return md
+def _write_shard(rows: list[tuple], path: str):
+    """Write a list of (subject_id, time, code, numeric_value, text_value) to parquet."""
+    if not rows: return
+    # Build columnar arrays
+    subject_id = pa.array([r[0] for r in rows], type=pa.int64())
+    time = pa.array([r[1] for r in rows], type=pa.timestamp("us"))
+    code = pa.array([r[2] for r in rows], type=pa.string())
+    numeric_value = pa.array([r[3] for r in rows], type=pa.float32())
+    text_value = pa.array([r[4] for r in rows], type=pa.large_string())
+    table = pa.Table.from_arrays(
+        [subject_id, time, code, numeric_value, text_value],
+        names=["subject_id", "time", "code", "numeric_value", "text_value"],
+    )
+    # Coerce to the canonical MEDS schema. safe=False permits lossy casts,
+    # so this normalizes column types rather than strictly validating them.
+    expected_schema = meds.DataSchema.schema()
+    table = table.cast(expected_schema, safe=False)
+    pq.write_table(table, path, compression="zstd")
+# Brazilian-specific mapping tables (extend as needed)
+ICD10_CHAPTERS = {
+    "A": "Certain infectious and parasitic diseases",
+    "B": "Certain infectious and parasitic diseases",
+    "C": "Neoplasms",
+    "D": "Neoplasms / Diseases of the blood and immune",
+    "E": "Endocrine, nutritional and metabolic diseases",
+    "F": "Mental, Behavioral and Neurodevelopmental disorders",
+    "G": "Diseases of the nervous system",
+    "H": "Diseases of the eye / ear",
+    "I": "Diseases of the circulatory system",
+    "J": "Diseases of the respiratory system",
+    "K": "Diseases of the digestive system",
+    "L": "Diseases of the skin and subcutaneous tissue",
+    "M": "Diseases of the musculoskeletal system",
+    "N": "Diseases of the genitourinary system",
+    "O": "Pregnancy, childbirth and the puerperium",
+    "P": "Certain conditions originating in the perinatal period",
+    "Q": "Congenital malformations, deformations and chromosomal abnormalities",
+    "R": "Symptoms, signs and abnormal clinical and laboratory findings",
+    "S": "Injury, poisoning and certain other consequences of external causes",
+    "T": "Injury, poisoning and certain other consequences of external causes",
+    "U": "Codes for special purposes",
+    "V": "External causes of morbidity",
+    "W": "External causes of morbidity",
+    "X": "External causes of morbidity",
+    "Y": "External causes of morbidity",
+    "Z": "Factors influencing health status and contact with health services",
+}
+def _get_description(code: str, count: int) -> str:
+    """Generate a brief description for a code (used in codes.parquet)."""
+    if code in ("MEDS_BIRTH",): return "Birth event (reserved)"
+    if code in ("MEDS_DEATH",): return "Death event (reserved)"
+    parts = code.split("//")
+    if len(parts) < 2: return f"Unknown code (n={count})"
+    domain, val = parts[0], "//".join(parts[1:])
+    if domain == "GENDER": return f"Patient sex = {val}"
+    if domain == "ORPHA": return f"Orphanet rare disease {val}"
+    if domain == "ICD10":
+        ch = ICD10_CHAPTERS.get(val[0], "Unknown chapter")
+        return f"ICD-10 {val} ({ch})"
+    if domain == "SIH": return f"SIH hospitalization {val}"
+    if domain == "Visit": return f"Visit type {val}"
+    if domain == "SIGTAP": return f"SIGTAP procedure {val}"
+    if domain == "APAC": return f"APAC orphan-drug authorization {val}"
+    if domain == "BPAI": return f"BPA-I outpatient procedure {val}"
+    if domain == "UF": return f"Residence UF {val}"
+    return f"{domain} code {val}"
+def _get_parent_codes(code: str) -> list[str]:
+    """Return parent codes for ontology hierarchy (currently minimal)."""
+    parts = code.split("//")
+    if len(parts) < 2: return []
+    domain, val = parts[0], "//".join(parts[1:])
+    parents = []
+    if domain == "ICD10" and len(val) >= 3:
+        # ICD-10 chapter as parent
+        chapter = val[0]
+        if chapter in ICD10_CHAPTERS:
+            parents.append(f"ICD10//chapter_{chapter}")
+        # 3-char prefix as parent (e.g., E84.0 → E84)
+        if "." in val:
+            parents.append(f"ICD10//{val.split('.')[0]}")
+        elif len(val) > 3:
+            parents.append(f"ICD10//{val[:3]}")
+    if domain == "SIGTAP" and len(val) >= 4:
+        # 4-digit group as parent (SIGTAP 10-digit → 4-digit group)
+        parents.append(f"SIGTAP//group_{val[:4]}")
+    return parents
+def load_meds_dataset(meds_dir: str) -> dict:
+    """Load a MEDS dataset back from parquet for inspection or downstream processing."""
+    import glob
+    shards = sorted(glob.glob(f"{meds_dir}/data/*.parquet"))
+    tables = [pq.read_table(p) for p in shards]
+    data = pa.concat_tables(tables) if tables else None
+    codes = pq.read_table(f"{meds_dir}/metadata/codes.parquet")
+    with open(f"{meds_dir}/dataset_metadata.json") as f:
+        md = json.load(f)
+    return {"data": data, "codes": codes, "metadata": md}
+if __name__ == "__main__":
+    # Quick test on real patient data
+    logging.basicConfig(level=logging.INFO,
+                        format="%(asctime)s %(levelname)s %(message)s")
+    PATIENTS = "/tmp/datasus_patient_trajectories_v2.json"
+    if os.path.exists(PATIENTS):
+        with open(PATIENTS) as f:
+            patients = json.load(f)[:50]   # 50 patients smoke test
+        md = export_to_meds(patients, "/tmp/meds_smoke_test")
+        print("\n=== smoke test result ===")
+        print(json.dumps(md, indent=2, default=str))
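
Two conventions in the exporter are easy to verify in isolation: static rows (time=None) sort ahead of timed events, and ICD-10 codes get a chapter parent plus a 3-character category parent. The sketch below assumes nothing beyond the standard library. `icd10_parents` is a hypothetical, simplified stand-in for `_get_parent_codes` (it skips the `ICD10_CHAPTERS` membership check), and the sample rows are invented.

```python
from datetime import datetime

# Invented rows in (subject_id, time, code, numeric_value, text_value) order.
rows = [
    (0, datetime(2021, 3, 1), "SIH//ADM", None, None),
    (0, None, "GENDER//F", None, None),
    (0, datetime(1990, 1, 1), "MEDS_BIRTH", None, None),
    (0, None, "ORPHA//586", None, None),
]

# Same key as the exporter: static rows (False) sort before timed rows (True),
# then by time; the 1900 placeholder only keeps None out of the comparison.
rows.sort(key=lambda r: (r[1] is not None, r[1] or datetime(1900, 1, 1)))
codes = [r[2] for r in rows]


def icd10_parents(val):
    """Chapter parent plus 3-character category parent for an ICD-10 value."""
    parents = []
    if len(val) >= 3:
        parents.append(f"ICD10//chapter_{val[0]}")
        if "." in val:
            parents.append(f"ICD10//{val.split('.')[0]}")  # dotted form, E84.0
        elif len(val) > 3:
            parents.append(f"ICD10//{val[:3]}")            # undotted form, E840
    return parents


print(codes)
print(icd10_parents("E84.0"), icd10_parents("E840"), icd10_parents("E84"))
```

Python's sort is stable, so static rows keep their insertion order relative to one another. A bare 3-character category such as `E84` receives only the chapter parent.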

src/primekg_attention.py ADDED

	@@ -0,0 +1,262 @@

+"""PrimeKG cross-attention — graph-RAG into the Diffusion Forcing denoiser.
+Now uses REAL EDGES from raras-app/data/graph-ml/hetero_graph.json:
+  - disease → has_phenotype → phenotype  (curated phenotype linkage)
+  - disease → associated_with → gene     (causal gene evidence)
+  - gene → interacts_with → gene         (PPI network)
+  - phenotype → is_a → phenotype         (HPO ontology)
+Ego-subgraph BFS:
+  1. Start from disease node (ORPHA → PrimeKG index)
+  2. 1-hop: pull connected phenotypes (top-K by edge weight or count)
+  3. 1-hop: pull connected genes
+  4. 2-hop: gene→gene neighbors (interacting partners)
+  5. Concatenate fused embeddings of all selected nodes → cross-attn context
+Falls back to cosine-similarity if graph not loaded.
+White-space architecture (May 2026):
+  - EHRWorld, CLARITY, Time-Aware G-Transformer all skip KG conditioning
+  - PhenoKG/RareNet use KG for RETRIEVAL (rare disease diagnosis)
+  - We use it for GENERATION (counterfactual trajectory completion)
+"""
+from __future__ import annotations
+import os
+import json
+import logging
+from functools import lru_cache
+import numpy as np
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+log = logging.getLogger("gemeo.cdf.kg")
+# Try raras-app paths first (richer, including hetero_graph edges + node_texts)
+RARAS_KG_DIR = "/Users/dimas/raras-app/data/graph-ml"
+LOCAL_KG_DIR = os.path.join(
+    os.path.dirname(os.path.dirname(os.path.abspath(__file__))), "data")
+def _kg_path(name: str) -> str | None:
+    """Prefer the raras-app path if available, fall back to the local data dir.
+    Returns None if the file exists in neither location.
+    """
+    raras = os.path.join(RARAS_KG_DIR, name)
+    if os.path.exists(raras):
+        return raras
+    local = os.path.join(LOCAL_KG_DIR, name)
+    return local if os.path.exists(local) else None
+@lru_cache(maxsize=1)
+def load_kg(prefer_raras: bool = True) -> dict | None:
+    """Load PrimeKG: fused embeddings + node ids + edges + texts.
+    Returns dict:
+        emb        : {kind: torch.Tensor(N, 3072)}
+        idx2id     : {kind: {pos: id_str}}
+        id2idx     : {kind: {id_str: pos}}
+        edges      : {edge_type: {'src': [...], 'dst': [...]}}
+        adj        : {edge_type: {src_idx: [dst_idx, ...]}}  -- precomputed
+        texts      : {kind: [str, ...]}  -- aligned to position
+        num_nodes  : {kind: int}
+    """
+    # Try raras-app full file first, then local fp16
+    emb_path = (os.path.join(RARAS_KG_DIR, "fused_embeddings.npz")
+                if prefer_raras and os.path.exists(os.path.join(RARAS_KG_DIR, "fused_embeddings.npz"))
+                else _kg_path("fused_embeddings_fp16.npz"))
+    if not emb_path or not os.path.exists(emb_path):
+        log.warning("PrimeKG fused embeddings not found")
+        return None
+    nids_path = _kg_path("node_ids.json")
+    graph_path = _kg_path("hetero_graph.json")
+    texts_path = _kg_path("node_texts.json")
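+    fz = np.load(emb_path)
+    def _load_json(path, default):
+        # Close the file handle promptly; fall back to the default if the path is missing.
+        if not path: return default
+        with open(path) as f: return json.load(f)
+    nids = _load_json(nids_path, {})
+    graph = _load_json(graph_path, {"edges": {}, "num_nodes": {}})
+    texts = _load_json(texts_path, {})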
+    out = {"emb": {}, "id2idx": {}, "idx2id": {}, "edges": {}, "adj": {},
+           "texts": texts, "num_nodes": graph.get("num_nodes", {})}
+    for kind in ("disease", "phenotype", "gene"):
+        if kind in fz.files:
+            out["emb"][kind] = torch.from_numpy(fz[kind].astype(np.float32))
+            if kind in nids:
+                out["idx2id"][kind] = {int(k): v for k, v in nids[kind].items()}
+                out["id2idx"][kind] = {v: int(k) for k, v in nids[kind].items()}
+    # Build adjacency from edges
+    for edge_type, edata in graph.get("edges", {}).items():
+        adj = {}
+        srcs = edata.get("src", []) if isinstance(edata, dict) else []
+        dsts = edata.get("dst", []) if isinstance(edata, dict) else []
+        for s, d in zip(srcs, dsts):
+            adj.setdefault(int(s), []).append(int(d))
+        out["adj"][edge_type] = adj
+        out["edges"][edge_type] = edata
+    log.info(f"  KG loaded from {emb_path}")
+    log.info(f"  disease={out['emb'].get('disease', torch.empty(0)).shape}, "
+             f"phenotype={out['emb'].get('phenotype', torch.empty(0)).shape}, "
+             f"gene={out['emb'].get('gene', torch.empty(0)).shape}")
+    log.info(f"  edges: {list(out['edges'].keys())}")
+    return out
+def ego_subgraph_real(orpha_code: str, k_pheno: int = 16, k_gene: int = 16,
+                      k_gene_2hop: int = 0, kg: dict | None = None) -> torch.Tensor | None:
+    """BFS ego-subgraph using REAL PrimeKG edges (not cosine similarity).
+    Returns concatenated embeddings (N, 3072) made of:
+      - 1 disease node (the query)
+      - up to k_pheno phenotype nodes (direct edges)
+      - up to k_gene gene nodes (direct edges)
+      - up to k_gene_2hop gene-gene 2-hop neighbors
+    Falls back to cosine similarity if the disease has no edges of a given type.
+    Returns None if the KG is not loaded or the ORPHA code is not in it.
+    """
+    if kg is None:
+        kg = load_kg()
+    if kg is None or "disease" not in kg["emb"]:
+        return None
+    d_id = kg["id2idx"].get("disease", {}).get(str(orpha_code))  # no KeyError if node_ids is missing
+    if d_id is None:
+        return None
+    d_emb = kg["emb"]["disease"][d_id]
+    nodes = [d_emb.unsqueeze(0)]
+    # Phenotype neighbors (via disease__has_phenotype__phenotype)
+    adj = kg["adj"].get("disease__has_phenotype__phenotype", {})
+    pheno_neighbors = adj.get(d_id, [])
+    if pheno_neighbors and "phenotype" in kg["emb"]:
+        pheno_neighbors = pheno_neighbors[:k_pheno]
+        nodes.append(kg["emb"]["phenotype"][pheno_neighbors])
+    elif "phenotype" in kg["emb"]:
+        # Fallback: cosine similarity
+        pool = kg["emb"]["phenotype"]
+        sim = F.cosine_similarity(d_emb.unsqueeze(0), pool, dim=-1)
+        top = sim.topk(min(k_pheno, pool.size(0))).indices
+        nodes.append(pool[top])
+    # Gene neighbors (via disease__associated_with__gene)
+    g_adj = kg["adj"].get("disease__associated_with__gene", {})
+    gene_neighbors = g_adj.get(d_id, [])
+    if gene_neighbors and "gene" in kg["emb"]:
+        gene_neighbors = gene_neighbors[:k_gene]
+        nodes.append(kg["emb"]["gene"][gene_neighbors])
+        # 2-hop: gene-gene neighbors of the genes we just pulled
+        if k_gene_2hop > 0:
+            gg_adj = kg["adj"].get("gene__interacts_with__gene", {})
+            seen = set(gene_neighbors)
+            second_hop = []
+            for g in gene_neighbors:
+                for g2 in gg_adj.get(g, []):
+                    if g2 not in seen:
+                        second_hop.append(g2)
+                        seen.add(g2)
+                        if len(second_hop) >= k_gene_2hop: break
+                if len(second_hop) >= k_gene_2hop: break
+            if second_hop:
+                nodes.append(kg["emb"]["gene"][second_hop])
+    elif "gene" in kg["emb"]:
+        pool = kg["emb"]["gene"]
+        sim = F.cosine_similarity(d_emb.unsqueeze(0), pool, dim=-1)
+        top = sim.topk(min(k_gene, pool.size(0))).indices
+        nodes.append(pool[top])
+    return torch.cat(nodes, dim=0)
+# Keep old API name for backward compat
+ego_subgraph = ego_subgraph_real
+class KGCrossAttention(nn.Module):
+    """Cross-attention from sequence (B, T, d_model) to KG ego (B, N, d_model)."""
+    def __init__(self, d_model: int, n_heads: int = 8, dropout: float = 0.1):
+        super().__init__()
+        assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
+        self.n_heads = n_heads
+        self.head_dim = d_model // n_heads
+        self.q_proj = nn.Linear(d_model, d_model, bias=False)
+        self.kv_proj = nn.Linear(d_model, 2 * d_model, bias=False)
+        self.out_proj = nn.Linear(d_model, d_model, bias=False)
+        self.norm_q = nn.LayerNorm(d_model)
+        self.norm_kv = nn.LayerNorm(d_model)
+        self.dropout = nn.Dropout(dropout)
+    def forward(self, x_seq: torch.Tensor, x_kg: torch.Tensor) -> torch.Tensor:
+        B, T, D = x_seq.shape
+        _, N, _ = x_kg.shape
+        q = self.q_proj(self.norm_q(x_seq))
+        kv = self.kv_proj(self.norm_kv(x_kg))
+        k, v = kv.chunk(2, dim=-1)
+        q = q.reshape(B, T, self.n_heads, self.head_dim).transpose(1, 2)
+        k = k.reshape(B, N, self.n_heads, self.head_dim).transpose(1, 2)
+        v = v.reshape(B, N, self.n_heads, self.head_dim).transpose(1, 2)
+        out = F.scaled_dot_product_attention(
+            q, k, v, dropout_p=self.dropout.p if self.training else 0.0)
+        out = out.transpose(1, 2).reshape(B, T, D)
+        return x_seq + self.dropout(self.out_proj(out))
+class KGProjector(nn.Module):
+    """Project 3072-d KG embeddings to d_model with LayerNorm."""
+    def __init__(self, kg_dim: int, d_model: int):
+        super().__init__()
+        self.proj = nn.Sequential(
+            nn.Linear(kg_dim, d_model),
+            nn.GELU(),
+            nn.LayerNorm(d_model),
+        )
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
+        return self.proj(x)
+def build_kg_batch(orpha_strings: list[str], d_model: int,
+                   projector: KGProjector,
+                   k_pheno: int = 16, k_gene: int = 16,
+                   k_gene_2hop: int = 0) -> torch.Tensor:
+    """Build (B, N, d_model) batched KG context for a batch of patient ORPHAs.
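+    Each ego subgraph is zero-padded or truncated to
+    N = 1 + k_pheno + k_gene + k_gene_2hop nodes. Missing ORPHAs fall back
+    to an all-zero context; if the KG cannot be loaded, the result is a
+    (B, 1, d_model) zero tensor.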
+    """
+    kg = load_kg()
+    if kg is None:
+        return torch.zeros(len(orpha_strings), 1, d_model,
+                          device=next(projector.parameters()).device)
+    N = 1 + k_pheno + k_gene + k_gene_2hop
+    egos = []
+    for orpha in orpha_strings:
+        e = ego_subgraph_real(orpha, k_pheno, k_gene, k_gene_2hop, kg)
+        if e is None:
+            e = torch.zeros(N, kg["emb"]["disease"].size(-1))
+        elif e.size(0) < N:
+            pad = torch.zeros(N - e.size(0), e.size(-1))
+            e = torch.cat([e, pad], dim=0)
+        egos.append(e[:N])
+    egos = torch.stack(egos, dim=0)
+    return projector(egos.to(next(projector.parameters()).device))
+def precompute_kg_for_dataset(orpha_codes: list[str], projector: KGProjector,
+                              k_pheno: int = 16, k_gene: int = 16,
+                              batch_size: int = 32) -> torch.Tensor:
+    """Pre-compute KG context for an entire dataset in batches.
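+    Returns a (N_patients, kg_nodes, d_model) tensor on CPU: each batch is
+    moved off the projector device before concatenation, so the result can
+    be cached to disk by the caller (this function does not save it).
+    k_gene_2hop is not forwarded, so build_kg_batch uses its default of 0.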
+    """
+    out = []
+    for i in range(0, len(orpha_codes), batch_size):
+        batch = orpha_codes[i:i + batch_size]
+        ctx = build_kg_batch(batch, projector.proj[0].out_features,
+                             projector, k_pheno, k_gene)
+        out.append(ctx.cpu())
+    return torch.cat(out, dim=0)
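
The capped 2-hop gene expansion inside `ego_subgraph_real` can be read in isolation as a deduplicating neighbor walk. The block below is a minimal pure-Python sketch of that loop, not the repository's code: the name `second_hop_neighbors` and the toy adjacency are hypothetical, and it works on plain integer ids rather than embedding rows.

```python
def second_hop_neighbors(first_hop: list[int],
                         adj: dict[int, list[int]],
                         cap: int) -> list[int]:
    """Collect up to `cap` unseen neighbors of `first_hop`, in visit order."""
    if cap <= 0:
        return []
    seen = set(first_hop)      # never re-add a first-hop gene
    out: list[int] = []
    for g in first_hop:
        for g2 in adj.get(g, []):
            if g2 in seen:
                continue
            seen.add(g2)
            out.append(g2)
            if len(out) >= cap:
                return out     # a single return replaces the nested breaks
    return out


# Toy gene-gene adjacency with hypothetical ids.
toy_adj = {1: [2, 3, 4], 2: [1, 5], 9: [3, 6]}
hops = second_hop_neighbors([1, 2], toy_adj, cap=3)
print(hops)
```

Returning from a helper once the cap is reached avoids repeating the `len(second_hop) >= k_gene_2hop` check at both loop levels.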

src/sample.py ADDED Viewed

	@@ -0,0 +1,230 @@

+"""Sampling primitives for CDF: AR mode, denoise mode, counterfactual rollouts.
+Diffusion Forcing flexibility — the same model handles:
+  AR mode:
+    sigma_future = 1, sigma_past = 0. Roll forward like an autoregressive
+    transformer but with per-token noise control.
+  Denoise mode (bidirectional):
+    Sigma is annealed from 1 to 0 on every free token. Run n_steps denoise
+    steps; the model fills the whole sequence.
+  Counterfactual mode (the TTE primitive):
+    Sigma=0 on observed tokens (clamp them clean), sigma=1 on tokens to
+    generate. Condition on (cohort, intervention_action_id). Sample N times,
+    compare distributions of outcome tokens.
+CFG (classifier-free guidance) wraps any mode:
+    logits_g = (1 + gamma) * logits(c) - gamma * logits(null_c)
+Shortcut Forcing (Dreamer 4) reduces denoise steps from 32-64 to 4 via
+distilled student model — implemented in distill.py.
+"""
+from __future__ import annotations
+import logging
+import torch
+import torch.nn.functional as F
+from .diffusion_forcing import CDFTransformer
+log = logging.getLogger("gemeo.cdf.sample")
+@torch.no_grad()
+def sample_denoise(
+    model: CDFTransformer,
+    cond: torch.Tensor,
+    *,
+    seed_prefix: torch.Tensor | None = None,
+    observed_mask: torch.Tensor | None = None,    # (B, T) True = clamped clean
+    action: torch.Tensor | None = None,
+    gamma: float = 2.0,
+    n_steps: int = 32,
+    null_cond: int = 0,
+    schedule: str = "cosine",
+) -> torch.Tensor:
+    """Denoise-mode sampling: fully-masked sequence + iterative refinement.
+    Supports:
+      - seed_prefix: clean tokens kept at sigma=0 for positions [0, L)
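+      - observed_mask: arbitrary positions to clamp (counterfactual mode);
+        token values come only from seed_prefix, so a clamped position
+        outside [0, L) holds MASK until the final cleanup pass fills it
+      - CFG via (cond, null_cond) pair
+    Decoding is greedy: each revealed token is the per-position argmax of
+    the guided logits.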
+    """
+    cfg = model.cfg
+    device = cond.device
+    B = cond.size(0)
+    T = cfg.max_seq_len
+    # Init with MASK
+    x = torch.full((B, T), cfg.mask_token, device=device, dtype=torch.long)
+    fixed_mask = torch.zeros(B, T, dtype=torch.bool, device=device)
+    if seed_prefix is not None:
+        L = seed_prefix.size(1)
+        x[:, :L] = seed_prefix
+        fixed_mask[:, :L] = True
+    if observed_mask is not None:
+        fixed_mask |= observed_mask
+    # Build noise schedule
+    if schedule == "cosine":
+        # smooth cosine from 1 -> 0
+        ts = torch.cos(torch.linspace(0, torch.pi/2, n_steps+1, device=device))
+    else:
+        ts = torch.linspace(1.0, 0.0, n_steps+1, device=device)
+    null = torch.full_like(cond, null_cond)
+    null_action = (torch.full_like(action, cfg.n_latent_actions)
+                   if action is not None and cfg.use_latent_action else None)
+    for k in range(n_steps):
+        # Per-token sigma: fixed positions at 0, dynamic positions at ts[k]
+        sigma = (~fixed_mask).to(ts.dtype) * ts[k]
+        logits_c = model(x, sigma, cond, action)
+        if gamma > 0:
+            logits_n = model(x, sigma, null, null_action)
+            logits = (1 + gamma) * logits_c - gamma * logits_n
+        else:
+            logits = logits_c
+        logits[:, :, cfg.mask_token] = -1e9
+        probs = F.softmax(logits, dim=-1)
+        confs, preds = probs.max(dim=-1)
+        # Confidence-based remasking: reveal top-(1 - ts[k+1]) fraction of free tokens
+        t_next = ts[k+1].item()
+        target_kept = int(round((1 - t_next) * T))
+        revealed = (x != cfg.mask_token) | fixed_mask
+        already = revealed.sum(dim=-1)
+        new_x = x.clone()
+        for b in range(B):
+            need = max(0, target_kept - int(already[b].item()))
+            if need == 0:
+                continue
+            confs_b = torch.where(revealed[b], torch.full_like(confs[b], -1e9), confs[b])
+            topi = confs_b.topk(need).indices
+            new_x[b, topi] = preds[b, topi]
+        x = new_x
+    # Final cleanup
+    mask_left = x == cfg.mask_token
+    if mask_left.any():
+        sigma_final = torch.zeros(B, T, device=device)
+        logits_c = model(x, sigma_final, cond, action)
+        if gamma > 0:
+            logits_n = model(x, sigma_final, null, null_action)
+            logits = (1 + gamma) * logits_c - gamma * logits_n
+        else:
+            logits = logits_c
+        logits[:, :, cfg.mask_token] = -1e9
+        preds = logits.argmax(-1)
+        x = torch.where(mask_left, preds, x)
+    return x
+@torch.no_grad()
+def sample_ar(
+    model: CDFTransformer,
+    cond: torch.Tensor,
+    prefix: torch.Tensor,
+    *,
+    action: torch.Tensor | None = None,
+    max_new: int = 50,
+    temperature: float = 1.0,
+    gamma: float = 0.0,
+    null_cond: int = 0,
+) -> torch.Tensor:
+    """AR-mode sampling: future tokens at sigma=1, past at sigma=0.
+    Faster than denoise mode when you only want to continue a prefix.
+    """
+    cfg = model.cfg
+    device = cond.device
+    B = cond.size(0)
+    x = prefix.clone().to(device)
+    if x.dim() == 1:
+        # Share a single 1-D prefix across the batch defined by cond.
+        x = x.unsqueeze(0).expand(B, -1)
+    null = torch.full_like(cond, null_cond)
+    for _ in range(max_new):
+        T_now = x.size(1)
+        if T_now >= cfg.max_seq_len:
+            break
+        # Pad with MASK
+        x_pad = torch.cat([x, torch.full((B, 1), cfg.mask_token,
+                                          device=device, dtype=torch.long)], dim=1)
+        sigma = torch.zeros(B, T_now + 1, device=device)
+        sigma[:, -1] = 1.0
+        a_pad = null_a = None
+        if action is not None and cfg.use_latent_action:
+            a_pad = torch.cat([action[:, :T_now],
+                               torch.full((B, 1), cfg.n_latent_actions,
+                                          device=device, dtype=torch.long)], dim=1)
+            # The null action must match a_pad's (B, T_now + 1) shape,
+            # not the shape of the full action tensor.
+            null_a = torch.full_like(a_pad, cfg.n_latent_actions)
+        logits = model(x_pad, sigma, cond, a_pad)
+        if gamma > 0:
+            logits_n = model(x_pad, sigma, null, null_a)
+            logits = (1 + gamma) * logits - gamma * logits_n
+        logits[:, :, cfg.mask_token] = -1e9
+        p = F.softmax(logits[:, -1] / max(temperature, 1e-3), dim=-1)
+        nxt = torch.multinomial(p, 1)
+        x = torch.cat([x, nxt], dim=1)
+    return x
+@torch.no_grad()
+def counterfactual_rollout(
+    model: CDFTransformer,
+    seed_prefix: torch.Tensor,
+    treatment_cond: int,
+    untreated_cond: int,
+    *,
+    treatment_action: int | None = None,
+    untreated_action: int | None = None,
+    n_samples: int = 100,
+    gamma: float = 2.0,
+    n_steps: int = 32,
+) -> dict:
+    """Sample paired counterfactual trajectories under treatment vs no-treatment.
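+    seed_prefix is a 1-D token tensor shared by all n_samples rows.
+    Two ways to specify the intervention:
+      - via cond id (cohort-level): treatment_cond / untreated_cond
+      - via latent action id (per-token): treatment_action / untreated_action
+    sample_denoise decodes greedily, so the rows within one arm differ only
+    if the model's forward pass is itself stochastic (e.g. dropout left on).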
+    """
+    cfg = model.cfg
+    device = next(model.parameters()).device
+    seed = seed_prefix.unsqueeze(0).expand(n_samples, -1).to(device)
+    T = cfg.max_seq_len
+    cond_tx = torch.full((n_samples,), treatment_cond, device=device, dtype=torch.long)
+    cond_null = torch.full((n_samples,), untreated_cond, device=device, dtype=torch.long)
+    action_tx = action_null = None
+    if cfg.use_latent_action:
+        action_tx = torch.full((n_samples, T),
+                               treatment_action if treatment_action is not None
+                               else cfg.n_latent_actions,
+                               device=device, dtype=torch.long)
+        action_null = torch.full((n_samples, T),
+                                 untreated_action if untreated_action is not None
+                                 else cfg.n_latent_actions,
+                                 device=device, dtype=torch.long)
+    traj_tx = sample_denoise(model, cond_tx, seed_prefix=seed,
+                              action=action_tx, gamma=gamma, n_steps=n_steps)
+    traj_null = sample_denoise(model, cond_null, seed_prefix=seed,
+                                action=action_null, gamma=gamma, n_steps=n_steps)
+    return {
+        "traj_treated": traj_tx, "traj_untreated": traj_null,
+        "n": n_samples, "treatment_cond": treatment_cond,
+        "untreated_cond": untreated_cond, "gamma": gamma,
+    }
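+def outcome_rate(traj: torch.Tensor, target_ids: list[int]) -> float:
+    """Fraction of trajectories in traj (N, T) that contain any id in target_ids."""
+    if not target_ids:
+        return 0.0
+    target = torch.tensor(target_ids, device=traj.device)
+    # Membership test per token, then reduce over the time axis only; this
+    # avoids passing a tuple of dims to Tensor.any.
+    has = torch.isin(traj, target).any(dim=-1)
+    return has.float().mean().item()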
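
The reveal schedule and the guidance formula in `sample_denoise` are easier to follow with numbers. The block below is a stdlib-only sketch under stated assumptions: `reveal_targets` and `cfg_combine` are hypothetical helpers that restate the `target_kept` arithmetic and the CFG formula from the module docstring on plain Python floats. They ignore clamped positions and do not call the model.

```python
import math


def reveal_targets(seq_len: int, n_steps: int) -> list[int]:
    """Target count of revealed tokens after each of the n_steps iterations."""
    # ts[k] = cos(k * pi / (2 * n_steps)) falls from 1 to 0 (cosine branch).
    ts = [math.cos(k * math.pi / (2 * n_steps)) for k in range(n_steps + 1)]
    # After step k the sampler aims for round((1 - ts[k + 1]) * seq_len) tokens.
    return [int(round((1.0 - ts[k + 1]) * seq_len)) for k in range(n_steps)]


def cfg_combine(logit_cond: float, logit_null: float, gamma: float) -> float:
    """Classifier-free guidance on one logit: (1 + gamma) * l_c - gamma * l_n."""
    return (1.0 + gamma) * logit_cond - gamma * logit_null


targets = reveal_targets(seq_len=16, n_steps=4)
print(targets)
print(cfg_combine(2.0, 0.5, gamma=2.0))
```

For 16 tokens and 4 steps the targets grow slowly at first and reach the full sequence length on the last step, which is why the final cleanup pass in `sample_denoise` normally has nothing left to fill unless clamped positions still hold MASK.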

src/wsd_scheduler.py ADDED Viewed

	@@ -0,0 +1,72 @@

+"""WSD (Warmup-Stable-Decay) LR scheduler — manual implementation.
+Per MiniCPM (Hu et al. 2024) and the data-constrained scaling literature:
+  Phase 1 (warmup, 1-5% of total_steps): linear 0 → peak_lr
+  Phase 2 (stable, 60-80%):              constant peak_lr
+  Phase 3 (decay, 10-25%):               linear, cosine, or 1/sqrt toward peak_lr * 0.1
+Beats cosine for:
+  - data-limited regimes (we can extend stable phase if loss still falls)
+  - continue-pretrain (sharp decay enables clean fine-tune handoff)
+"""
+from __future__ import annotations
+import math
+import torch
+from torch.optim.lr_scheduler import LambdaLR
+def wsd_lr_schedule(step: int, total_steps: int,
+                    warmup_steps: int = 500,
+                    stable_frac: float = 0.80,
+                    decay_frac: float = 0.15,
+                    min_lr_ratio: float = 0.1,
+                    decay_type: str = "linear") -> float:
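+    """Return the LR multiplier for a given step.
+    The multiplier rises linearly from 0 to 1 during warmup and stays in
+    [min_lr_ratio, 1.0] afterwards. stable_frac and decay_frac are fractions
+    of the post-warmup steps (total_steps - warmup_steps); any steps left
+    after the decay phase are held at min_lr_ratio.
+    """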
+    if step < warmup_steps:
+        return step / max(1, warmup_steps)
+    # remainder of steps after warmup
+    remaining = total_steps - warmup_steps
+    if remaining <= 0:
+        return 1.0
+    stable_steps = int(stable_frac * remaining)
+    decay_steps = int(decay_frac * remaining)
+    pos = step - warmup_steps
+    if pos < stable_steps:
+        return 1.0
+    decay_pos = pos - stable_steps
+    if decay_pos >= decay_steps:
+        return min_lr_ratio
+    progress = decay_pos / max(1, decay_steps)
+    if decay_type == "linear":
+        return 1.0 - (1.0 - min_lr_ratio) * progress
+    elif decay_type == "cosine":
+        return min_lr_ratio + 0.5 * (1 - min_lr_ratio) * (1 + math.cos(math.pi * progress))
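+    elif decay_type == "inv_sqrt":
+        # Approaches 1/sqrt(11) ~= 0.30 as progress -> 1, so with the default
+        # min_lr_ratio the multiplier steps down to 0.1 when the decay phase ends.
+        return max(min_lr_ratio, 1.0 / math.sqrt(1 + progress * 10))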
+    else:
+        raise ValueError(f"unknown decay_type: {decay_type}")
+def get_wsd_scheduler(optimizer: torch.optim.Optimizer,
+                      total_steps: int,
+                      warmup_steps: int = 500,
+                      stable_frac: float = 0.80,
+                      decay_frac: float = 0.15,
+                      min_lr_ratio: float = 0.1,
+                      decay_type: str = "linear") -> LambdaLR:
+    """Build a LambdaLR scheduler with WSD schedule."""
+    def fn(step):
+        return wsd_lr_schedule(step, total_steps, warmup_steps,
+                              stable_frac, decay_frac, min_lr_ratio, decay_type)
+    return LambdaLR(optimizer, lr_lambda=fn)
+if __name__ == "__main__":
+    # Visualize the schedule
+    total = 10000
+    warmup = 500
+    print(f"WSD schedule preview: total={total}, warmup={warmup}, stable=80%, decay=15%")
+    print("  step    lr_mult")
+    for s in [0, 250, 500, 1000, 5000, 8000, 8500, 9000, 9500, 9800, 9999]:
+        m = wsd_lr_schedule(s, total, warmup, 0.80, 0.15, 0.1, "linear")
+        print(f"  {s:>5}   {m:.4f}")
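
Because `stable_frac` and `decay_frac` apply to the post-warmup steps rather than to `total_steps`, the phase boundaries are not obvious from the defaults. The block below is a small worked example, assuming only the linear decay branch; `wsd_linear_multiplier` and `phase_boundaries` are hypothetical names that restate the arithmetic of `wsd_lr_schedule` without importing it.

```python
def wsd_linear_multiplier(step: int, total_steps: int, warmup_steps: int = 500,
                          stable_frac: float = 0.80, decay_frac: float = 0.15,
                          min_lr_ratio: float = 0.1) -> float:
    """Linear-decay WSD multiplier, restated for a worked example."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    remaining = total_steps - warmup_steps
    if remaining <= 0:
        return 1.0
    stable_steps = int(stable_frac * remaining)
    decay_steps = int(decay_frac * remaining)
    decay_pos = step - warmup_steps - stable_steps
    if decay_pos < 0:
        return 1.0                      # still in the stable phase
    if decay_pos >= decay_steps:
        return min_lr_ratio             # past the decay phase: hold the floor
    return 1.0 - (1.0 - min_lr_ratio) * decay_pos / max(1, decay_steps)


def phase_boundaries(total_steps: int, warmup_steps: int = 500,
                     stable_frac: float = 0.80,
                     decay_frac: float = 0.15) -> tuple[int, int]:
    """Return (first decay step, first step held at min_lr_ratio)."""
    remaining = total_steps - warmup_steps
    decay_start = warmup_steps + int(stable_frac * remaining)
    floor_start = decay_start + int(decay_frac * remaining)
    return decay_start, floor_start


decay_start, floor_start = phase_boundaries(10_000)
print(decay_start, floor_start)
for s in (250, 8_000, 8_500, 9_800):
    print(s, round(wsd_linear_multiplier(s, 10_000), 4))
```

With the defaults and `total_steps=10_000`, decay begins at step 8,100 and the floor is reached at step 9,525, so the last 475 steps run at `min_lr_ratio`.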