# PhysioJEPA: Learning Cardiovascular Dynamics via Time-Shifted Cross-Modal Prediction
*Oz Labs — Full Research Development Document — April 2026*
*Revision 2: post-reviewer critique. Replaces causalcardio_jepa_full.md*

---

## Change log from revision 2 (post-E0 audit, 2026-04-14)

- ECG input revised from 12-lead @ 500 Hz to **single lead II @ 250 Hz** (lead II present in 93.7% of HF-mirror segments; 12-lead not available in this dataset)
- ECG patch size revised: 200 ms = **50 samples @ 250 Hz**, 1D over single lead (was 2D (12, 25) @ 500 Hz)
- AF label source locked to **PTB-XL** (see `docs/af_label_decision.md`): MIMIC-IV-ECG path blocked by (a) ~381-patient cohort yielding <100 AF-positive, (b) missing PhysioNet credentialing. Paper now frames AF eval as a transfer claim
- PPG encoding locked to **raw patches** for v1 per E1 Stage-1 result (extraction rate 98.6% but Stage-2 probe deferred to ablation A1 when AF labels are integrated)
- Baseline A (ECG-JEPA) cannot load Weimann's 12-lead PTB-XL checkpoints; must retrain from scratch on single-lead II to be an honest comparison

## Change log from revision 1

- Renamed throughout from CausalCardio-JEPA → PhysioJEPA
- Core claim simplified to one sentence; PTT demoted from contribution to validation signal
- v1 architecture stripped to minimum: raw PPG patches, EMA only, no cardiac phase encoding, no SIGReg
- Morphological encoding, cardiac phase encoding, SIGReg moved to labelled ablations
- "Causal" language replaced throughout with "physiologically informed asymmetry" or "directional asymmetry"
- AnyPPG characterisation corrected: ECGFounder encoder is frozen during AnyPPG training
- Venue targets corrected to reflect actual 2026 deadlines
- PTT head reframed: validation signal, not contribution

---

## 1. The Hypothesis

**Core claim — one sentence:**

> Predicting PPG at a variable time offset Δt from ECG produces cardiovascular representations that encode vascular timing structure, while contrastive alignment at t=0 and predictive alignment at t=0 both destroy this structure.

**What this means concretely:**
After self-supervised pretraining on synchronized ECG+PPG without labels, the model should:

1. Predict PPG windows N beats ahead from ECG context with lower error than predicting mean PPG — the model is actually learning something
2. Outperform a symmetric JEPA trained at Δt=0 on downstream cardiovascular tasks — the temporal offset matters
3. Produce latent embeddings where PTT (measured post-hoc from the latent's optimal Δt) correlates with ground-truth PTT from peak detection — PTT is implicitly encoded
4. Show physiologically consistent rollout: predicted optimal Δt varies inversely with heart rate and inversely with blood pressure category (higher pressure stiffens the arteries and shortens PTT)

Points 1 and 2 are the paper. Points 3 and 4 are the supporting evidence.

**Why this is different from existing methods:**

Every prior cross-modal ECG-PPG method treats the two modalities as symmetric windows on the same cardiac state at the same moment:

- **AnyPPG** (Nie et al., 2511.01747): symmetric InfoNCE at t=0. Important nuance: the ECGFounder encoder is *frozen* during AnyPPG training — it functions as a fixed supervisory signal, not a jointly-learned representation. This means AnyPPG is not even learning a shared representation; it is distilling a frozen ECG model into a PPG encoder. Same-time alignment still applies.
- **TSTA-Net** (Liu et al., PMLR 2025): hierarchical contrastive learning with spatiotemporal alignment of ECG and PPG. Same-time alignment.
- **PPGFlowECG** (Fang et al., 2509.19774): uses InfoNCE instance alignment internally in Stage 1, then rectified flow generation in Stage 2. Both stages operate at t=0 alignment.
- **CardioGAN** (Sarkar & Etemad, AAAI 2021): CycleGAN-based adversarial waveform synthesis. Pixel-space signal translation, not representation learning. t=0.

All of them discard the ECG→PPG lag. The lag is the measurement: PTT ≈ 100–400ms encodes arterial stiffness, which encodes blood pressure via the Moens-Korteweg equation. PPGFlowECG even acknowledges this in Figure 1 ("ventricular electrical activation precedes the peripheral pulse") but their architecture doesn't use it.
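
For reference, the textbook relations behind this chain can be sketched as follows. The symbols are not defined elsewhere in this document and are assumptions added for illustration: $E$ is the arterial wall's elastic modulus, $h$ the wall thickness, $D$ the vessel diameter, $\rho$ blood density, $L$ the arterial path length, and $P$ the blood pressure. The exponential pressure dependence of $E$ is the commonly used empirical form with fitted constants $E_0$ and $\alpha$, not something derived here.

```latex
\mathrm{PWV} = \sqrt{\frac{E\,h}{\rho\,D}}
\qquad
\mathrm{PTT} = \frac{L}{\mathrm{PWV}}
\qquad
E = E_0\, e^{\alpha P}
\quad\Longrightarrow\quad
\mathrm{PTT} = L \sqrt{\frac{\rho\,D}{E_0\, h}}\; e^{-\alpha P / 2}
```

Under these assumptions a higher pressure raises $E$, raises pulse wave velocity, and shortens PTT, which is why the lag carries blood pressure information at all.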

**Why JEPA specifically:**

JEPA's implicit bias — shown formally by Balestriero & LeCun (LeJEPA, 2511.08544) and empirically by Weimann & Conrad (2410.13867) — is toward high-influence, predictable features. In a cardiac signal, the most stable and predictable cross-modal feature is the time-shifted PPG peak following the QRS complex. JEPA will naturally attend to this; symmetric InfoNCE cannot because it penalises the model for not aligning ECG(t) with PPG(t), actively destroying the lag information in order to minimise the contrastive loss.

---

## 2. Architecture

### v1 (what runs in the experiment matrix)

The minimum architecture needed to test the core claim. No unnecessary complexity.

```
INPUT  (revised post-E0, 2026-04-14)
───────────────────────────────────────────────────────
ECG:  [B, 1, 2500]   — lead II, 10s @ 250Hz (native HF-mirror rate)
PPG:  [B, 1, 1250]   — fingertip PPG (Pleth), 10s @ 125Hz (native)
Temporal alignment: sample-accurate (shared segment clock per HF record)

PREPROCESSING
───────────────────────────────────────────────────────
ECG:  bandpass 0.5–40 Hz → z-score normalisation per window
      R-peak detection (Pan-Tompkins) only used for PTT ground truth,
      not consumed by the encoder

PPG:  bandpass 0.5–8 Hz → z-score normalisation
      [v1: raw patches only — no morphological extraction]

Segments without lead II (~6.3%) are dropped.

TOKENISATION
───────────────────────────────────────────────────────
ECG context encoder:
  - 1D patch: 50 samples = 200ms @ 250Hz
  - 50 patches per 10s window
  - Linear projection → d=256
  - 1D sinusoidal positional encoding (time)
  [v1: single-lead; multi-lead 2D is deferred — only II/V/aVR consistently
   present, and the Δt claim is lead-agnostic]

PPG target encoder:
  - 1D patch: 25 samples = 200ms per patch
  - 50 patches per 10s window
  - Linear projection → d=256
  - 1D sinusoidal positional encoding
  [v1: raw patches — not morphological tokens]

ECG CONTEXT ENCODER  E_e
───────────────────────────────────────────────────────
ViT-S (adapted from Weimann & Conrad ECG-JEPA, 1D instead of 2D)
  12 transformer layers, d=256, 8 heads, MLP ratio=4
  I-JEPA masking within ECG (multi-block, 50% ratio) for auxiliary loss
  EMA updated: τ annealed 0.996→0.9999 over first 30% of training
  Note: cannot load Weimann's published 12-lead checkpoints directly;
  Baseline A retrains from scratch on single-lead II for fair comparison

PPG TARGET ENCODER  E_p  [EMA updated]
───────────────────────────────────────────────────────
ViT-T (lighter: 6 layers, d=256)
  No masking — encodes full PPG window as target
  EMA updated: same τ schedule as E_e
  [v1: EMA only — SIGReg is an ablation, not v1]

Δt EMBEDDING
───────────────────────────────────────────────────────
Scalar Δt ∈ [50ms, 500ms] → sinusoidal encoding → R^64
Linear projection → R^256
Added to predictor as conditioning token

DIRECTIONAL PREDICTOR  P
───────────────────────────────────────────────────────
4-layer cross-attention transformer
  Query:    positional tokens for target PPG window positions
  Key/Val:  ECG context latents z_e + Δt conditioning token
  Output:   predicted PPG latent ẑ_p(t+Δt)

The predictor sees no PPG input — only ECG latents + Δt.
This is the architectural enforcement of directional asymmetry.

LOSS FUNCTION (v1)
───────────────────────────────────────────────────────
L_total = L_cross + 0.3 * L_self

L_cross = L1(ẑ_p(t+Δt),  z_p(t+Δt))   ← main prediction loss
L_self  = L1(ẑ_e_masked, z_e_target)   ← auxiliary ECG self-prediction

[v1: no SIGReg, no PTT head in training loop]

Δt SAMPLING
───────────────────────────────────────────────────────
Per batch:
  60% log-uniform in [50ms, 500ms]
  40% ground-truth PTT measured from aligned dataset
```
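
The token counts in the spec follow from the sampling rates and patch sizes. Below is a minimal sketch of that bookkeeping in plain Python. `patchify` is a hypothetical helper name, not a function from the `kweimann/ECG-JEPA` codebase, and the zero-filled windows are placeholders for filtered, z-scored signals.

```python
def patchify(signal, patch_len):
    """Split a 1D sequence into non-overlapping patches of patch_len samples."""
    if len(signal) % patch_len != 0:
        raise ValueError("signal length must be a multiple of patch_len")
    return [signal[i:i + patch_len] for i in range(0, len(signal), patch_len)]


WINDOW_S = 10
ECG_RATE_HZ, ECG_PATCH = 250, 50   # lead II, values from the spec above
PPG_RATE_HZ, PPG_PATCH = 125, 25   # fingertip PPG, values from the spec above

# Placeholder windows (assumption: real inputs are filtered and z-scored).
ecg_window = [0.0] * (ECG_RATE_HZ * WINDOW_S)   # 2500 samples
ppg_window = [0.0] * (PPG_RATE_HZ * WINDOW_S)   # 1250 samples

ecg_patches = patchify(ecg_window, ECG_PATCH)
ppg_patches = patchify(ppg_window, PPG_PATCH)

# Patch duration in milliseconds for each modality.
ecg_patch_ms = 1000 * ECG_PATCH / ECG_RATE_HZ
ppg_patch_ms = 1000 * PPG_PATCH / PPG_RATE_HZ

print(len(ecg_patches), ecg_patch_ms)  # 50 200.0
print(len(ppg_patches), ppg_patch_ms)  # 50 200.0
```

Both modalities yield 50 tokens of 200 ms each per 10 s window, so ECG and PPG tokens line up one-to-one in time and a Δt of 200 ms corresponds to a shift of exactly one token.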

### Ablations (not v1 — run after E3 passes K2)

| Ablation | What changes | What it tests |
|----------|-------------|---------------|
| A1: Morphological PPG | PPG target encoder uses morphological tokens instead of raw patches | Does structured PPG encoding improve latent quality? |
| A2: Cardiac phase encoding | Add beat-phase positional encoding (P/QRS/ST/T) to ECG encoder | Does phase-aware PE beat standard 1D sinusoidal? |
| A3: SIGReg instead of EMA | Replace EMA with SIGReg (Balestriero & LeCun 2511.08544) | Is SIGReg more stable than EMA on cardiac signals? |
| A4: Joint PTT head | Add PTT regression MLP head to training loss (γ=0.1) | Does supervised PTT signal improve latent vascular encoding? |
| A5: Curriculum Δt | Start with ground-truth PTT only, introduce log-uniform Δt after 30% training | Does curriculum scheduling improve PTT coherence? |
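
The v1 Δt sampling mix and the A5 curriculum variant can be sketched together. This is a sketch under stated assumptions, not the training code: the function names are hypothetical, the 60/40 split is applied per sample as a Bernoulli draw (the spec only says "per batch"), and measured PTT values are clamped to the [50 ms, 500 ms] range of the Δt embedding.

```python
import math
import random

DT_MIN_MS, DT_MAX_MS = 50.0, 500.0


def clamp_dt(dt_ms):
    """Keep Δt inside the range the Δt embedding is defined on (assumption)."""
    return min(DT_MAX_MS, max(DT_MIN_MS, dt_ms))


def sample_log_uniform(rng):
    """Draw Δt in ms log-uniformly from [DT_MIN_MS, DT_MAX_MS]."""
    log_dt = rng.uniform(math.log(DT_MIN_MS), math.log(DT_MAX_MS))
    return clamp_dt(math.exp(log_dt))


def sample_dt(measured_ptt_ms, progress, rng, curriculum=False,
              p_log_uniform=0.6, warmup_frac=0.3):
    """Return one Δt in ms for a training sample.

    measured_ptt_ms: ground-truth PTT for this sample, from peak detection.
    progress: fraction of training completed, in [0, 1].
    curriculum: if True (ablation A5), use measured PTT only during warm-up.
    """
    if curriculum and progress < warmup_frac:
        return clamp_dt(measured_ptt_ms)
    if rng.random() < p_log_uniform:
        return sample_log_uniform(rng)
    return clamp_dt(measured_ptt_ms)


rng = random.Random(0)
draws = [sample_dt(220.0, progress=0.5, rng=rng) for _ in range(10_000)]
# Share of draws that came from the log-uniform branch (close to 0.6).
share_log_uniform = sum(d != 220.0 for d in draws) / len(draws)
print(round(share_log_uniform, 2))
```

Log-uniform sampling spends as many draws between 50 and 160 ms as between 160 and 500 ms, which keeps short offsets well represented.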

---

## 3. Required Resources

### Compute
- **E0–E2 (baseline suite)**: ~10 GPU-hours (3 baselines × 20 epochs × small data)
- **E3 (full training)**: ~48–72 hours on A100/H100 for 100 epochs
- **E4–E6**: ~10 GPU-hours (frozen encoder probes + ablations)
- **Full ablation suite (A1–A5)**: ~5 × 24h = 120 hours
- **Total to paper-ready**: ~200 GPU-hours ≈ $500 on Runpod H100
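
As a quick check on the total, the line items above can be tallied. The sketch assumes the E3 wall-clock hours are on a single GPU; the hourly rate is implied by the stated $500 for ~200 hours and is not a quoted price.

```python
# (low, high) estimates in GPU-hours, copied from the list above.
budget = {
    "E0-E2 baseline suite": (10, 10),
    "E3 full training": (48, 72),      # assumption: one GPU, so hours = GPU-hours
    "E4-E6 probes": (10, 10),
    "A1-A5 ablations": (5 * 24, 5 * 24),
}

total_low = sum(low for low, _ in budget.values())
total_high = sum(high for _, high in budget.values())
implied_rate_usd = 500 / 200   # dollars per GPU-hour implied by the total

print(total_low, total_high, implied_rate_usd)  # 188 212 2.5
```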

### Data
Primary: `lucky9-cyou/mimic-iv-aligned-ppg-ecg` (HuggingFace, instant)
Fallback (if E0 fails): PhysioNet BIDMC (ECG+PPG, documented alignment, open access)
PTT validation: MIMIC-BP curated dataset (UCL/UCI, 1,524 patients)

### Software
- Base codebase: `kweimann/ECG-JEPA` (MIT licence)
- PPG peak detection: `wfdb` + `scipy.signal`
- SIGReg (ablation A3): ~50 lines PyTorch, implement from Balestriero & LeCun 2511.08544
- Evaluation: `sklearn` linear probe + custom rollout harness

### People and timeline
- Guy: architecture, training loop, paper
- Zack: data pipeline, PPG encoder, evaluation harness
- Weeks 1–2: E0→E3 (go/no-go on K2)
- Weeks 3–4: E4→E6 + ablations (if green)
- Weeks 5–8: writing

---

## 4. Execution plan

See the experiment matrix document (`physiojep_experiment_matrix.md`) for day-by-day detail. Summary:

| Days | Task | Gate |
|------|------|------|
| 1–2 | E0: data audit | Dataset go/no-go |
| 3 | E1: PPG encoding decision | Architecture lock |
| 4–5 | E2: baseline suite | Floor + ceiling |
| 6–8 | E3: PhysioJEPA v1 | K1/K2/K3 at epoch 25 |
| 9–10 | E4: rollout coherence | World model evidence |
| 11–12 | E5: downstream probes | PTT/AF/HR numbers |
| 13–14 | E6: decisive ablation (Δt vs Δt=0) | Table 1 of paper |
| 15 | Green/yellow/red decision | What gets written |

---

## 5. Pitfalls and Failure Modes

### Pitfall 1: Dataset alignment coarser than 50ms
**Probability**: Medium. HuggingFace mirror is undocumented.
**Symptom**: PTT ground-truth variance >100ms within-patient
**Response**: Pivot to PhysioNet BIDMC immediately (2-day delay)
**Impact on claim**: Architecture identical; only provenance label changes
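
A minimal sketch of the within-patient check, assuming R-peak and PPG-peak times (in seconds, sorted ascending) have already been extracted. The helper names are hypothetical, the pairing rule (first PPG peak 50 to 500 ms after each R-peak) is an assumption, and "variance >100ms" is read here as a standard deviation above 100 ms.

```python
import statistics


def beat_ptts_ms(r_peaks_s, ppg_peaks_s, min_ptt_s=0.05, max_ptt_s=0.5):
    """Pair each R-peak with the first PPG peak 50-500 ms later; return PTTs in ms."""
    ptts = []
    j = 0
    for r in r_peaks_s:
        # Skip PPG peaks that fall before the earliest plausible arrival.
        while j < len(ppg_peaks_s) and ppg_peaks_s[j] < r + min_ptt_s:
            j += 1
        # Accept the candidate only if it arrives within the plausible window.
        if j < len(ppg_peaks_s) and ppg_peaks_s[j] <= r + max_ptt_s:
            ptts.append(1000.0 * (ppg_peaks_s[j] - r))
    return ptts


def alignment_suspect(ptts_ms, max_spread_ms=100.0):
    """Flag a patient whose beat-to-beat PTT spread exceeds the threshold."""
    return len(ptts_ms) >= 2 and statistics.pstdev(ptts_ms) > max_spread_ms


# Synthetic example: a steady 250 ms lag at 75 bpm.
r_peaks = [0.0, 0.8, 1.6]
ppg_peaks = [0.25, 1.05, 1.85]
ptts = beat_ptts_ms(r_peaks, ppg_peaks)
print(ptts, alignment_suspect(ptts))
```

Beats with no PPG peak in the window are dropped rather than paired with a later pulse, so a missed detection cannot inflate the spread.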

### Pitfall 2: Morphological PPG feature extraction unreliable
**Note**: This is now an ablation (A1), not v1. If E1 shows morphological encoding is unreliable, we simply don't run A1. This is no longer a project-killing risk.

### Pitfall 3: EMA collapse
**Probability**: Low. ECG-JEPA with EMA is validated at scale.
**Symptom**: Mean cosine sim >0.99 for 500 consecutive steps
**Response**: Reduce τ start to 0.99, check batch size; add SIGReg (ablation A3) earlier
**Monitoring**: Log every 100 steps from epoch 1
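
The τ schedule and the collapse symptom can be sketched as below. Assumptions: the anneal is linear (the spec does not name the curve), the cosine similarity is the mean over all distinct pairs of embeddings in a batch, and `CollapseMonitor` is a hypothetical helper that is updated once per step.

```python
import math


def ema_tau(step, total_steps, tau_start=0.996, tau_end=0.9999, anneal_frac=0.3):
    """Anneal tau linearly over the first anneal_frac of training, then hold."""
    anneal_steps = max(1, int(anneal_frac * total_steps))
    frac = min(1.0, step / anneal_steps)
    return tau_start + frac * (tau_end - tau_start)


def mean_pairwise_cosine(embeddings):
    """Mean cosine similarity over all distinct pairs of embedding vectors."""
    if len(embeddings) < 2:
        raise ValueError("need at least two embeddings")
    normed = []
    for vec in embeddings:
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0  # guard zero vectors
        normed.append([x / norm for x in vec])
    n = len(normed)
    total = sum(
        sum(a * b for a, b in zip(normed[i], normed[j]))
        for i in range(n) for j in range(i + 1, n)
    )
    return total / (n * (n - 1) / 2)


class CollapseMonitor:
    """Signal collapse once similarity stays above threshold for `patience` steps."""

    def __init__(self, threshold=0.99, patience=500):
        self.threshold = threshold
        self.patience = patience
        self.streak = 0

    def update(self, mean_cos):
        self.streak = self.streak + 1 if mean_cos > self.threshold else 0
        return self.streak >= self.patience


monitor = CollapseMonitor()
batch = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # toy, well-spread embeddings
print(ema_tau(0, 1000), monitor.update(mean_pairwise_cosine(batch)))
```

Lowering `tau_start` to 0.99, as in the response above, only changes the first argument of the schedule; the monitor is unaffected.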

### Pitfall 4: Cross-modal loss never beats mean baseline (K1)
**Probability**: Low-medium. Depends on dataset quality.
**Symptom**: L_cross plateau above 0.85× mean-PPG-latent baseline
**Response**: Check data quality, increase window overlap, verify EMA schedule
**Nuclear option**: Pivot to Architecture A (temporal ECG-JEPA, unimodal) — reuses all code

### Pitfall 5 (critical): Δt-aware ≈ t-aligned (K2)
**Probability**: Unknown — this is the central empirical question.
**Symptom**: E3 AUROC ≈ Baseline B AUROC (within 0.02)
**Response**: This is the K2 failure mode. The core claim is wrong on this data at this scale.
**Pivot options**: Architecture A, Study 4 (anomaly detection), or re-run on BIDMC

### Pitfall 6: Shortcut learning
**Probability**: Medium, especially early in training.
**Symptom**: Model predicts mean PPG morphology for all inputs; L_cross decreases but predictions are identical regardless of ECG input
**Detection**: Compute per-patient prediction variance — if near zero, shortcut is occurring
**Response**: Increase batch diversity, add within-patient hard negatives to Δt sampling
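
The detection step can be sketched as follows, assuming predicted PPG latents are available as plain lists of floats alongside patient IDs. The function names and both thresholds (`eps`, `min_fraction`) are hypothetical and would need tuning on real latents.

```python
import statistics
from collections import defaultdict


def per_patient_prediction_variance(patient_ids, predictions):
    """Mean per-dimension variance of predicted latents, grouped by patient."""
    by_patient = defaultdict(list)
    for pid, pred in zip(patient_ids, predictions):
        by_patient[pid].append(pred)
    variances = {}
    for pid, preds in by_patient.items():
        if len(preds) < 2:
            continue  # variance is undefined for a single prediction
        per_dim = [statistics.pvariance(dim) for dim in zip(*preds)]
        variances[pid] = statistics.fmean(per_dim)
    return variances


def shortcut_suspected(variances, eps=1e-4, min_fraction=0.9):
    """True if nearly all patients have near-constant predictions."""
    if not variances:
        return False
    flat = sum(v < eps for v in variances.values())
    return flat / len(variances) >= min_fraction


ids = ["a", "a", "b", "b"]
preds = [[1.0, 2.0], [1.0, 2.0], [0.0, 0.0], [2.0, 4.0]]
variances = per_patient_prediction_variance(ids, preds)
print(variances, shortcut_suspected(variances))
```

In this toy input patient `a` receives identical predictions for both windows while patient `b` does not, so only half the cohort is flat and the check does not fire.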

### Pitfall 7: PTT coherence fails (E4 passes but PTT probe fails)
**Probability**: Low-medium.
**Implication**: The temporal structure is encoded nonlinearly. Try 3-layer MLP probe instead of linear. If that fails, this is a limitation — remove PTT probe from paper claims but keep E4 rollout coherence evidence.

---

## 6. Checkpoints

| # | When | Pass criterion | Fail action |
|---|------|----------------|-------------|
| C1 | Day 2 | Alignment ≤50ms; ≥500 patients; missing ≤20% | Pivot to BIDMC |
| C2 | Day 3 | E1 decision made and committed | Block on architecture |
| C3 | Day 5 | Baseline B training stable (no collapse) | Add SIGReg to E3 from start |
| C4 | Day 8 (epoch 25) | K1: L_cross < 0.85× mean baseline | Fix or exit |
| C5 | Day 8 (epoch 25) | K2: E3 AUROC > Baseline B + 0.02 | Paper doesn't exist |
| C6 | Day 8 (epoch 25) | K3: E3 AUROC ≥ Baseline A − 0.01 | Reduce PPG encoder capacity |
| C7 | Day 10 | E4: Spearman(optimal Δt, ground-truth PTT) > 0.30 | Keep as limitation |
| C8 | Day 12 | E5: PTT probe MAE ≥20% below naive baseline | 3-layer MLP probe fallback |
| C9 | Day 14 | E6: Δt>0 model beats Δt=0 model on ≥2 of 3 metrics | Re-examine K2 |
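
The three epoch-25 gates (C4 to C6) reduce to simple comparisons. Here is a sketch with a hypothetical function name and made-up input numbers; only the thresholds come from the table.

```python
def evaluate_gates(l_cross, mean_baseline_loss, auroc_e3,
                   auroc_baseline_a, auroc_baseline_b):
    """Return pass/fail for K1, K2 and K3 using the thresholds in the table."""
    return {
        # K1: cross-modal loss must beat the mean-PPG-latent baseline by 15%.
        "K1": l_cross < 0.85 * mean_baseline_loss,
        # K2: Δt-aware model must beat the Δt=0 baseline by more than 0.02 AUROC.
        "K2": auroc_e3 > auroc_baseline_b + 0.02,
        # K3: must not fall more than 0.01 AUROC below ECG-only JEPA.
        "K3": auroc_e3 >= auroc_baseline_a - 0.01,
    }


# Illustrative numbers only; these are not results.
gates = evaluate_gates(l_cross=0.70, mean_baseline_loss=1.00,
                       auroc_e3=0.93, auroc_baseline_a=0.92,
                       auroc_baseline_b=0.90)
print(gates)  # {'K1': True, 'K2': True, 'K3': True}
```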

---

## 7. Evaluation Protocol

### Primary metrics (determine the paper)

**E3 / E6 — Core claim test**

| Metric | What it tests | Baseline |
|--------|--------------|---------|
| AF detection AUROC (linear probe, frozen) | Representation quality | ECG-JEPA: 0.945 (Weimann 2410.13867) |
| HR regression R² (linear probe, frozen) | Cardiovascular signal content | RR-interval baseline |
| ECG-PPG retrieval R@1 | Cross-modal alignment | AnyPPG: 0.736 |

**E4 — World model evidence (rollout coherence)**

| Check | Pass criterion |
|-------|---------------|
| Spearman(optimal Δt, measured PTT) | > 0.30 |
| HR-PTT inverse ordering | Significant, p < 0.05 |
| U-shaped prediction error curve | ≥60% of patients |
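
The first and third checks can be sketched in plain Python. Assumptions: "optimal Δt" is the offset with the lowest prediction error in a per-patient sweep, and "U-shaped" means that minimum lies strictly inside the swept range. Neither definition is fixed by this document, and the function names are hypothetical.

```python
import math


def optimal_dt(dts_ms, errors):
    """Δt with the lowest prediction error in a sweep."""
    return min(zip(errors, dts_ms))[1]


def is_u_shaped(errors):
    """True if the error minimum is strictly inside the sweep (assumed definition)."""
    k = errors.index(min(errors))
    return 0 < k < len(errors) - 1


def ranks(values):
    """1-based ranks, with ties given their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


sweep_dts = [50, 100, 200, 400]
sweep_err = [0.9, 0.5, 0.3, 0.7]        # illustrative numbers, not results
print(optimal_dt(sweep_dts, sweep_err), is_u_shaped(sweep_err))  # 200 True
```

Applied per patient, `optimal_dt` gives one value each; `spearman` of those values against measured PTT is then compared with the 0.30 threshold.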

**E5 — Downstream validation**

| Task | Metric | Framing |
|------|--------|---------|
| PTT regression (linear probe) | MAE (ms) vs naive | Validation only — not the contribution |
| AF sample efficiency | AUROC at 1/5/10/100% labels | JEPA sample efficiency advantage |

### Evaluation philosophy

Table 1 of the paper (from E6): a 4-row × 4-column table showing Baseline A (ECG-JEPA), Baseline B (Δt=0), Baseline C (InfoNCE), and PhysioJEPA across AF AUROC, HR R², PTT correlation, and retrieval R@1. If rows 3 and 4 are clearly separated, the paper exists.

The PTT probe and rollout coherence are supporting figures. They interpret why the representation quality is better. They do not constitute the primary claim.

---

## 8. Critic — Strongest Arguments Against

### Critic 1: PTT can be computed with peak detection in 10 lines of code

**Correct.** That is exactly why PTT is a *validation signal*, not the contribution. We are not claiming novelty in PTT computation. We are claiming that a model trained on the Δt prediction objective implicitly encodes PTT in its latent space — which is evidence that the latent captures vascular dynamics rather than just cardiac rhythm. If the same latent did *not* encode PTT, we would doubt that it learned anything physiologically meaningful.

### Critic 2: Small dataset vs AnyPPG's 100k+ hours

**Conceded.** We are not competing at scale. The comparison is controlled: PhysioJEPA vs Baseline C (InfoNCE) trained on the same N hours. The architectural claim is about inductive bias on fixed data, not about scale. We report this comparison explicitly.

### Critic 3: "Physiological asymmetry" is just an architectural choice, not a principled claim

**Partially conceded.** The architecture encodes a *hypothesis* about the direction of information flow (ECG→PPG). Baseline B (Δt=0) isolates the time offset, not the direction, so the relevant control is a symmetric Δt>0 variant in which PPG predicts ECG at the same lag. If that variant performs identically to PhysioJEPA, the asymmetry contributed nothing and we remove it from the contribution list. The ablation is the test.

### Critic 4: The Δt sampling mixing ratio (60/40) is a hyperparameter

**Correct.** Ablation A5 (curriculum Δt) tests whether this specific ratio matters. For v1 we use 60/40 pragmatically; if A5 shows a different schedule is better, we adopt it. This is not a fundamental weakness — it is a hyperparameter like any other.

### Critic 5: Shortcut — the model predicts mean PPG for all inputs

**Real risk.** Explicitly monitored via per-patient prediction variance (Pitfall 6). If detected, addressed before any results are reported.

---

## 9. Reviewer Critiques (updated post-feedback)

The reviewer critique document (provided separately) raised seven structural issues. Status of each:

| Issue | Status | Resolution |
|-------|--------|-----------|
| 3 contributions in 1 paper | Fixed | Core claim reduced to one sentence; PTT and morphology are evidence/ablations |
| PTT head framing backwards | Fixed | PTT is validation signal; cross-modal Δt prediction is the claim |
| Morphological encoding = #1 technical risk | Fixed | Moved to ablation A1; not in v1 |
| "Causal" overclaimed | Fixed | Renamed to PhysioJEPA; language changed to "directional asymmetry" / "physiologically informed" |
| Core idea not isolated | Fixed | E3 vs Baseline B (Δt=0) is the controlled isolation; both are identical except Δt |
| Baselines needed from Week 1 | Fixed | E2 baseline suite runs days 4–5, before E3 |
| "World model" evaluation missing | Fixed | E4 rollout coherence is explicit and uses physiological consistency checks |

---

## 10. Open Questions

**Q1: How well is the MIMIC-IV aligned PPG-ECG dataset actually aligned?**
Unknown until E0. The most important unanswered question. Answer by Day 2.

**Q2: Does the asymmetric architecture (ECG predicts PPG, not PPG predicts ECG) outperform the symmetric version?**
None of the ablations A1–A5 tests this. Baseline B isolates Δt but not directionality; adding a symmetric Δt>0 variant (PPG predicts ECG with the same lag) would test it separately. Lower priority; add if time permits.

**Q3: Does the cross-modal training improve the ECG encoder relative to ECG-only training?**
K3 tests this: E3 AUROC should match Baseline A (ECG-JEPA alone). If it's worse, the cross-modal objective is hurting the ECG representation. This would be a significant negative result worth reporting.

**Q4: How does the model behave during AF?**
AF removes the periodic P-wave and makes RR intervals irregular. The Δt sweep in E4 may fail to find a meaningful optimum during AF episodes. That failure is itself interesting: the model's inability to settle on a stable optimal Δt during AF could be a detection signal. Monitor in E4.

**Q5: Is MIMIC-BP the right held-out dataset for PTT validation?**
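MIMIC-BP (Kachuee et al.) is derived from MIMIC-III; the training data is MIMIC-IV-derived. Both come from the same institution (BIDMC) and a similar population, and because MIMIC-III (2001–2012) and MIMIC-IV (2008 onward) cover overlapping admission years, the absence of patient overlap cannot be assumed and should be checked. The setup is reasonable, but it should be documented carefully to pre-empt reviewer concerns about distribution leakage.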

---

## 11. Paper Identity and Venues

**Title**: *PhysioJEPA: Learning Cardiovascular Dynamics via Time-Shifted Cross-Modal Prediction*

**One-paragraph abstract (draft)**:
Contrastive self-supervised methods for ECG-PPG representation learning align same-time signals in a shared embedding space, discarding the physiological lag between cardiac electrical activation and peripheral perfusion. This lag, the pulse transit time (PTT), encodes arterial stiffness and correlates with blood pressure. We introduce PhysioJEPA, a JEPA-based world model that instead trains an ECG encoder to predict PPG latents at a variable time offset Δt, preserving and exploiting the directional temporal structure that contrastive methods destroy. We show that Δt-aware prediction produces cardiovascular representations that (1) outperform same-time alignment, both contrastive and predictive, on downstream cardiovascular tasks, (2) implicitly encode PTT without label supervision, as demonstrated by rollout coherence tests and linear probing, and (3) transfer more efficiently from limited labelled data than InfoNCE-trained baselines. Code and models are released under an open licence.

**Venue targets (updated with real 2026 deadlines)**:

| Venue | Deadline | Type | Fit |
|-------|----------|------|-----|
| NeurIPS 2026 workshops (TS4H, BrainBodyFM) | ~August 2026 | Workshop (non-archival) | Strong — 4-page format, time series + health |
| ML4H 2026 | ~September 2026 (estimated from 2025 pattern) | Symposium (archival proceedings track) | Strong — healthcare ML focus, 8 pages |
| ICLR 2027 | ~October 2026 | Conference (archival) | Stretch — needs clean ablations and strong Table 1 |
| NeurIPS 2026 main | May 6, 2026 | Conference (archival) | Too soon — experiment matrix runs through mid-May |

**Realistic path**: NeurIPS 2026 workshop (TS4H) as the first landing point (~August deadline, results from experiment matrix available by then); ML4H 2026 as the archival target; ICLR 2027 as stretch if the rollout coherence result is strong.

---

*Document revision 2 — April 2026*
*All "CausalCardio-JEPA" references replaced. Reviewer feedback incorporated.*
*Active documents: this file + physiojep_experiment_matrix.md*