# PhysioJEPA: Learning Cardiovascular Dynamics via Time-Shifted Cross-Modal Prediction

*Oz Labs, Full Research Development Document, April 2026*
*Revision 2: post-reviewer critique. Replaces causalcardio_jepa_full.md*

---

## Change log from revision 2 (post-E0 audit, 2026-04-14)

- ECG input revised from 12-lead @ 500 Hz to **single lead II @ 250 Hz** (lead II is present in 93.7% of HF-mirror segments; 12-lead is not available in this dataset)
- ECG patch size revised: 200 ms = **50 samples @ 250 Hz**, 1D over a single lead (was 2D (12, 25) @ 500 Hz)
- AF label source locked to **PTB-XL** (see `docs/af_label_decision.md`): the MIMIC-IV-ECG path is blocked by (a) a ~381-patient cohort yielding <100 AF-positive cases and (b) missing PhysioNet credentialing. The paper now frames the AF evaluation as a transfer claim
- PPG encoding locked to **raw patches** for v1 per the E1 Stage-1 result (extraction rate 98.6%, but the Stage-2 probe is deferred to ablation A1, when AF labels are integrated)
- Baseline A (ECG-JEPA) cannot load Weimann's 12-lead PTB-XL checkpoints; it must be retrained from scratch on single-lead II to be an honest comparison

## Change log from revision 1

- Renamed throughout from CausalCardio-JEPA → PhysioJEPA
- Core claim simplified to one sentence; PTT demoted from contribution to validation signal
- v1 architecture stripped to the minimum: raw PPG patches, EMA only, no cardiac phase encoding, no SIGReg
- Morphological encoding, cardiac phase encoding, and SIGReg moved to labelled ablations
- "Causal" language replaced throughout with "physiologically informed asymmetry" or "directional asymmetry"
- AnyPPG characterisation corrected: the ECGFounder encoder is frozen during AnyPPG training
- Venue targets corrected to reflect actual 2026 deadlines
- PTT head reframed: validation signal, not contribution

---

## 1. The Hypothesis

**Core claim (one sentence):**

> Predicting PPG at a variable time offset Δt from ECG produces cardiovascular representations that encode vascular timing structure, while contrastive alignment at t=0 and predictive alignment at t=0 both destroy this structure.

**What this means concretely:**

After self-supervised pretraining on synchronized ECG+PPG without labels, the model should:

1. Predict PPG windows at a time offset Δt from ECG context with lower error than predicting the mean PPG, which shows the model is actually learning something
2. Outperform a symmetric JEPA trained at Δt=0 on downstream cardiovascular tasks, which shows the temporal offset matters
3. Produce latent embeddings where PTT (measured post-hoc from the latent's optimal Δt) correlates with ground-truth PTT from peak detection, which shows PTT is implicitly encoded
4. Show physiologically consistent rollout: the predicted optimal Δt varies inversely with heart rate and inversely with blood pressure category (higher pressure stiffens the arteries and shortens PTT)

Points 1 and 2 are the paper. Points 3 and 4 are the supporting evidence.

**Why this is different from existing methods:**

Every prior cross-modal ECG-PPG method treats the two modalities as symmetric windows on the same cardiac state at the same moment:

- **AnyPPG** (Nie et al., 2511.01747): symmetric InfoNCE at t=0. Important nuance: the ECGFounder encoder is *frozen* during AnyPPG training; it functions as a fixed supervisory signal, not a jointly learned representation. This means AnyPPG is not even learning a shared representation; it is distilling a frozen ECG model into a PPG encoder. Same-time alignment still applies.
- **TSTA-Net** (Liu et al., PMLR 2025): hierarchical contrastive learning with spatiotemporal alignment of ECG and PPG. Same-time alignment.
- **PPGFlowECG** (Fang et al., 2509.19774): uses InfoNCE instance alignment internally in Stage 1, then rectified flow generation in Stage 2. Both stages operate at t=0 alignment.
- **CardioGAN** (Sarkar & Etemad, AAAI 2021): CycleGAN-based adversarial waveform synthesis. Pixel-space signal translation, not representation learning. t=0.

All of them discard the ECG→PPG lag. The lag is the measurement: PTT ≈ 100–400 ms reflects pulse wave velocity, which the Moens-Korteweg equation ties to arterial stiffness and, through stiffness, to blood pressure. PPGFlowECG even acknowledges this in Figure 1 ("ventricular electrical activation precedes the peripheral pulse"), but its architecture does not use it.

**Why JEPA specifically:**

JEPA's implicit bias (shown formally by Balestriero & LeCun (LeJEPA, 2511.08544) and empirically by Weimann & Conrad (2410.13867)) is toward high-influence, predictable features. In a cardiac signal, the most stable and predictable cross-modal feature is the time-shifted PPG peak following the QRS complex. JEPA should naturally attend to this; symmetric InfoNCE cannot, because it penalises the model for not aligning ECG(t) with PPG(t), actively destroying the lag information in order to minimise the contrastive loss.

---

## 2. Architecture

### v1 (what runs in the experiment matrix)

The minimum architecture needed to test the core claim. No unnecessary complexity.

```
INPUT (revised post-E0, 2026-04-14)
───────────────────────────────────────────────────────
ECG: [B, 1, 2500]: lead II, 10s @ 250Hz (native HF-mirror rate)
PPG: [B, 1, 1250]: fingertip PPG (Pleth), 10s @ 125Hz (native)
Temporal alignment: sample-accurate (shared segment clock per HF record)

PREPROCESSING
───────────────────────────────────────────────────────
ECG: bandpass 0.5–40 Hz → z-score normalisation per window
     R-peak detection (Pan-Tompkins) only used for PTT ground truth,
     not consumed by the encoder
PPG: bandpass 0.5–8 Hz → z-score normalisation
     [v1: raw patches only, no morphological extraction]
Segments without lead II (~6.3%) are dropped.

TOKENISATION
───────────────────────────────────────────────────────
ECG context encoder:
  - 1D patch: 50 samples = 200ms @ 250Hz
  - 50 patches per 10s window
  - Linear projection → d=256
  - 1D sinusoidal positional encoding (time)
  [v1: single-lead; multi-lead 2D is deferred, since only II/V/aVR are
   consistently present and the Δt claim is lead-agnostic]

PPG target encoder:
  - 1D patch: 25 samples = 200ms @ 125Hz
  - 50 patches per 10s window (1250 samples / 25)
  - Linear projection → d=256
  - 1D sinusoidal positional encoding
  [v1: raw patches, not morphological tokens]

ECG CONTEXT ENCODER E_e
───────────────────────────────────────────────────────
ViT-S (adapted from Weimann & Conrad ECG-JEPA, 1D instead of 2D)
12 transformer layers, d=256, 8 heads, MLP ratio=4
I-JEPA masking within ECG (multi-block, 50% ratio) for auxiliary loss
EMA updated: τ annealed 0.996→0.9999 over first 30% of training
Note: cannot load Weimann's published 12-lead checkpoints directly;
      Baseline A retrains from scratch on single-lead II for fair comparison

PPG TARGET ENCODER E_p [EMA updated]
───────────────────────────────────────────────────────
ViT-T (lighter: 6 layers, d=256)
No masking: encodes the full PPG window as target
EMA updated: same τ schedule as E_e
[v1: EMA only; SIGReg is an ablation, not v1]

Δt EMBEDDING
───────────────────────────────────────────────────────
Scalar Δt ∈ [50ms, 500ms] → sinusoidal encoding → R^64
Linear projection → R^256
Added to predictor as conditioning token

CROSS-MODAL PREDICTOR P
───────────────────────────────────────────────────────
4-layer cross-attention transformer
Query:   positional tokens for target PPG window positions
Key/Val: ECG context latents z_e + Δt conditioning token
Output:  predicted PPG latent ẑ_p(t+Δt)

The predictor sees no PPG input, only ECG latents + Δt.
This is the architectural enforcement of directional asymmetry.

LOSS FUNCTION (v1)
───────────────────────────────────────────────────────
L_total = L_cross + 0.3 * L_self

L_cross = L1(ẑ_p(t+Δt), z_p(t+Δt))    ← main prediction loss
L_self  = L1(ẑ_e_masked, z_e_target)   ← auxiliary ECG self-prediction

[v1: no SIGReg, no PTT head in training loop]

Δt SAMPLING
───────────────────────────────────────────────────────
Per batch: 60% log-uniform in [50ms, 500ms]
           40% ground-truth PTT measured from aligned dataset
```

### Ablations (not v1; run after E3 passes K2)

| Ablation | What changes | What it tests |
|----------|-------------|---------------|
| A1: Morphological PPG | PPG target encoder uses morphological tokens instead of raw patches | Does structured PPG encoding improve latent quality? |
| A2: Cardiac phase encoding | Add beat-phase positional encoding (P/QRS/ST/T) to ECG encoder | Does phase-aware PE beat the standard 1D sinusoidal encoding used in v1? |
| A3: SIGReg instead of EMA | Replace EMA with SIGReg (Balestriero & LeCun 2511.08544) | Is SIGReg more stable than EMA on cardiac signals? |
| A4: Joint PTT head | Add PTT regression MLP head to training loss (γ=0.1) | Does supervised PTT signal improve latent vascular encoding? |
| A5: Curriculum Δt | Start with ground-truth PTT only, introduce log-uniform Δt after 30% of training | Does curriculum scheduling improve PTT coherence? |

---

## 3. Required Resources

### Compute

- **E0–E2 (baseline suite)**: ~10 GPU-hours (3 baselines × 20 epochs × small data)
- **E3 (full training)**: ~48–72 hours on A100/H100 for 100 epochs
- **E4–E6**: ~10 GPU-hours (frozen encoder probes + ablations)
- **Full ablation suite (A1–A5)**: ~5 × 24h = 120 hours
- **Total to paper-ready**: ~200 GPU-hours ≈ $500 on Runpod H100

### Data

Primary: `lucky9-cyou/mimic-iv-aligned-ppg-ecg` (HuggingFace, instant)
Fallback (if E0 fails): PhysioNet BIDMC (ECG+PPG, documented alignment, open access)
PTT validation: MIMIC-BP curated dataset (UCL/UCI, 1,524 patients)

### Software

- Base codebase: `kweimann/ECG-JEPA` (MIT licence)
- PPG peak detection: `wfdb` + `scipy.signal`
- SIGReg (ablation A3): ~50 lines PyTorch, implement from Balestriero & LeCun 2511.08544
- Evaluation: `sklearn` linear probe + custom rollout harness

### People and timeline

- Guy: architecture, training loop, paper
- Zack: data pipeline, PPG encoder, evaluation harness
- Weeks 1–2: E0→E3 (go/no-go on K2)
- Weeks 3–4: E4→E6 + ablations (if green)
- Weeks 5–8: writing

---

## 4. Execution plan

See the experiment matrix document (`physiojep_experiment_matrix.md`) for day-by-day detail. Summary:

| Days | Task | Gate |
|------|------|------|
| 1–2 | E0: data audit | Dataset go/no-go |
| 3 | E1: PPG encoding decision | Architecture lock |
| 4–5 | E2: baseline suite | Floor + ceiling |
| 6–8 | E3: PhysioJEPA v1 | K1/K2/K3 at epoch 25 |
| 9–10 | E4: rollout coherence | World model evidence |
| 11–12 | E5: downstream probes | PTT/AF/HR numbers |
| 13–14 | E6: decisive ablation (Δt vs Δt=0) | Table 1 of paper |
| 15 | Green/yellow/red decision | What gets written |

---

## 5. Pitfalls and Failure Modes

### Pitfall 1: Dataset alignment coarser than 50ms

**Probability**: Medium. The HuggingFace mirror is undocumented.
**Symptom**: PTT ground-truth variance >100ms within-patient
**Response**: Pivot to PhysioNet BIDMC immediately (2-day delay)
**Impact on claim**: Architecture identical; only the provenance label changes

### Pitfall 2: Morphological PPG feature extraction unreliable

**Note**: This is now an ablation (A1), not v1. If E1 shows morphological encoding is unreliable, we simply don't run A1. This is no longer a project-killing risk.

### Pitfall 3: EMA collapse

**Probability**: Low. ECG-JEPA with EMA is validated at scale.
**Symptom**: Mean cosine sim >0.99 for 500 consecutive steps
**Response**: Reduce τ start to 0.99, check batch size; add SIGReg (ablation A3) earlier
**Monitoring**: Log every 100 steps from epoch 1

### Pitfall 4: Cross-modal loss never beats mean baseline (K1)

**Probability**: Low-medium. Depends on dataset quality.
**Symptom**: L_cross plateaus above 0.85× the mean-PPG-latent baseline
**Response**: Check data quality, increase window overlap, verify EMA schedule
**Nuclear option**: Pivot to Architecture A (temporal ECG-JEPA, unimodal), which reuses all code

### Pitfall 5 (critical): Δt-aware ≈ t-aligned (K2)

**Probability**: Unknown. This is the central empirical question.
**Symptom**: E3 AUROC ≈ Baseline B AUROC (within 0.02)
**Response**: This is the K2 failure mode. The core claim is wrong on this data at this scale.
**Pivot options**: Architecture A, Study 4 (anomaly detection), or re-run on BIDMC

### Pitfall 6: Shortcut learning

**Probability**: Medium, especially early in training.
**Symptom**: Model predicts mean PPG morphology for all inputs; L_cross decreases but predictions are identical regardless of ECG input
**Detection**: Compute per-patient prediction variance; if it is near zero, the shortcut is occurring
**Response**: Increase batch diversity, add within-patient hard negatives to Δt sampling

### Pitfall 7: PTT coherence fails (E4 passes but PTT probe fails)

**Probability**: Low-medium.
**Implication**: The temporal structure is encoded nonlinearly.
Try a 3-layer MLP probe instead of a linear one. If that fails, this is a limitation: remove the PTT probe from the paper's claims but keep the E4 rollout coherence evidence.

---

## 6. Checkpoints

| # | When | Pass criterion | Fail action |
|---|------|----------------|-------------|
| C1 | Day 2 | Alignment ≤50ms; ≥500 patients; missing ≤20% | Pivot to BIDMC |
| C2 | Day 3 | E1 decision made and committed | Block on architecture |
| C3 | Day 5 | Baseline B training stable (no collapse) | Add SIGReg to E3 from start |
| C4 | Day 8 (epoch 25) | K1: L_cross < 0.85× mean baseline | Fix or exit |
| C5 | Day 8 (epoch 25) | K2: E3 AUROC > Baseline B + 0.02 | Paper doesn't exist |
| C6 | Day 8 (epoch 25) | K3: E3 AUROC ≥ Baseline A − 0.01 | Reduce PPG encoder capacity |
| C7 | Day 10 | E4: Spearman(optimal Δt, ground-truth PTT) > 0.30 | Keep as limitation |
| C8 | Day 12 | E5: PTT probe MAE < naive by 20% | 3-layer MLP probe fallback |
| C9 | Day 14 | E6: Δt>0 > Δt=0 on ≥2 of 3 metrics | Re-examine K2 |

---

## 7. Evaluation Protocol

### Primary metrics (determine the paper)

**E3 / E6: Core claim test**

| Metric | What it tests | Baseline |
|--------|--------------|---------|
| AF detection AUROC (linear probe, frozen) | Representation quality | ECG-JEPA: 0.945 (Weimann 2410.13867) |
| HR regression R² (linear probe, frozen) | Cardiovascular signal content | RR-interval baseline |
| ECG-PPG retrieval R@1 | Cross-modal alignment | AnyPPG: 0.736 |

**E4: World model evidence (rollout coherence)**

| Check | Pass criterion |
|-------|---------------|
| Spearman(optimal Δt, measured PTT) | > 0.30 |
| HR-PTT inverse ordering | Significant, p < 0.05 |
| U-shaped prediction error curve | ≥60% of patients |

**E5: Downstream validation**

| Task | Metric | Framing |
|------|--------|---------|
| PTT regression (linear probe) | MAE (ms) vs naive | Validation only, not the contribution |
| AF sample efficiency | AUROC at 1/5/10/100% labels | JEPA sample efficiency advantage |

### Evaluation philosophy
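As one concrete reading of two of the E4 pass criteria tabulated above, the following is a minimal, framework-free sketch rather than the project's rollout harness. It assumes each patient has a prediction-error curve evaluated on a shared Δt grid, takes the optimal Δt as the grid point with minimum error, and treats a curve as "U-shaped" when that minimum lies strictly inside the grid; this U-shape definition, the function names, and the argument layout are all illustrative assumptions. The HR-PTT ordering test is not covered here.

```python
from typing import Dict, List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rho as the Pearson correlation of ranks (nan if one side is constant)."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy) if sx > 0 and sy > 0 else float("nan")


def e4_rollout_checks(
    dt_grid_ms: Sequence[float],
    error_curves: Sequence[Sequence[float]],  # one curve per patient, aligned to dt_grid_ms
    measured_ptt_ms: Sequence[float],         # one peak-detection PTT per patient
    rho_min: float = 0.30,                    # threshold from the E4 table
    u_frac_min: float = 0.60,                 # threshold from the E4 table
) -> Dict[str, object]:
    # Index of the minimum-error Δt for each patient.
    best_idx = [min(range(len(c)), key=lambda k: c[k]) for c in error_curves]
    optimal_dt = [dt_grid_ms[k] for k in best_idx]
    rho = spearman(optimal_dt, measured_ptt_ms)
    # Assumption: "U-shaped" means the minimum is strictly interior to the sweep.
    u_fraction = sum(0 < k < len(dt_grid_ms) - 1 for k in best_idx) / len(best_idx)
    return {
        "rho": rho,
        "u_fraction": u_fraction,
        "pass_rho": rho > rho_min,
        "pass_u": u_fraction >= u_frac_min,
    }


if __name__ == "__main__":
    grid = [100.0, 200.0, 300.0, 400.0]
    curves = [[0.9, 0.2, 0.5, 0.8], [0.9, 0.6, 0.1, 0.7], [0.1, 0.4, 0.6, 0.9]]
    print(e4_rollout_checks(grid, curves, measured_ptt_ms=[210.0, 290.0, 120.0]))
```

In practice `scipy.stats.spearmanr` would likely replace the hand-rolled rank correlation; it is written out here only to keep the sketch dependency-free.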
Table 1 of the paper (from E6): a 4-row × 4-column table showing Baseline A (ECG-JEPA), Baseline B (Δt=0), Baseline C (InfoNCE), and PhysioJEPA across AF AUROC, HR R², PTT correlation, and retrieval R@1. If rows 3 and 4 are clearly separated, the paper exists.

The PTT probe and rollout coherence are supporting figures. They interpret why the representation quality is better. They do not constitute the primary claim.

---

## 8. Critic: Strongest Arguments Against

### Critic 1: PTT can be computed with peak detection in 10 lines of code

**Correct.** That is exactly why PTT is a *validation signal*, not the contribution. We are not claiming novelty in PTT computation. We are claiming that a model trained on the Δt prediction objective implicitly encodes PTT in its latent space, which is evidence that the latent captures vascular dynamics rather than just cardiac rhythm. If the same latent did *not* encode PTT, we would doubt that it learned anything physiologically meaningful.

### Critic 2: Small dataset vs AnyPPG's 100k+ hours

**Conceded.** We are not competing at scale. The comparison is controlled: PhysioJEPA vs Baseline C (InfoNCE) trained on the same N hours. The architectural claim is about inductive bias on fixed data, not about scale. We report this comparison explicitly.

### Critic 3: "Physiological asymmetry" is just an architectural choice, not a principled claim

**Partially conceded.** The architecture encodes a *hypothesis* about the direction of information flow (ECG→PPG). Baseline B (Δt=0) isolates the offset but not the direction, so the relevant test is the symmetric Δt>0 variant described in Q2. If that variant performs identically to PhysioJEPA, the asymmetry contributed nothing and we remove it from the contribution list. The ablation is the test.

### Critic 4: The Δt sampling mixing ratio (60/40) is a hyperparameter

**Correct.** Ablation A5 (curriculum Δt) tests whether the sampling schedule matters. For v1 we use 60/40 pragmatically; if A5 shows a different schedule is better, we adopt it.
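The 60/40 mixture under discussion can be sketched as follows. This is a minimal illustration, not the training code: it assumes the mix is applied per sample (one Bernoulli draw per window, so a batch is roughly 60/40 in expectation), and it clamps ground-truth PTT into the [50 ms, 500 ms] range of the Δt embedding. Both choices, along with the names `sample_delta_t` and `p_log_uniform`, are assumptions made for the example.

```python
import math
import random
from typing import List, Optional, Sequence

DT_MIN_MS = 50.0   # lower bound of the Δt range in the v1 spec
DT_MAX_MS = 500.0  # upper bound of the Δt range in the v1 spec


def sample_delta_t(
    gt_ptt_ms: Sequence[float],
    p_log_uniform: float = 0.6,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Return one Δt (in ms) for each sample in the batch.

    With probability p_log_uniform the offset is drawn log-uniformly in
    [DT_MIN_MS, DT_MAX_MS]; otherwise the sample's measured PTT is used.
    """
    if rng is None:
        rng = random.Random()
    log_lo, log_hi = math.log(DT_MIN_MS), math.log(DT_MAX_MS)
    offsets = []
    for ptt in gt_ptt_ms:
        if rng.random() < p_log_uniform:
            # Uniform in log-space gives equal mass to each octave of Δt.
            offsets.append(math.exp(rng.uniform(log_lo, log_hi)))
        else:
            # Assumption: out-of-range PTT values are clamped rather than dropped.
            offsets.append(min(max(ptt, DT_MIN_MS), DT_MAX_MS))
    return offsets


if __name__ == "__main__":
    batch_ptt = [180.0, 240.0, 310.0, 205.0]
    print(sample_delta_t(batch_ptt, rng=random.Random(0)))
```

Because the ratio is an argument, a different mix or a curriculum schedule can be passed in per step without touching the rest of the loop.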
This is not a fundamental weakness; it is a hyperparameter like any other.

### Critic 5: Shortcut: the model predicts mean PPG for all inputs

**Real risk.** Explicitly monitored via per-patient prediction variance (Pitfall 6). If detected, it is addressed before any results are reported.

---

## 9. Reviewer Critiques (updated post-feedback)

The reviewer critique document (provided separately) raised seven structural issues. Status of each:

| Issue | Status | Resolution |
|-------|--------|-----------|
| 3 contributions in 1 paper | Fixed | Core claim reduced to one sentence; PTT and morphology are evidence/ablations |
| PTT head framing backwards | Fixed | PTT is validation signal; cross-modal Δt prediction is the claim |
| Morphological encoding = #1 technical risk | Fixed | Moved to ablation A1; not in v1 |
| "Causal" overclaimed | Fixed | Renamed to PhysioJEPA; language changed to "directional asymmetry" / "physiologically informed" |
| Core idea not isolated | Fixed | E3 vs Baseline B (Δt=0) is the controlled isolation; both are identical except Δt |
| Baselines needed from Week 1 | Fixed | E2 baseline suite runs days 4–5, before E3 |
| "World model" evaluation missing | Fixed | E4 rollout coherence is explicit and uses physiological consistency checks |

---

## 10. Open Questions

**Q1: How well is the MIMIC-IV aligned PPG-ECG dataset actually aligned?**
Unknown until E0. The most important unanswered question. Answer by Day 2.

**Q2: Does the asymmetric architecture (ECG predicts PPG, not PPG predicts ECG) outperform the symmetric version?**
None of the ablations A1–A5 tests this at the architecture level. Baseline B isolates Δt but not directionality; if we add a symmetric Δt>0 variant (PPG predicts ECG with the same lag), we can test this separately. Lower priority; add if time permits.

**Q3: Does the cross-modal training improve the ECG encoder relative to ECG-only training?**
K3 tests this: E3 AUROC should match Baseline A (ECG-JEPA alone).
If it's worse, the cross-modal objective is hurting the ECG representation. This would be a significant negative result worth reporting.

**Q4: How does the model behave during AF?**
AF removes the periodic P-wave and makes RR intervals irregular. The Δt sweep may fail to find a meaningful optimum during AF episodes. This is actually interesting: the model's inability to predict a stable optimal Δt during AF could itself be a detection signal. Monitor in E4.

**Q5: Is MIMIC-BP the right held-out dataset for PTT validation?**
MIMIC-BP (Kachuee et al.) is derived from MIMIC-III; the training data is MIMIC-IV-derived. Same institution (BIDMC), no patient overlap, but similar population. This is a reasonable evaluation setup but should be documented carefully to pre-empt reviewer concerns about distribution leakage.

---

## 11. Paper Identity and Venues

**Title**: *PhysioJEPA: Learning Cardiovascular Dynamics via Time-Shifted Cross-Modal Prediction*

**One-paragraph abstract (draft)**:

Contrastive self-supervised methods for ECG-PPG representation learning align same-time signals in a shared embedding space, discarding the physiological lag between cardiac electrical activation and peripheral perfusion. This lag, the pulse transit time (PTT), encodes arterial stiffness and correlates with blood pressure. We introduce PhysioJEPA, a JEPA-based world model that instead trains an ECG encoder to predict PPG latents at a variable time offset Δt, preserving and exploiting the directional temporal structure that contrastive methods destroy. We show that Δt-aware prediction produces cardiovascular representations that (1) outperform same-time contrastive alignment on AF detection sample efficiency, (2) implicitly encode PTT without label supervision, as demonstrated via rollout coherence tests and linear probing, and (3) transfer more efficiently from limited labelled data than InfoNCE-trained baselines. Code and models are released under an open licence.

**Venue targets (updated with real 2026 deadlines)**:

| Venue | Deadline | Type | Fit |
|-------|----------|------|-----|
| NeurIPS 2026 workshops (TS4H, BrainBodyFM) | ~August 2026 | Workshop (non-archival) | Strong: 4-page format, time series + health |
| ML4H 2026 | ~September 2026 (estimated from 2025 pattern) | Symposium (archival proceedings track) | Strong: healthcare ML focus, 8 pages |
| ICLR 2027 | ~October 2026 | Conference (archival) | Stretch: needs clean ablations and strong Table 1 |
| NeurIPS 2026 main | May 6, 2026 | Conference (archival) | Too soon: experiment matrix runs through mid-May |

**Realistic path**: NeurIPS 2026 workshop (TS4H) as the first landing point (~August deadline, results from the experiment matrix available by then); ML4H 2026 as the archival target; ICLR 2027 as a stretch if the rollout coherence result is strong.

---

*Document revision 2, April 2026*
*All "CausalCardio-JEPA" references replaced. Reviewer feedback incorporated.*
*Active documents: this file + physiojep_experiment_matrix.md*