# PhysioJEPA: Minimal Experiment Matrix

*Oz Labs, April 2026*

*Revision 2: post-reviewer critique. All "CausalCardio-JEPA" references replaced.*

---

## The single question this matrix answers

> Does predicting PPG at Δt from ECG produce better cardiovascular representations
> than aligning ECG and PPG at t=0?

Every experiment below either answers this question or gates the next one.
Nothing else runs until K2 is resolved.

---

## Experiment map overview

```
Day 1–2    E0: Data audit            → Go/No-go on dataset
                │
                ▼
Day 3      E1: Morphology vs raw     → Choose PPG encoding, once, forever
                │
                ▼
Day 4–5    E2: Baselines A+B+C       → Establish floor and ceiling
                │
                ▼
Day 6–8    E3: Δt-JEPA v1            → Core claim test (K1, K2, K3)
                │
                ├── FAIL → exit
                │
                ▼
Day 9–10   E4: Rollout coherence     → World model validation
                │
                ▼
Day 11–12  E5: PTT probe             → Downstream validation
                │
                ▼
Day 13–14  E6: Ablation Δt=0 vs Δt>0 → Isolate the single variable
                │
                ▼
Day 15     Decision: paper or pivot
```

---

## E0: Data audit
**Days 1–2 | Prerequisite for everything**

### What to run

```python
import datasets

# Assumes the dataset exposes a "train" split; adjust if the hub layout differs.
# Without `split=`, load_dataset returns a DatasetDict and the loops below
# would iterate over split names instead of records.
ds = datasets.load_dataset("lucky9-cyou/mimic-iv-aligned-ppg-ecg", split="train")

# detect_r_peaks, detect_ppg_peaks, align_peaks, ptt_variability and
# mean_missing_rate are project helpers, not library functions.

# For each record, compute:
# 1. ECG-PPG alignment tolerance
alignment_error_ms = []
for record in ds:
    r_peak_ts = detect_r_peaks(record['ecg'])
    ppg_peak_ts = detect_ppg_peaks(record['ppg'])
    ptt = align_peaks(r_peak_ts, ppg_peak_ts)
    alignment_error_ms.append(ptt_variability(ptt))

# 2. Coverage
n_patients = len(set(record['subject_id'] for record in ds))
total_hours = sum(record['duration'] for record in ds) / 3600
missing_pct = mean_missing_rate(ds)
```

### Pass criteria: ALL must be true

| Metric | Pass | Fail action |
|--------|------|-------------|
| Median alignment error ≤ 50ms | ✓ proceed | Pivot to PhysioNet BIDMC |
| PTT within-patient std ≤ 80ms | ✓ proceed | Same pivot |
| Patients ≥ 500 | ✓ proceed | Supplement with PhysioNet MIMIC-III waveforms |
| Missing rate ≤ 20% after windowing | ✓ proceed | Tighten quality filter |
| PTT range [50ms, 500ms] physiologically plausible | ✓ proceed | Check synchronisation method |

### Output

- `data_card.md`: patients, hours, alignment stats, missing rates
- `ptt_histogram.png`: histogram of measured PTT per patient
- Go/no-go decision logged in `experiments/e0_decision.md`

**If E0 fails**: switch to PhysioNet BIDMC (ECG + PPG, documented 0.1ms alignment, 53 subjects; smaller but clean).
All downstream experiments are identical; only scale changes.

---

## E1: Morphology vs raw PPG patches
**Day 3 | One-time architectural decision**

### What to run

Two target encoders, same ViT-S backbone, 10% of data, 20 epochs each:

**E1a: Raw patch encoder**
- PPG windowed into 200ms patches (25 samples at 125Hz)
- Linear projection → d=256 tokens
- Standard I-JEPA spatial masking within window

**E1b: Morphological encoder**
- Per-beat features: systolic peak height, diastolic notch depth, pulse width, upstroke slope, augmentation index
- Extracted via Bishop & Ercole peak detection + `scipy.signal`
- Linear projection → d=256 tokens per beat

### Metrics to compare

| Metric | What it tests |
|--------|--------------|
| % beats with valid morphology extraction | Is E1b viable on this dataset? |
| Target encoder latent variance | Stability (collapse check) |
| Linear probe AUROC on AF (frozen, 100 AF / 100 normal) | Representation quality |
| MAE of PTT regression from frozen encoder | Vascular information content |

### Decision rule (made once, frozen)

```
if morphology_extraction_rate < 0.70:
    USE raw patches (E1a)
elif E1b linear_probe_AUROC > E1a linear_probe_AUROC + 0.02:
    USE morphological (E1b)
else:
    USE raw patches (E1a): simpler, fewer failure modes
```

### Output

- `e1_decision.md`: which encoder, exact threshold used, quality stats
- `ppg_encoder.py`: the chosen implementation, committed to repo

---

## E2: Baseline suite
**Days 4–5 | Floor and ceiling**

Run all three in parallel. Same data split, same 20 epochs, same evaluation harness.
These are reference points for E3, not ablations.

### AF label source: decide before running E2

**Decision required by**: Day 3 (before baselines start training)

**Owner**: Zack

**Option 1: MIMIC-IV ECG module (preferred)**
Join `mimic-iv-ecg` rhythm annotations to the aligned waveform dataset by `subject_id` + `hadm_id`.
- Pros: in-distribution, same patient population as training data
- Cons: requires verifying the join yields enough AF-positive patients (need ≥100 AF, ≥100 normal for the linear probe to be meaningful)
- Check: `SELECT count(*) FROM mimic-iv-ecg WHERE rhythm = 'atrial fibrillation'` on the HF mirror

**Option 2: PTB-XL (fallback)**
Use PTB-XL rhythm labels as the AF evaluation benchmark.
- Pros: clean, well-labelled, already used by Weimann & Conrad (enables direct comparison)
- Cons: different population (German outpatient vs MIMIC ICU), so it becomes a generalisation test, not an in-distribution one
- Note: framing in the paper changes slightly to "transfer to PTB-XL" rather than "in-distribution evaluation"

**Option 3: PhysioNet AFDB**
MIT-BIH AF Database: 25 long-term ECG recordings with AF annotations.
- Only if Options 1 and 2 both fail
- Very small; only useful for AUROC, not for sample efficiency curves

**Decision log**:

```
AF_LABEL_SOURCE = ""     # fill in before Day 4
DECISION_DATE   = ""
DECISION_BY     = ""
N_AF_POSITIVE   = 0      # verify after join/filter
N_AF_NEGATIVE   = 0
```

### Baseline A: ECG-JEPA (Weimann & Conrad exact replication)

```python
# Fork: github.com/kweimann/ECG-JEPA
# Config: ViT-S/8, multi-block masking, EMA τ=0.996
# Input: ECG only (no PPG at all)
# Loss: standard I-JEPA L1 latent prediction (within ECG)
```

This is the unimodal ceiling. If our model can't match this on ECG-only tasks,
something is wrong with the cross-modal architecture.

### Baseline B: Symmetric cross-modal JEPA (Δt = 0)

```python
# Architecture: identical to E3 in every detail
# EXCEPT: Δt is hardcoded to 0
#   - context:   ECG window at time t
#   - target:    PPG window at the SAME time t (no lag)
#   - predictor: cross-attention ECG → PPG
# Loss: L1 latent prediction
```

This isolates the Δt variable. If E3 beats B on the same tasks, Δt matters.
If not, the core claim fails.

### Baseline C: InfoNCE contrastive (AnyPPG-style)

```python
# Architecture: same dual encoder
# Loss: symmetric InfoNCE
#   z_ecg = ecg_encoder(ECG_t)
#   z_ppg = ppg_encoder(PPG_t)
#   L = InfoNCE(z_ecg, z_ppg, temperature=0.07)
# No Δt, no prediction: pure alignment
```

This is the comparison against the dominant paradigm in the field.

### Metrics for all three

```
After 20 epochs on 10% data, for each model:
  1. Pretraining loss convergence curve
  2. Linear probe AUROC: AF detection (frozen encoder)
  3. Linear probe R²: HR estimation (frozen encoder)
  4. Latent variance + eigenspectrum rank (collapse check)
  5. UMAP: coloured by patient ID, AF status, HR decile
```

### What to learn from E2 before running E3

| Observation | Implication |
|-------------|-------------|
| Baseline A AUROC > 0.80 | ECG alone is strong; cross-modal has a high bar |
| Baseline B collapses | Symmetric cross-modal JEPA is unstable; add SIGReg to E3 from the start |
| Baseline C > Baseline A | Cross-modal information helps; our model has something to beat |
| All three collapse | Data quality problem; revisit E0 |

---

## E3: Δt-JEPA v1
**Days 6–8 | The paper test**

Minimal version of the actual contribution. PPG encoding from E1 decision.
No SIGReg. No cardiac phase encoding. Just: ECG context predicts PPG target at t+Δt.

### Architecture

```python
# ECG encoder:  ViT-S/8, 2D patches (leads × time), EMA target
# PPG encoder:  ViT-S/8, encoding chosen in E1, EMA target
# Predictor:    4-layer cross-attention transformer
#               query   = positional tokens for target PPG beats
#               key/val = ECG context latents + Δt embedding
# Δt embed:     sinusoidal over [50ms, 500ms] → R^256

# Loss:
#   L_cross = L1(predicted_ppg_latent, ema_ppg_encoder_output)
#   L_self  = L1(masked_ecg_pred, ema_ecg_target)   [auxiliary, α=0.3]
#   L_total = L_cross + α * L_self

# Δt sampling per batch:
#   60% log-uniform in [50ms, 500ms]
#   40% ground-truth PTT from dataset
```

### Training config

```yaml
epochs: 100
batch_size: 64
optimizer: AdamW, lr=1e-4, weight_decay=0.04
scheduler: cosine with 10-epoch warmup
ema_tau: 0.996 → 0.9999 over first 30% of training
window: 10s ECG + matched PPG
stride: 5s
data: 100% of passing-E0 records
```

### Collapse monitoring (every 100 steps)

```python
# Log these; stop if cross_modal_cosim > 0.99 for 500 consecutive steps
# (i.e. 5 consecutive logging points at this cadence).
metrics = {
    'ecg_latent_variance':    var(z_ecg).mean(),
    'ppg_latent_variance':    var(z_ppg).mean(),
    'cross_modal_cosim':      cosine_sim(z_ecg_pooled, z_ppg_pred).mean(),
    'ecg_eigenspectrum_rank': effective_rank(cov(z_ecg)),
}
```

### Kill criteria (evaluated at epoch 25)

**K1: Is the model learning anything?**

```python
# Baseline: always predict the dataset-mean PPG latent.
mean_baseline_loss = L1(z_ppg_target, z_ppg_mean_over_dataset)
# PASS: model_loss < 0.85 * mean_baseline_loss
```

**K2: Does Δt matter? (the core claim)**

```python
# Run identical linear probe on frozen E3 and Baseline B encoders
# PASS: E3_AUROC > Baseline_B_AUROC + 0.02   (AF detection)
#    OR E3_R²    > Baseline_B_R²    + 0.05   (HR estimation)
# At least one metric must pass
```

**K3: Does cross-modal training avoid hurting performance relative to unimodal?**

```python
# PASS: E3_AUROC >= Baseline_A_AUROC - 0.01   (E3 may trail A by at most 0.01)
```

### Decision tree at epoch 25

```
K1 FAIL → Stop entirely. Data is unusable or encoder collapsed.
          Check alignment, quality filtering, EMA schedule.
          If clean: the architecture is wrong. Move to Architecture A
          (temporal ECG-JEPA only).

K2 FAIL → Stop. The paper does not exist.
          Δt-aware prediction ≈ t-aligned prediction.
          Pivot options:
            (a) Architecture A: temporal unimodal ECG-JEPA
            (b) Study 4: anomaly detection reusing this codebase
            (c) Rerun with cleaner BIDMC data before final decision.

K2 PASS + K3 FAIL → Cross-modal hurts. Run 10 more epochs.
          If still failing: reduce PPG encoder capacity, check EMA instability.
          If persistent: use lighter PPG encoder (ViT-T instead of ViT-S).

K1 ✓, K2 ✓, K3 ✓ → Continue to epoch 100. Proceed to E4.
```

---

## E4: Rollout coherence test
**Days 9–10 | World model validation**

This is the experiment that separates "JEPA with a lag" from "a cardiovascular world model."
Without it, the paper cannot make the world model claim.

### Protocol

```python
# Frozen encoder + trained predictor. N=200 held-out patients.
optimal_delta_t = {}  # patient -> Δt (ms) with the lowest prediction error
for patient in held_out_patients:
    z_ecg = ecg_encoder(ecg_window_t)

    # Predict at a grid of Δt values
    delta_t_grid = [50, 100, 150, 200, 250, 300, 350, 400, 450, 500]  # ms
    errors = []
    for dt in delta_t_grid:
        z_ppg_pred = predictor(z_ecg, delta_t=dt)
        z_ppg_true = ppg_encoder(ppg_window_at_t_plus_dt)
        errors.append(L1(z_ppg_pred, z_ppg_true))

    # Find optimal Δt (prediction error minimum)
    optimal_delta_t[patient] = delta_t_grid[argmin(errors)]
```

### Physiological consistency checks

```python
# Check 1: Does optimal_Δt correlate with measured PTT?
correlation = spearman(optimal_delta_t, measured_ptt_per_patient)
# PASS: correlation > 0.30

# Check 2: HR-PTT inverse relationship
# High HR → shorter PTT → shorter optimal Δt
high_hr = windows_where(hr > 90)  # bpm
low_hr  = windows_where(hr < 60)  # bpm
# PASS: mean(optimal_Δt[high_hr]) < mean(optimal_Δt[low_hr]), p < 0.05

# Check 3: U-shaped error curve (predictor has a real minimum, not flat)
# errors_for is a hypothetical accessor for the per-patient error curve
# computed in the protocol above. Count patients instead of asserting,
# since the pass bar is a fraction, not 100%.
n_clear = sum(has_clear_minimum(errors_for(patient))   # not monotone, not flat
              for patient in sample_50_patients)
# PASS: n_clear / len(sample_50_patients) >= 0.60
```

### Pass criteria

| Check | Pass | Implication if pass |
|-------|------|---------------------|
| Spearman > 0.30 | Model learned PTT implicitly | Core world-model claim supported |
| HR-PTT ordering | Physiologically consistent | Not a lookup table |
| U-curve ≥ 60% | Predictor has a real minimum | Latent space is smooth |

### If E4 passes but E5 PTT probe fails

The representation has the information but a linear probe can't extract it.
Try a 3-layer MLP probe. If that also fails, the PTT information is encoded
nonlinearly; mention this as a limitation but don't remove the E4 claim from the paper.

---

## E5: Downstream probes
**Days 11–12 | Validation signals**

These run on frozen encoders from the best E3 checkpoint. They are probes, not contributions.
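As a rough illustration of what probing a frozen encoder involves, the sketch below fits a closed-form ridge regression on fixed latents, splits by patient so no subject appears on both sides, and compares the probe's MAE to a naive mean predictor. Everything in it is an assumption for illustration: the latents and PTT values are synthetic stand-ins, and `patient_level_split`, `fit_ridge_probe` and `predict` are hypothetical helpers, not part of the project code.

```python
import numpy as np

def patient_level_split(patient_ids, test_frac=0.2, seed=0):
    """Return boolean (train, test) masks with no patient in both sets."""
    rng = np.random.default_rng(seed)
    unique_ids = np.unique(patient_ids)
    rng.shuffle(unique_ids)
    n_test = max(1, int(round(test_frac * len(unique_ids))))
    test_mask = np.isin(patient_ids, unique_ids[:n_test])
    return ~test_mask, test_mask

def fit_ridge_probe(X, y, l2=1e-3):
    """Closed-form ridge regression with a bias column appended to X."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    A = Xb.T @ Xb + l2 * np.eye(Xb.shape[1])
    return np.linalg.solve(A, Xb.T @ y)

def predict(w, X):
    """Apply a probe fitted by fit_ridge_probe."""
    return np.hstack([X, np.ones((X.shape[0], 1))]) @ w

# Synthetic stand-in for pooled frozen latents and per-window PTT (ms).
rng = np.random.default_rng(0)
n_patients, windows_per_patient, dim = 50, 20, 16
patient_ids = np.repeat(np.arange(n_patients), windows_per_patient)
latents = rng.normal(size=(patient_ids.size, dim))
true_w = rng.normal(size=dim)
ptt_ms = 200.0 + 20.0 * (latents @ true_w) + rng.normal(scale=5.0, size=patient_ids.size)

train_mask, test_mask = patient_level_split(patient_ids)
w = fit_ridge_probe(latents[train_mask], ptt_ms[train_mask])

# Probe error vs. the naive baseline that always predicts the training mean.
probe_mae = np.abs(predict(w, latents[test_mask]) - ptt_ms[test_mask]).mean()
naive_mae = np.abs(ptt_ms[train_mask].mean() - ptt_ms[test_mask]).mean()
print(f"probe MAE = {probe_mae:.1f} ms, naive MAE = {naive_mae:.1f} ms")
```

Splitting on patient IDs rather than on windows matters because windows from the same patient are highly correlated; a window-level split would let the probe score well by memorising patients.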
### E5a: PTT regression probe

```python
# MLP, train and pool are project helpers. Keyword names avoid `in`,
# which is a reserved word in Python.
mlp_ptt = MLP(in_dim=256, hidden_dim=128, out_dim=1)
train(mlp_ptt,
      X=pool(ecg_latent),
      y=measured_ptt_per_beat,
      split=patient_level_80_20)

# Report:
#   MAE (ms) vs naive mean-PTT baseline
#   Pearson(predicted_ptt, measured_ptt)
#   Within-patient: does the probe track PTT changes over time?
```

### E5b: AF detection sample efficiency

```python
# Same linear probe as used in E2/E3, which enables direct comparison
# Label fractions: 1%, 5%, 10%, 50%, 100%
# Models: E3 vs Baseline_A vs Baseline_C
# Goal: sample efficiency curve (not just full-data comparison)
```

### E5c: HR estimation

```python
# Linear regression on frozen latent → HR
# Baseline: RR-interval to HR (trivial; sets the floor)
```

### What must be true for the paper

| Result | Why it matters |
|--------|----------------|
| E5a MAE < naive by ≥ 20% | PTT is in the latent; confirms E4 |
| E5b: E3 ≥ Baseline_A at all label fractions | Cross-modal doesn't hurt |
| E5b: E3 > Baseline_C at 1% labels | JEPA more sample-efficient than InfoNCE |

---

## E6: The decisive ablation
**Days 13–14 | The main result**

One variable changed. Everything else identical.

| Model | Δt | Architecture |
|-------|-----|-------------|
| E3 (PhysioJEPA) | log-uniform [50, 500ms] | Identical |
| Baseline B (t-aligned) | Fixed 0ms | Identical |

Both trained to 100 epochs, full data. Evaluated identically.

### The comparison table (this becomes Table 1 of the paper)

```
Model                | AF AUROC | HR R² | PTT R² | ECG-PPG R@1
────────────────────────────────────────────────────────────────
Baseline A (ECG)     |          |       | N/A    | N/A
Baseline B (Δt=0)    |          |       |        |
Baseline C (InfoNCE) |          |       |        |
E3 (Δt>0, ours)      |          |       |        |
```

### Paper-level claim, if E6 supports it

> Predicting PPG at variable time offset Δt from ECG produces latent representations
> that implicitly encode vascular timing structure (PTT).
> Contrastive alignment at t=0 and predictive alignment at t=0 both destroy this structure.
> This is demonstrated by improved PTT regression, superior sample efficiency on AF detection,
> and physiologically consistent rollout behaviour under varying heart rate.

One paragraph. Defensible. Not overclaiming causality or blood pressure.

---

## Day 15: Decision

```
GREEN: K1, K2, K3 and E4 coherence all pass, and E6 shows Δt>0 beating Δt=0
  → Write the paper.
  → Weeks 3–4: run ablations A1–A5 (morphology, phase encoding,
    SIGReg, PTT head, curriculum Δt).
  → Target venues (approximate 2026 deadlines):
      NeurIPS 2026 workshops (TS4H, BrainBodyFM): ~August 2026
      ML4H 2026 symposium (archival proceedings track): ~September 2026
      ICLR 2027: ~October 2026 (needs strong E4 + clean ablations)

YELLOW: K2 passes weakly, E4 marginal
  → Extend E3 to 200 epochs before deciding.
  → If still weak: reframe as temporal ECG-JEPA (Architecture A).
    Smaller claim but still publishable as an extension of Weimann & Conrad.
    Target: NeurIPS 2026 workshop TS4H.

RED: K2 fails
  → The core idea does not work on this dataset at this scale.
  → Immediate pivot options:
      (a) Architecture A (temporal ECG-JEPA, unimodal): reuses everything
      (b) Study 4 (anomaly detection via prediction error): same codebase
      (c) Re-run E0 on PhysioNet BIDMC before final call.
    Note: CHIL 2026 deadline (Apr 17) has passed. MLHC 2026 (Apr 17) has passed.
    Next realistic archival venue: ML4H 2026 (~Sep 2026 estimated).
```

---

## Post-hoc (2026-04-15): K2 failed, K3 passed, τ mechanism falsified

Actual results from the E2/E3 run (subset_frac=0.10, 25 epochs, seed=42), reported as AF linear-probe AUROC:

| Model | Config | AUROC ep5 | AUROC ep10 | AUROC ep25 |
|-------|--------|-----------|------------|------------|
| F (Δt>0) | PhysioJEPA v1 | 0.652 | 0.859 | 0.835 |
| B (Δt=0) | symmetric cross-modal | 0.660 | 0.844 | **0.847** |
| A (unimodal) | ECG-JEPA | 0.783 | 0.736 | 0.703 |
| C (InfoNCE) | symmetric | n/a | n/a | under-tuned; not usable |

**K2: FAIL.** F−B at ep25 = −0.012 (target was +0.02). Δt doesn't matter.

**K3: PASS, by a wide margin.** F−A at ep25 = +0.133. Cross-modal beats unimodal by ~0.13 AUROC.
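The two verdicts above follow mechanically from the epoch-25 column of the table. The sketch below recomputes them; it covers only the AUROC arm of K2 (no HR R² is reported in this run), and the K3 rule assumes that "within 0.01" means F may trail A by at most 0.01. The dictionary and constant names are illustrative, not taken from the codebase.

```python
# AF linear-probe AUROC at epoch 25, copied from the results table.
auroc_ep25 = {"F": 0.835, "B": 0.847, "A": 0.703}

K2_MARGIN = 0.02      # F must beat B by more than this (AUROC arm of K2)
K3_TOLERANCE = 0.01   # assumed reading of K3: F may trail A by at most this

k2_delta = auroc_ep25["F"] - auroc_ep25["B"]
k3_delta = auroc_ep25["F"] - auroc_ep25["A"]

k2_pass = k2_delta > K2_MARGIN
k3_pass = k3_delta >= -K3_TOLERANCE

print(f"K2: F-B = {k2_delta:+.3f} -> {'PASS' if k2_pass else 'FAIL'}")
print(f"K3: F-A = {k3_delta:+.3f} -> {'PASS' if k3_pass else 'FAIL'}")
```

With the rounded table values the K3 difference comes out as +0.132; the +0.133 quoted in the text presumably reflects unrounded AUROCs. The verdict is unaffected either way.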
**τ-saturation mechanism (slow-τ A ablation): FALSIFIED.** Slow-τ A
(ema_end=0.999, warmup_frac=0.60) had L_self rising *more* than original A
through steps 2000–5000, not less. τ is not the lever.

Working hypothesis for A's degradation: predictor+query-embedding overfits
to a narrow target distribution in unimodal training. Cross-modal training
provides target diversity the predictor can't overfit to, which is why
F/B stay stable. Needs a different ablation (e.g. shrink predictor,
shrink query embedding, vary masking ratio) to confirm.

## Summary

| Day | Experiment | Key output | Decision gated |
|-----|-----------|-----------|----------------|
| 1–2 | E0: data audit | data_card.md, PTT histogram | Dataset go/no-go |
| 3 | E1: PPG encoding | e1_decision.md, ppg_encoder.py | Architecture lock |
| 4–5 | E2: baselines | Floor + ceiling numbers | Calibrates E3 expectations |
| 6–8 | E3: Δt-JEPA v1 | K1/K2/K3 at epoch 25 | Paper exists or doesn't |
| 9–10 | E4: rollout coherence | World model evidence | World model claim |
| 11–12 | E5: probes | PTT, AF, HR numbers | Downstream story |
| 13–14 | E6: decisive ablation | Table 1 | Paper's main result |
| 15 | Decision | Green / yellow / red | What gets written |

**Compute to day 15 decision point: ~50–70 GPU-hours. Cost: ~$125–175.**

K2 is answered by day 8. Everything after that is filling in the paper.
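The compute and cost figures above imply a flat hourly rate, which is worth checking once so the estimate can be rescaled if a run is extended. A minimal sketch, using only the two ranges quoted in the summary (the variable names are illustrative):

```python
# Ranges quoted in the summary: (low, high).
gpu_hours = (50, 70)
cost_usd = (125, 175)

# Implied price per GPU-hour at each end of the range.
implied_rate = [cost / hours for cost, hours in zip(cost_usd, gpu_hours)]
print(implied_rate)  # both ends give the same rate
```

Both ends work out to $2.50 per GPU-hour, so the cost range is simply the GPU-hour range multiplied by a single assumed rate.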
---

## Division of work

| Task | Owner |
|------|-------|
| E0: data pipeline, quality metrics, PTT computation | Zack |
| E1: morphology extractor, two-encoder comparison | Zack |
| E2: ECG-JEPA fork (Baseline A), training | Guy |
| E2: InfoNCE baseline (Baseline C) | Zack |
| E2: Symmetric JEPA (Baseline B) | Guy |
| E3: Δt-JEPA architecture + training loop | Guy |
| E3: collapse monitoring, checkpoint saving | Both |
| E4: rollout coherence test, physiological checks | Guy |
| E5: probe training harness, sample efficiency curves | Zack |
| E6: final comparison, Table 1 | Both |
| Day 15 decision | Both |

---

*Designed so the most important question (does Δt matter?) is answered by day 8, not day 28.*
*Total time to go/no-go: 8 days. Total compute: ~50–70 GPU-hours.*