# PhysioJEPA — Minimal Experiment Matrix
*Oz Labs — April 2026*
*Revision 2: post-reviewer critique. All "CausalCardio-JEPA" references replaced.*
---
## The single question this matrix answers
> Does predicting PPG at Δt from ECG produce better cardiovascular representations
> than aligning ECG and PPG at t=0?
Every experiment below either answers this question or gates the next one.
Nothing else runs until K2 is resolved.
---
## Experiment map overview
```
Day 1–2     E0: Data audit            → Go/No-go on dataset
               │
               ▼
Day 3       E1: Morphology vs raw     → Choose PPG encoding, once, forever
               │
               ▼
Day 4–5     E2: Baselines A+B+C       → Establish floor and ceiling
               │
               ▼
Day 6–8     E3: Δt-JEPA v1            → Core claim test (K1, K2, K3)
               │
               ├── FAIL → exit
               │
               ▼
Day 9–10    E4: Rollout coherence     → World model validation
               │
               ▼
Day 11–12   E5: PTT probe             → Downstream validation
               │
               ▼
Day 13–14   E6: Ablation Δt=0 vs Δt>0 → Isolate the single variable
               │
               ▼
Day 15      Decision: paper or pivot
```
---
## E0 — Data audit
**Days 1–2 | Prerequisite for everything**
### What to run
```python
import datasets

# Load the aligned ECG-PPG dataset (the "train" split is an assumption)
ds = datasets.load_dataset("lucky9-cyou/mimic-iv-aligned-ppg-ecg", split="train")

# For each record, compute:
# 1. ECG-PPG alignment tolerance
#    (detect_r_peaks, detect_ppg_peaks, align_peaks, ptt_variability are project helpers)
alignment_error_ms = []
for record in ds:
    r_peak_ts = detect_r_peaks(record['ecg'])
    ppg_peak_ts = detect_ppg_peaks(record['ppg'])
    ptt = align_peaks(r_peak_ts, ppg_peak_ts)
    alignment_error_ms.append(ptt_variability(ptt))

# 2. Coverage
n_patients = len(set(record['subject_id'] for record in ds))
total_hours = sum(record['duration'] for record in ds) / 3600
missing_pct = mean_missing_rate(ds)
```
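The helper functions above (`detect_r_peaks`, `detect_ppg_peaks`, `align_peaks`, `ptt_variability`) are project code not shown here. A minimal sketch of one way to implement them with `scipy.signal.find_peaks`; the 125 Hz sampling rate, the prominence thresholds, and the first-following-peak pairing rule are all illustrative assumptions:

```python
import numpy as np
from scipy.signal import find_peaks

FS = 125  # Hz, assumed sampling rate for both ECG and PPG

def detect_r_peaks(ecg: np.ndarray) -> np.ndarray:
    """Rough R-peak detector: prominence-based peak picking on a single lead."""
    peaks, _ = find_peaks(ecg, distance=int(0.3 * FS), prominence=np.std(ecg))
    return peaks / FS  # timestamps in seconds

def detect_ppg_peaks(ppg: np.ndarray) -> np.ndarray:
    """Systolic-peak detector for PPG with the same prominence heuristic."""
    peaks, _ = find_peaks(ppg, distance=int(0.3 * FS), prominence=0.5 * np.std(ppg))
    return peaks / FS

def align_peaks(r_peak_ts: np.ndarray, ppg_peak_ts: np.ndarray) -> np.ndarray:
    """Pair each R peak with the first PPG peak that follows it; the gap is a PTT estimate."""
    ptt = []
    for r in r_peak_ts:
        later = ppg_peak_ts[ppg_peak_ts > r]
        if len(later):
            ptt.append(later[0] - r)
    return np.asarray(ptt) * 1000  # ms

def ptt_variability(ptt_ms: np.ndarray) -> float:
    """Within-record PTT spread, used as the alignment-error proxy."""
    return float(np.std(ptt_ms))
```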
### Pass criteria — ALL must be true
| Criterion | Pass action | Fail action |
|-----------|-------------|-------------|
| Median alignment ≤ 50ms | ✓ proceed | Pivot to PhysioNet BIDMC |
| PTT within-patient std ≤ 80ms | ✓ proceed | Same pivot |
| Patients ≥ 500 | ✓ proceed | Supplement with PhysioNet MIMIC-III waveforms |
| Missing rate ≤ 20% after windowing | ✓ proceed | Tighten quality filter |
| PTT range [50ms, 500ms] physiologically plausible | ✓ proceed | Check synchronisation method |
### Output
- `data_card.md`: patients, hours, alignment stats, missing rates
- `ptt_histogram.png`: histogram of measured PTT per patient
- Go/no-go decision logged in `experiments/e0_decision.md`
**If E0 fails**: PhysioNet BIDMC (ECG + PPG, documented 0.1ms alignment, 53 subjects — smaller but clean). All downstream experiments are identical; only scale changes.
---
## E1 — Morphology vs raw PPG patches
**Day 3 | One-time architectural decision**
### What to run
Two target encoders, same ViT-S backbone, 10% of data, 20 epochs each:
**E1a — Raw patch encoder**
- PPG windowed into 200ms patches (25 samples at 125Hz)
- Linear projection → d=256 tokens
- Standard I-JEPA spatial masking within window
**E1b — Morphological encoder**
- Per-beat features: systolic peak height, diastolic notch depth, pulse width, upstroke slope, augmentation index
- Extracted via Bishop & Ercole peak detection + `scipy.signal` (see the sketch after this list)
- Linear projection → d=256 tokens per beat
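Below is a minimal sketch of per-beat morphology extraction on a foot-to-foot PPG segment. The feature definitions (especially the augmentation-index proxy) are simplified assumptions for illustration; the Bishop & Ercole detector itself is not reproduced:

```python
import numpy as np
from scipy.signal import find_peaks

FS = 125  # Hz, assumed PPG sampling rate

def beat_morphology(beat: np.ndarray):
    """Extract simple morphological features from one PPG beat (foot-to-foot segment)."""
    if len(beat) < int(0.3 * FS):
        return None  # too short to be a plausible beat
    sys_idx = int(np.argmax(beat))
    sys_height = float(beat[sys_idx] - beat[0])
    # Dicrotic/diastolic notch: first local minimum after the systolic peak
    troughs, _ = find_peaks(-beat[sys_idx:])
    notch_depth = float(beat[sys_idx] - beat[sys_idx + troughs[0]]) if len(troughs) else np.nan
    return {
        "sys_height": sys_height,
        "notch_depth": notch_depth,
        "pulse_width": len(beat) / FS,                          # seconds, foot to foot
        "upstroke_slope": sys_height / max(sys_idx / FS, 1e-3), # amplitude per second
        "aug_index": notch_depth / sys_height if sys_height > 0 else np.nan,
    }
```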
### Metrics to compare
| Metric | What it tests |
|--------|--------------|
| % beats with valid morphology extraction | Is E1b viable on this dataset? |
| Target encoder latent variance | Stability (collapse check) |
| Linear probe AUROC on AF (frozen, 100 AF / 100 normal) | Representation quality |
| MAE of PTT regression from frozen encoder | Vascular information content |
### Decision rule (made once, frozen)
```
if morphology_extraction_rate < 0.70:
    USE raw patches (E1a)
elif E1b_linear_probe_AUROC > E1a_linear_probe_AUROC + 0.02:
    USE morphological (E1b)
else:
    USE raw patches (E1a)   # simpler, fewer failure modes
```
### Output
- `e1_decision.md`: which encoder, exact threshold used, quality stats
- `ppg_encoder.py`: the chosen implementation, committed to repo
---
## E2 — Baseline suite
**Days 4–5 | Floor and ceiling**
Run all three in parallel. Same data split, same 20 epochs, same evaluation harness.
These are reference points for E3, not ablations.
### AF label source — decide before running E2
**Decision required by**: Day 3 (before baselines start training)
**Owner**: Zack
**Option 1 — MIMIC-IV ECG module (preferred)**
Join `mimic-iv-ecg` rhythm annotations to the aligned waveform dataset by `subject_id` + `hadm_id`.
- Pros: in-distribution, same patient population as training data
- Cons: requires verifying the join yields enough AF-positive patients (need ≥100 AF, ≥100 normal for the linear probe to be meaningful)
- Check: `SELECT count(*) FROM mimic-iv-ecg WHERE rhythm = 'atrial fibrillation'` on the HF mirror (a pandas version of this join and count is sketched after Option 3)
**Option 2 — PTB-XL (fallback)**
Use PTB-XL rhythm labels as the AF evaluation benchmark.
- Pros: clean, well-labelled, already used by Weimann & Conrad (enables direct comparison)
- Cons: different population (German outpatient vs MIMIC ICU) — becomes a generalisation test, not in-distribution
- Note: framing in paper changes slightly to "transfer to PTB-XL" rather than "in-distribution evaluation"
**Option 3 — PhysioNet AFDB**
MIT-BIH AF Database: 25 long-term ECG recordings with AF annotations.
- Only if Options 1 and 2 both fail
- Very small; only useful for AUROC, not for sample efficiency curves
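A minimal sketch of the Option 1 join and AF count with pandas, assuming both tables can be loaded as DataFrames with `subject_id`, `hadm_id`, and a free-text `rhythm` column (file names below are hypothetical; the actual column names on the HF mirror may differ):

```python
import pandas as pd

# labels: one row per ECG with a rhythm annotation (from mimic-iv-ecg)
# waves:  one row per aligned ECG-PPG record (from the aligned waveform dataset)
labels = pd.read_parquet("mimic_iv_ecg_rhythm_labels.parquet")   # hypothetical file name
waves = pd.read_parquet("aligned_ppg_ecg_index.parquet")         # hypothetical file name

labels["is_af"] = labels["rhythm"].str.contains("atrial fibrillation", case=False, na=False)

joined = waves.merge(labels[["subject_id", "hadm_id", "is_af"]],
                     on=["subject_id", "hadm_id"], how="inner")

# Count at the patient level so the >=100 AF / >=100 normal check is meaningful
per_patient = joined.groupby("subject_id")["is_af"].max()
print("AF-positive patients:", int(per_patient.sum()))
print("AF-negative patients:", int((~per_patient).sum()))
```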
**Decision log**:
```
AF_LABEL_SOURCE = "" # fill in before Day 4
DECISION_DATE = ""
DECISION_BY = ""
N_AF_POSITIVE = 0 # verify after join/filter
N_AF_NEGATIVE = 0
```
### Baseline A — ECG-JEPA (Weimann & Conrad exact replication)
```python
# Fork: github.com/kweimann/ECG-JEPA
# Config: ViT-S/8, multi-block masking, EMA τ=0.996
# Input: ECG only (no PPG at all)
# Loss: standard I-JEPA L1 latent prediction (within ECG)
```
This is the unimodal ceiling. If our model can't match this on ECG-only tasks, something is wrong with the cross-modal architecture.
### Baseline B — Symmetric cross-modal JEPA (Δt = 0)
```python
# Architecture: identical to E3 in every detail
# EXCEPT: Δt is hardcoded to 0
# - context: ECG window at time t
# - target: PPG window at the SAME time t (no lag)
# - predictor: cross-attention ECG → PPG
# Loss: L1 latent prediction
```
This isolates the Δt variable. If E3 beats B on the same tasks, Δt matters. If not, the core claim fails.
### Baseline C — InfoNCE contrastive (AnyPPG-style)
```python
# Architecture: same dual encoder
# Loss: symmetric InfoNCE
# z_ecg = ecg_encoder(ECG_t)
# z_ppg = ppg_encoder(PPG_t)
# L = InfoNCE(z_ecg, z_ppg, temperature=0.07)
# No Δt, no prediction — pure alignment
```
This is the comparison against the dominant paradigm in the field.
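A minimal PyTorch sketch of the symmetric InfoNCE loss for Baseline C, assuming both encoders emit one pooled d=256 embedding per window (variable and function names are illustrative):

```python
import torch
import torch.nn.functional as F

def symmetric_infonce(z_ecg: torch.Tensor, z_ppg: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """z_ecg, z_ppg: (B, d) pooled embeddings of time-aligned ECG/PPG windows."""
    z_ecg = F.normalize(z_ecg, dim=-1)
    z_ppg = F.normalize(z_ppg, dim=-1)
    logits = z_ecg @ z_ppg.t() / temperature              # (B, B) cosine-similarity matrix
    targets = torch.arange(z_ecg.size(0), device=z_ecg.device)
    loss_e2p = F.cross_entropy(logits, targets)           # ECG -> matching PPG
    loss_p2e = F.cross_entropy(logits.t(), targets)       # PPG -> matching ECG
    return 0.5 * (loss_e2p + loss_p2e)
```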
### Metrics for all three
```
After 20 epochs on 10% data, for each model:
1. Pretraining loss convergence curve
2. Linear probe AUROC — AF detection (frozen encoder; see the probe sketch below)
3. Linear probe R² — HR estimation (frozen encoder)
4. Latent variance + eigenspectrum rank (collapse check)
5. UMAP: coloured by patient ID, AF status, HR decile
```
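The same frozen-encoder linear probe recurs in E2, E3 (K2/K3), and E5b. A minimal scikit-learn sketch, assuming pooled embeddings and labels have already been extracted to arrays (the patient-level split via `GroupShuffleSplit` is an assumption about the intended protocol):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GroupShuffleSplit

def linear_probe_auroc(X: np.ndarray, y: np.ndarray, groups: np.ndarray, seed: int = 42) -> float:
    """X: (N, d) frozen pooled embeddings; y: AF labels; groups: subject_id per window.
    Patient-level split so no subject appears in both train and test."""
    splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=seed)
    train_idx, test_idx = next(splitter.split(X, y, groups))
    clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    scores = clf.predict_proba(X[test_idx])[:, 1]
    return roc_auc_score(y[test_idx], scores)
```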
### What to learn from E2 before running E3
| Observation | Implication |
|-------------|-------------|
| Baseline A AUROC > 0.80 | ECG alone is strong; cross-modal has a high bar |
| Baseline B collapses | Symmetric cross-modal JEPA is unstable; add SIGReg to E3 from the start |
| Baseline C > Baseline A | Cross-modal information helps; our model has something to beat |
| All three collapse | Data quality problem — revisit E0 |
---
## E3 — Δt-JEPA v1
**Days 6–8 | The paper test**
Minimal version of the actual contribution.
PPG encoding from E1 decision. No SIGReg. No cardiac phase encoding.
Just: ECG context predicts PPG target at t+Δt.
### Architecture
```python
# ECG encoder: ViT-S/8, 2D patches (leads × time), EMA target
# PPG encoder: ViT-S/8, encoding chosen in E1, EMA target
# Predictor: 4-layer cross-attention transformer
#   query   = positional tokens for target PPG beats
#   key/val = ECG context latents + Δt embedding
# Δt embed: sinusoidal over [50ms, 500ms] → R^256
# Loss:
#   L_cross = L1(predicted_ppg_latent, ema_ppg_encoder_output)
#   L_self  = L1(masked_ecg_pred, ema_ecg_target)   [auxiliary, α=0.3]
#   L_total = L_cross + α * L_self
# Δt sampling per batch:
#   60% log-uniform in [50ms, 500ms]
#   40% ground-truth PTT from dataset
```
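A minimal sketch of the Δt embedding and the per-batch Δt sampling described above. The geometric frequency ladder and the exact sampling mechanics are assumptions; only the [50 ms, 500 ms] range and the 60/40 split come from the spec:

```python
import math
import torch

def delta_t_embedding(delta_t_ms: torch.Tensor, dim: int = 256,
                      t_min: float = 50.0, t_max: float = 500.0) -> torch.Tensor:
    """(B,) lags in ms -> (B, dim) sinusoidal embedding."""
    pos = (delta_t_ms.clamp(t_min, t_max) - t_min) / (t_max - t_min)   # normalise to [0, 1]
    freqs = torch.exp(torch.linspace(0.0, 8.0, dim // 2, device=delta_t_ms.device))
    angles = pos[:, None] * freqs[None, :]                              # (B, dim/2)
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)    # (B, dim)

def sample_delta_t(gt_ptt_ms: torch.Tensor, p_gt: float = 0.4,
                   t_min: float = 50.0, t_max: float = 500.0) -> torch.Tensor:
    """Per-sample Δt: 40% ground-truth PTT, otherwise log-uniform in [t_min, t_max]."""
    b = gt_ptt_ms.size(0)
    log_u = torch.empty(b, device=gt_ptt_ms.device).uniform_(math.log(t_min), math.log(t_max)).exp()
    use_gt = torch.rand(b, device=gt_ptt_ms.device) < p_gt
    return torch.where(use_gt, gt_ptt_ms.clamp(t_min, t_max), log_u)
```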
### Training config
```yaml
epochs: 100
batch_size: 64
optimizer: AdamW, lr=1e-4, weight_decay=0.04
scheduler: cosine with 10-epoch warmup
ema_tau: 0.996 → 0.9999 over first 30% of training
window: 10s ECG + matched PPG
stride: 5s
data: 100% of passing-E0 records
```
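A minimal sketch of the EMA τ schedule and target-encoder update implied by the config. A linear ramp from 0.996 to 0.9999 over the first 30% of steps is one reading of the spec; a cosine ramp would also be consistent with it:

```python
import torch

def ema_tau(step: int, total_steps: int, tau_start: float = 0.996,
            tau_end: float = 0.9999, warmup_frac: float = 0.30) -> float:
    """Linearly ramp the EMA momentum over the first warmup_frac of training, then hold."""
    ramp_steps = max(int(warmup_frac * total_steps), 1)
    frac = min(step / ramp_steps, 1.0)
    return tau_start + frac * (tau_end - tau_start)

@torch.no_grad()
def ema_update(target_params, online_params, tau: float) -> None:
    """In-place EMA update of the target encoder from the online encoder."""
    for p_t, p_o in zip(target_params, online_params):
        p_t.mul_(tau).add_(p_o, alpha=1.0 - tau)
```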
### Collapse monitoring (every 100 steps)
```python
# Log these — stop if cross_modal_cosim > 0.99 for 500 consecutive steps
metrics = {
'ecg_latent_variance': var(z_ecg).mean(),
'ppg_latent_variance': var(z_ppg).mean(),
'cross_modal_cosim': cosine_sim(z_ecg_pooled, z_ppg_pred).mean(),
'ecg_eigenspectrum_rank': effective_rank(cov(z_ecg)),
}
```
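`effective_rank` above is a project helper. A minimal sketch of one common definition (exponential of the entropy of the normalised eigenvalue spectrum); this is an assumption about what the monitoring code computes:

```python
import torch

def effective_rank(cov: torch.Tensor, eps: float = 1e-8) -> float:
    """Effective rank = exp(entropy of normalised eigenvalues) of a covariance matrix."""
    eigvals = torch.linalg.eigvalsh(cov).clamp_min(0)
    p = eigvals / (eigvals.sum() + eps)
    entropy = -(p * torch.log(p + eps)).sum()
    return float(torch.exp(entropy))
```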
### Kill criteria — evaluated at epoch 25
**K1 — Is the model learning anything?**
```python
mean_baseline_loss = L1(z_ppg_target, z_ppg_mean_over_dataset)
# PASS: model_loss < 0.85 * mean_baseline_loss
```
**K2 — Does Δt matter? (the core claim)**
```python
# Run identical linear probe on frozen E3 and Baseline B encoders
# PASS: E3_AUROC > Baseline_B_AUROC + 0.02 (AF detection)
# OR    E3_R² > Baseline_B_R² + 0.05 (HR estimation)
# At least one metric must pass
```
**K3 — Is cross-modal at least as good as unimodal?**
```python
# PASS: E3_AUROC >= Baseline_A_AUROC - 0.01 (i.e. no worse than unimodal within a 0.01 tolerance)
```
### Decision tree at epoch 25
```
K1 FAIL → Stop entirely.
          Data is unusable or encoder collapsed.
          Check alignment, quality filtering, EMA schedule.
          If clean: the architecture is wrong. Move to Architecture A (temporal ECG-JEPA only).

K2 FAIL → Stop. The paper does not exist.
          Δt-aware prediction ≈ t-aligned prediction.
          Pivot options:
            (a) Architecture A — temporal unimodal ECG-JEPA
            (b) Study 4 — anomaly detection reusing this codebase
            (c) Rerun with cleaner BIDMC data before final decision.

K2 PASS + K3 FAIL → Cross-modal hurts.
          Run 10 more epochs. If still failing:
          Reduce PPG encoder capacity, check EMA instability.
          If persistent: use lighter PPG encoder (ViT-T instead of ViT-S).

K1 ✓, K2 ✓, K3 ✓ → Continue to epoch 100. Proceed to E4.
```
---
## E4 — Rollout coherence test
**Days 9–10 | World model validation**
This is the experiment that separates "JEPA with a lag" from "a cardiovascular world model." Without it, the paper cannot make the world model claim.
### Protocol
```python
# Frozen encoder + trained predictor. N=200 held-out patients.
delta_t_grid = [50, 100, 150, 200, 250, 300, 350, 400, 450, 500]  # ms
optimal_delta_t = {}
for patient in held_out_patients:
    z_ecg = ecg_encoder(ecg_window_t)
    # Predict at a grid of Δt values and score against the true PPG latent
    errors = []
    for dt in delta_t_grid:
        z_ppg_pred = predictor(z_ecg, delta_t=dt)
        z_ppg_true = ppg_encoder(ppg_window_at_t_plus_dt)
        errors.append(L1(z_ppg_pred, z_ppg_true))
    # Find optimal Δt (prediction-error minimum)
    optimal_delta_t[patient] = delta_t_grid[argmin(errors)]
```
### Physiological consistency checks
```python
# Check 1: Does optimal_Δt correlate with measured PTT?
correlation = spearman(optimal_delta_t, measured_ptt_per_patient)
# PASS: correlation > 0.30

# Check 2: HR-PTT inverse relationship
# High HR → shorter PTT → shorter optimal Δt
high_hr = windows_where(hr > 90)   # bpm
low_hr = windows_where(hr < 60)    # bpm
# PASS: mean(optimal_Δt[high_hr]) < mean(optimal_Δt[low_hr]), p < 0.05

# Check 3: U-shaped error curve (predictor has a real minimum, not flat)
n_clear = sum(has_clear_minimum(errors_per_patient[p]) for p in sample_50_patients)
# PASS: ≥ 60% of the sampled patients have a clear minimum (not monotone, not flat)
```
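A minimal sketch of Checks 1 and 2 with `scipy.stats`; the one-sided Mann-Whitney U test for Check 2 is an assumption, and any one-sided two-sample test would serve:

```python
import numpy as np
from scipy.stats import spearmanr, mannwhitneyu

def check_ptt_correlation(optimal_dt_ms: np.ndarray, measured_ptt_ms: np.ndarray) -> bool:
    """Check 1: Spearman correlation between per-patient optimal Δt and measured PTT."""
    rho, _ = spearmanr(optimal_dt_ms, measured_ptt_ms)
    return rho > 0.30

def check_hr_ordering(opt_dt_high_hr: np.ndarray, opt_dt_low_hr: np.ndarray) -> bool:
    """Check 2: optimal Δt should be shorter in high-HR windows than in low-HR windows."""
    _, p = mannwhitneyu(opt_dt_high_hr, opt_dt_low_hr, alternative="less")
    return bool(opt_dt_high_hr.mean() < opt_dt_low_hr.mean() and p < 0.05)
```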
### Pass criteria
| Check | What a pass shows | Paper implication |
|-------|-------------------|-------------------|
| Spearman > 0.30 | Model learned PTT implicitly | Core world-model claim supported |
| HR-PTT ordering | Physiologically consistent | Not a lookup table |
| U-curve ≥ 60% | Predictor has a real minimum | Latent space is smooth |
### If E4 passes but E5 PTT probe fails
The representation has the information but a linear probe can't extract it. Try a 3-layer MLP probe. If that also fails, the PTT information is encoded nonlinearly — mention this as a limitation but don't remove the E4 claim from the paper.
---
## E5 — Downstream probes
**Days 11–12 | Validation signals**
These run on frozen encoders from E3 best checkpoint. They are probes, not contributions.
### E5a — PTT regression probe
```python
mlp_ptt = MLP(in_dim=256, hidden=128, out_dim=1)   # `in` is a reserved word; renamed for valid Python
train(mlp_ptt,
      X=pool(ecg_latent),
      y=measured_ptt_per_beat,
      split=patient_level_80_20)
# Report:
#   MAE (ms) vs naive mean-PTT baseline
#   Pearson(predicted_ptt, measured_ptt)
#   Within-patient: does the probe track PTT changes over time?
```
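A minimal sketch of E5a with scikit-learn instead of a hand-rolled MLP (a substitution made purely for illustration; the hidden width mirrors the pseudocode above, and the patient-level split uses `GroupShuffleSplit`):

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GroupShuffleSplit
from sklearn.neural_network import MLPRegressor

def ptt_probe(X: np.ndarray, y_ptt_ms: np.ndarray, subject_id: np.ndarray, seed: int = 42) -> dict:
    """X: (N, 256) pooled frozen ECG latents; y_ptt_ms: measured per-beat PTT in ms."""
    splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=seed)
    tr, te = next(splitter.split(X, y_ptt_ms, subject_id))
    probe = MLPRegressor(hidden_layer_sizes=(128,), max_iter=500, random_state=seed)
    probe.fit(X[tr], y_ptt_ms[tr])
    pred = probe.predict(X[te])
    naive = np.full_like(y_ptt_ms[te], y_ptt_ms[tr].mean())   # predict the training-mean PTT
    r, _ = pearsonr(pred, y_ptt_ms[te])
    return {"mae_ms": mean_absolute_error(y_ptt_ms[te], pred),
            "naive_mae_ms": mean_absolute_error(y_ptt_ms[te], naive),
            "pearson_r": float(r)}
```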
### E5b — AF detection sample efficiency
```python
# Same linear probe as used in E2/E3 — enables direct comparison
# Label fractions: 1%, 5%, 10%, 50%, 100%
# Models: E3 vs Baseline_A vs Baseline_C
# Goal: sample efficiency curve (not just full-data comparison)
```
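A minimal sketch of the label-fraction sweep, reusing the `linear_probe_auroc` helper sketched under E2. Subsampling whole patients (rather than windows) at each fraction is an assumption about the intended protocol:

```python
import numpy as np

def sample_efficiency_curve(X, y, groups, fractions=(0.01, 0.05, 0.10, 0.50, 1.00), seed=42):
    """AUROC of the frozen-encoder linear probe at decreasing label budgets."""
    rng = np.random.default_rng(seed)
    patients = np.unique(groups)
    curve = {}
    for frac in fractions:
        keep = rng.choice(patients, size=max(int(frac * len(patients)), 2), replace=False)
        mask = np.isin(groups, keep)
        curve[frac] = linear_probe_auroc(X[mask], y[mask], groups[mask], seed=seed)
    return curve
```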
### E5c — HR estimation
```python
# Linear regression on frozen latent → HR
# Baseline: RR-interval to HR (trivial — sets floor)
```
### What must be true for the paper
| Result | Why it matters |
|--------|----------------|
| E5a MAE < naive by ≥ 20% | PTT is in the latent — confirms E4 |
| E5b: E3 ≥ Baseline_A at all label fractions | Cross-modal doesn't hurt |
| E5b: E3 > Baseline_C at 1% labels | JEPA more sample-efficient than InfoNCE |
---
## E6 — The decisive ablation
**Days 13–14 | The main result**
One variable changed. Everything else identical.
| Model | Δt | Architecture |
|-------|-----|-------------|
| E3 (PhysioJEPA) | log-uniform [50, 500ms] | Identical |
| Baseline B (t-aligned) | Fixed 0ms | Identical |
Both trained to 100 epochs, full data. Evaluated identically.
### The comparison table (this becomes Table 1 of the paper)
```
Model                | AF AUROC | HR R² | PTT R² | ECG-PPG R@1
─────────────────────┼──────────┼───────┼────────┼────────────
Baseline A (ECG)     |          |       |  N/A   |    N/A
Baseline B (Δt=0)    |          |       |        |
Baseline C (InfoNCE) |          |       |        |
E3 (Δt>0, ours)      |          |       |        |
```
### Paper-level claim, if E6 supports it
> Predicting PPG at variable time offset Δt from ECG produces latent representations
> that implicitly encode vascular timing structure (PTT).
> Contrastive alignment at t=0 and predictive alignment at t=0 both destroy this structure.
> This is demonstrated by improved PTT regression, superior sample efficiency on AF detection,
> and physiologically consistent rollout behaviour under varying heart rate.
One paragraph. Defensible. Not overclaiming causality or blood pressure.
---
## Day 15 — Decision
```
GREEN  — all of K1, K2, K3, E4 coherence, E6 Δt > Δt=0
         → Write the paper.
         → Weeks 3–4: run ablations A1–A5 (morphology, phase encoding,
           SIGReg, PTT head, curriculum Δt).
         → Target venues (with actual 2026 deadlines):
             NeurIPS 2026 workshops (TS4H, BrainBodyFM): ~August 2026
             ML4H 2026 symposium (archival proceedings track): ~September 2026
             ICLR 2027: ~October 2026 (needs strong E4 + clean ablations)

YELLOW — K2 passes weakly, E4 marginal
         → Extend E3 to 200 epochs before deciding.
         → If still weak: reframe as temporal ECG-JEPA (Architecture A).
           Smaller claim but still publishable as an extension of Weimann & Conrad.
           Target: NeurIPS 2026 workshop TS4H.

RED    — K2 fails
         → The core idea does not work on this dataset at this scale.
         → Immediate pivot options:
             (a) Architecture A (temporal ECG-JEPA, unimodal) — reuses everything
             (b) Study 4 (anomaly detection via prediction error) — same codebase
             (c) Re-run E0 on PhysioNet BIDMC before final call.
           Note: CHIL 2026 deadline (Apr 17) has passed. MLHC 2026 (Apr 17) has passed.
           Next realistic archival venue: ML4H 2026 (~Sep 2026 estimated).
```
---
## Post-hoc (2026-04-15): K2 failed, K3 passed, τ mechanism falsified
Actual results from the E2/E3 run (subset_frac=0.10, 25 epochs, seed=42):
| Model | Config | ep5 AUROC | ep10 AUROC | ep25 AUROC |
|-------|--------|-----------|------------|------------|
| F (Δt>0) | PhysioJEPA v1 | 0.652 | 0.859 | 0.835 |
| B (Δt=0) | symmetric cross-modal | 0.660 | 0.844 | **0.847** |
| A (unimodal) | ECG-JEPA | 0.783 | 0.736 | 0.703 |
| C (InfoNCE) | symmetric | — | — | under-tuned; not usable |
**K2: FAIL.** F−B at ep25 = −0.012 (target was +0.02). Δt doesn't matter.
**K3: PASS BIG.** F−A at ep25 = +0.133. Cross-modal beats unimodal by ~0.13 AUROC.
**τ-saturation mechanism (slow-τ A ablation): FALSIFIED.**
Slow-τ A (ema_end=0.999, warmup_frac=0.60) had L_self rising *more* than
original A through steps 2000–5000, not less. τ is not the lever.
Working hypothesis for A's degradation: predictor+query-embedding overfits
to a narrow target distribution in unimodal training. Cross-modal training
provides target diversity the predictor can't overfit to, which is why
F/B stay stable. Needs a different ablation (e.g. shrink predictor, shrink
query embedding, vary masking ratio) to confirm.
---
## Summary
| Day | Experiment | Key output | Decision gated |
|-----|-----------|-----------|----------------|
| 1–2 | E0: data audit | data_card.md, PTT histogram | Dataset go/no-go |
| 3 | E1: PPG encoding | e1_decision.md, ppg_encoder.py | Architecture lock |
| 4–5 | E2: baselines | Floor + ceiling numbers | Calibrates E3 expectations |
| 6–8 | E3: Δt-JEPA v1 | K1/K2/K3 at epoch 25 | Paper exists or doesn't |
| 9–10 | E4: rollout coherence | World model evidence | World model claim |
| 11–12 | E5: probes | PTT, AF, HR numbers | Downstream story |
| 13–14 | E6: decisive ablation | Table 1 | Paper's main result |
| 15 | Decision | Green / yellow / red | What gets written |
**Compute to day 15 decision point: ~50–70 GPU-hours. Cost: ~$125–175.**
K2 is answered by day 8. Everything after that is filling in the paper.
---
## Division of work
| Task | Owner |
|------|-------|
| E0: data pipeline, quality metrics, PTT computation | Zack |
| E1: morphology extractor, two-encoder comparison | Zack |
| E2: ECG-JEPA fork (Baseline A), training | Guy |
| E2: InfoNCE baseline (Baseline C) | Zack |
| E2: Symmetric JEPA (Baseline B) | Guy |
| E3: Δt-JEPA architecture + training loop | Guy |
| E3: collapse monitoring, checkpoint saving | Both |
| E4: rollout coherence test, physiological checks | Guy |
| E5: probe training harness, sample efficiency curves | Zack |
| E6: final comparison, Table 1 | Both |
| Day 15 decision | Both |
---
*Designed so the most important question — does Δt matter? — is answered by day 8, not day 28.*
*Total time to go/no-go: 8 days. Total compute: ~50–70 GPU-hours.*