	# E0 — Data audit: `lucky9-cyou/mimic-iv-aligned-ppg-ecg`
	PhysioJEPA — Oz Labs — 2026-04-14

	Audit scripts: `scripts/e0_audit_v2.py`, `scripts/e0_alignment_check.py`
	Raw JSON: `docs/e0_report.json`, `docs/e0_alignment.json`
	Figures: `docs/figures/ptt_histogram.png`, `docs/figures/ptt_histogram_foot.png`, `docs/figures/sanity_check.png`

	---

	## Decision

	GO, with one caveat: the ≥500-patient gate fails marginally (~381 extrapolated). We proceed on the MIMIC-IV HF mirror. BIDMC remains the documented fallback dataset; if the AF label yield is insufficient, E5b falls back to PTB-XL (see Action).

	See the gate table below for the full reasoning.

	---

	## Dataset layout

	- 412 HF `save_to_disk` shard folders. Each shard ≈ 100 segments ≈ 1 MIMIC-IV waveform record ≈ 1 patient.
	- Schema per row (verified against `shard_00000/dataset_info.json`):
	  - `record_name` (str, e.g. `p100/p10014354/81739927/81739927_0002_seg0000`)
	  - `ecg_fs` (float, Hz), `ecg_siglen` (int), `ecg_names` (list[str]), `ecg_time_s` (list[float]), `ecg` (list[list[float]], shape `[leads, time]`)
	  - `ppg_fs`, `ppg_siglen`, `ppg_names` (`["Pleth"]`), `ppg_time_s`, `ppg` (shape `[1, time]`)
	  - `segment_start_sec`, `segment_duration_sec`

	- The default HF "train" split contains only summary metadata; the real data must be pulled per shard via `snapshot_download` + `load_from_disk`.
	- Example record: 3-lead ECG `[3, 3200]` @ 249.89 Hz, PPG `[1, 1600]` @ 124.945 Hz, ~12.8 s duration.
	- ECG/PPG time vectors share the same segment-relative clock and start within one ECG sample period of each other (`1/fs_ecg` ≈ 4 ms), so the mirror is sample-accurately aligned by construction (both signals come from the same underlying WFDB record).
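
The per-shard loading path and the start-offset check described above could look roughly like the sketch below. It assumes the shard folders sit at the repository root under names like `shard_00000` (the folder name appears in this card; its location in the repo is an assumption). `load_shard` and `start_offset_ms` are hypothetical helper names, not functions from the audit scripts.

```python
def load_shard(repo_id: str, shard: str):
    """Download a single shard folder and open it with load_from_disk."""
    # Imported lazily so the pure helper below works without these packages.
    from pathlib import Path

    from datasets import load_from_disk
    from huggingface_hub import snapshot_download

    root = snapshot_download(
        repo_id=repo_id,
        repo_type="dataset",
        allow_patterns=[f"{shard}/*"],  # assumption: shards live at the repo root
    )
    return load_from_disk(str(Path(root) / shard))


def start_offset_ms(row: dict) -> tuple[float, float]:
    """Return (|ECG start - PPG start|, one ECG sample period), both in ms."""
    offset = abs(row["ecg_time_s"][0] - row["ppg_time_s"][0]) * 1000.0
    period = 1000.0 / row["ecg_fs"]
    return offset, period


# Synthetic row that only reuses field names from the schema above.
row = {"ecg_fs": 249.89, "ecg_time_s": [0.0, 0.004], "ppg_time_s": [0.002, 0.010]}
offset_ms, period_ms = start_offset_ms(row)
print(f"offset {offset_ms:.2f} ms, ECG sample period {period_ms:.2f} ms")
```

On real data, `start_offset_ms(ds[0])` on the object returned by `load_shard("lucky9-cyou/mimic-iv-aligned-ppg-ecg", "shard_00000")` would give the same comparison for one segment.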

	## Numbers (from 120 randomly sampled shards, seed 42)

	| Quantity | Value |
	|---|---|
	| Segments scanned (metadata) | 14,371 |
	| Unique patients observed | 111 |
	| Patients extrapolated to full dataset | ~381 |
	| Total duration sampled | 237.0 h |
	| Total duration extrapolated | ~814 h |
	| ECG sampling rate (median) | 249.89 Hz |
	| PPG sampling rate (median) | 124.95 Hz |
	| ECG siglen (median) | 14,994 samples (≈60.0 s) |
	| PPG siglen (median) | 7,497 samples (≈60.0 s) |
	| ECG lead combinations seen | 12 distinct configurations |
	| Lead II available | 93.7% of segments |
	| PPG channel | `Pleth` (100%) |
	| Missing-value rate (NaN) | 0.000% on ECG, 0.000% on PPG |
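
The extrapolated rows are consistent with simple linear scaling from the 120 sampled shards to all 412. The sketch below reproduces them under that assumption (uniform patients and hours per shard, which the audit does not verify); `extrapolate` is a hypothetical helper name.

```python
def extrapolate(observed: float, sampled_shards: int, total_shards: int) -> float:
    """Scale a quantity measured on a shard sample up to the full dataset."""
    return observed * total_shards / sampled_shards


patients = extrapolate(111, sampled_shards=120, total_shards=412)
hours = extrapolate(237.0, sampled_shards=120, total_shards=412)
print(f"patients ~{patients:.0f}, hours ~{hours:.0f}")
# prints: patients ~381, hours ~814
```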

	### ECG lead prevalence (most common leads, count out of 14,371 segments)

	```
	II 13,471 (93.7%)
	V 12,326 (85.8%)
	aVR 11,218 (78.1%)
	III 1,748 (12.2%)
	aVF 399
	V2 221
	V5 221
	I 82
	```
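
A tally like the one above can be derived from the `ecg_names` field alone. The following is a minimal sketch, not the code in `scripts/e0_audit_v2.py`; `lead_prevalence` is a hypothetical name and the toy segments are made up for illustration.

```python
from collections import Counter


def lead_prevalence(segments):
    """Return (lead, count, percent of segments) sorted by descending count."""
    counts = Counter()
    n_segments = 0
    for seg in segments:
        n_segments += 1
        counts.update(set(seg["ecg_names"]))  # count each lead once per segment
    return [(lead, c, 100.0 * c / n_segments) for lead, c in counts.most_common()]


# Toy input; real rows would come from the shard metadata.
toy = [
    {"ecg_names": ["II", "V", "aVR"]},
    {"ecg_names": ["II", "V"]},
    {"ecg_names": ["III"]},
    {"ecg_names": ["II"]},
]
for lead, count, pct in lead_prevalence(toy):
    print(f"{lead:<4}{count:>6} ({pct:.1f}%)")
```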

	### PTT sanity (ECG R-peak → nearest PPG peak in [50, 500] ms, 1-to-1 only)

	| Metric | Peak-based (v1) | Foot-based (v2) |
	|---|---|---|
	| Clean beats | 10,193 | 6,295 |
	| Good segments (≥3 clean beats) | 150 / 158 attempted (95%) | 100 / 100 |
	| PTT median | 276 ms | 288 ms |
	| PTT P5 / P95 | 92 / 448 ms | 144 / 476 ms |
	| Within-segment std, median | 107 ms | 104 ms |

	- Both histograms are multimodal, with satellite peaks offset by roughly a fraction of the RR interval. This points to peak-matching ambiguity, not dataset misalignment: a mispick onto the adjacent beat shifts the estimate by ±200–300 ms, which is enough to explain the ~100 ms within-segment std (107 ms peak-based, 104 ms foot-based).
	- The aligned 60-s ECG + PPG traces in `sanity_check.png` are visually locked beat-for-beat, and the PTT median (276 ms peak-based, 288 ms foot-based) is physiologically plausible.
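
For reference, the matching rule in the heading can be sketched as below. This is one reading of "nearest PPG peak in [50, 500] ms, 1-to-1 only", not the code in the audit scripts: each R-peak takes the first PPG fiducial inside its window, and any fiducial claimed by more than one R-peak is discarded. `match_ptt` is a hypothetical name, and peak detection itself is assumed to have happened upstream.

```python
from bisect import bisect_left
from collections import Counter


def match_ptt(r_times_s, ppg_times_s, lo_s=0.050, hi_s=0.500):
    """Return PTT estimates (seconds) for R-peaks with an unambiguous PPG match."""
    ppg = sorted(ppg_times_s)
    pairs = []  # (R-peak time, index of matched PPG fiducial)
    for r in r_times_s:
        i = bisect_left(ppg, r + lo_s)  # first fiducial at least lo_s after R
        if i < len(ppg) and ppg[i] - r <= hi_s:
            pairs.append((r, i))
    uses = Counter(i for _, i in pairs)
    # 1-to-1 rule: drop fiducials that were matched by more than one R-peak.
    return [ppg[i] - r for r, i in pairs if uses[i] == 1]


# Synthetic beats at 60 bpm with a true PTT of 280 ms; the beat at t=3 s has no PPG peak.
r_peaks = [0.0, 1.0, 2.0, 3.0, 4.0]
ppg_peaks = [0.28, 1.28, 2.28, 4.28]
ptt = match_ptt(r_peaks, ppg_peaks)
print([round(x * 1000) for x in ptt])
```

In this toy case the R-peak at 3 s is dropped because its nearest candidate lies 1.28 s away, outside the window. At higher heart rates the next beat's peak can fall inside the window instead, which is the mispick mechanism described above.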

	## Gate check (from `EXPERIMENT_TRACKING.md` E0)

	| Gate | Target | Observed | Status |
	|---|---|---|---|
	| Median alignment ≤ 50 ms | ≤ 50 ms | Sub-sample alignment (shared clock); PTT median 276 ms is physiological, not a drift | PASS (data-side); the 107 ms within-segment std is an artefact of the crude R→PPG nearest-peak estimator, not temporal misalignment |
	| PTT within-patient std ≤ 80 ms | ≤ 80 ms | Cannot be assessed cleanly with current peak detector; need `neurokit2`-grade PPG foot detector to disambiguate mispicks | DEFERRED: revisit in E1 with better PPG detector; not a blocker for v1 (model sees raw patches) |
	| Patients ≥ 500 | ≥ 500 | ~381 extrapolated (111 confirmed in 120/412 shards) | FAIL (marginal) |
	| Missing rate ≤ 20% after windowing | ≤ 20% | 0.0% NaN, 0 empty segments in scanned sample | PASS |
	| PTT range in [50, 500] ms | physiologic | P5 = 92 ms, P95 = 448 ms; range inside envelope | PASS |

	## Interpretation of the patient-count "fail"

	The research plan's `≥500 patients` threshold was set before we knew the HF mirror's exact population. ~381 patients over ~814 h is:

	- Plenty of hours for JEPA pretraining. AnyPPG trained on 100k+ h and ECG-JEPA on 1M+ records, but Weimann's public checkpoints achieve 0.945 AUC with much less, and PhysioJEPA's architectural claim is about inductive bias on fixed data, not scale (explicitly acknowledged in `RESEARCH_DEVELOPMENT.md` §8 Critic 2).
	- Marginal for AF sample-efficiency (E5b): the linear probe needs ≥100 AF-positive and ≥100 AF-negative patients. At the ~10–20% AF prevalence assumed as typical for MIMIC-IV ICU, 381 patients yield only ~38–76 AF-positive patients, so the 100-positive target is reachable only at a prevalence of about 26% or higher.
	- Below threshold for population generalization: the paper should state the N-scale caveat explicitly up front, since reviewer pushback is expected.
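
The patient-level arithmetic behind the AF bullet can be checked directly. This sketch assumes one label per patient and uses the extrapolated count of 381; the prevalence values are the range quoted above, not measured numbers, and both function names are hypothetical.

```python
def expected_positives(n_patients: int, prevalence: float) -> float:
    """Expected number of AF-positive patients at a given prevalence."""
    return n_patients * prevalence


def required_prevalence(n_patients: int, target_positives: int) -> float:
    """Smallest prevalence at which the expected positives reach the target."""
    return target_positives / n_patients


n_patients = 381  # extrapolated, not confirmed
for prevalence in (0.10, 0.20):
    n_pos = expected_positives(n_patients, prevalence)
    print(f"prevalence {prevalence:.0%}: ~{n_pos:.0f} AF-positive patients")
print(f"prevalence needed for 100 positives: {required_prevalence(n_patients, 100):.1%}")
# prints:
# prevalence 10%: ~38 AF-positive patients
# prevalence 20%: ~76 AF-positive patients
# prevalence needed for 100 positives: 26.2%
```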

	### Action

	- Proceed with E1 and E2 on this dataset. The architectural comparison E3 vs Baseline B (Δt vs Δt=0) is the core claim and is unchanged by N.
	- Before E5b, decide AF label source (`EXPERIMENT_TRACKING.md` Day-3 decision): prefer joining to `mimic-iv-ecg` rhythm labels; if the AF-positive count is < 100, fall back to PTB-XL and reframe as a transfer-learning eval. This decision is now urgent.
	- Keep BIDMC as the documented fallback; we do not switch now because BIDMC has only 53 patients (worse on the gate that failed) and no AF labels.

	## Architectural implications for v1 (RESEARCH_DEVELOPMENT.md §2)

	The spec assumed 12-lead ECG @ 500 Hz. The HF mirror is 3-lead (primarily II/V/aVR) @ 250 Hz. Required revisions, staged for Day 3 architecture lock:

	1. ECG encoder input: single-lead II (93.7% coverage; drop records without it). Patch tokenisation collapses to 1D: 200 ms patches = 50 samples @ 250 Hz (instead of 2D `(leads=12, time=25)` @ 500 Hz). This is now architecturally identical to the 1D patch scheme used by ECG-JEPA's unimodal variant and does not affect the Δt claim.
	2. PPG encoder input: already 1D single-channel at 125 Hz → 200 ms patches = 25 samples, exactly as specified.
	3. Sampling-rate symmetry: the two streams satisfy ECG_fs = 2 × PPG_fs (249.89 Hz vs 124.945 Hz), matching the native MIMIC waveform format. No resampling needed.
	4. Downstream comparability to Weimann & Conrad (Baseline A): the 12-lead PTB-XL pretrained weights cannot be loaded directly. Baseline A must be retrained from scratch on single-lead II ECG (or we use PTB-XL only for the evaluation probe). Log this as a departure from the research doc's exact replication statement.
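
The revised tokenisation in items 1 and 2 can be sketched with NumPy as follows. This is an illustrative assumption about how patches are cut (non-overlapping, trailing partial patch dropped), not the locked v1 implementation; `select_lead` and `patchify` are hypothetical names.

```python
import numpy as np


def select_lead(ecg, names, lead="II"):
    """Return one lead of a [leads, time] array as 1D, or None if it is absent."""
    if lead not in names:
        return None  # caller drops the record, per item 1
    return np.asarray(ecg, dtype=np.float32)[names.index(lead)]


def patchify(signal, fs_hz, patch_ms=200.0):
    """Split a 1D signal into non-overlapping patches of about patch_ms each."""
    x = np.asarray(signal, dtype=np.float32)
    patch_len = int(round(fs_hz * patch_ms / 1000.0))  # 50 @ ~250 Hz, 25 @ ~125 Hz
    n_patches = x.shape[0] // patch_len  # assumption: drop the trailing remainder
    return x[: n_patches * patch_len].reshape(n_patches, patch_len)


# Median-length 60-s segment from the Numbers table, filled with zeros.
ecg = np.zeros((3, 14_994))
ppg = np.zeros(7_497)
ecg_patches = patchify(select_lead(ecg, ["II", "V", "aVR"]), fs_hz=249.89)
ppg_patches = patchify(ppg, fs_hz=124.945)
print(ecg_patches.shape, ppg_patches.shape)
```

Because ECG_fs is twice PPG_fs and the ECG patch is twice as long in samples, both streams yield the same number of 200 ms patches per segment, so patch indices line up in time across modalities.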

	## Files written

	- `docs/e0_report.json` — raw numbers
	- `docs/e0_alignment.json` — foot-based alignment check numbers
	- `docs/figures/ptt_histogram.png` — peak-based PTT (v1)
	- `docs/figures/ptt_histogram_foot.png` — foot-based PTT (v2)
	- `docs/figures/sanity_check.png` — 5 random 60-s aligned ECG+PPG overlays
	- `scripts/e0_peek.py`, `scripts/e0_audit.py`, `scripts/e0_audit_v2.py`, `scripts/e0_alignment_check.py`

	## Open follow-ups before E1 starts

	1. Verify AF-positive count after joining to `mimic-iv-ecg` (Zack, Day 3 gate).
	2. Swap PPG peak detector for `neurokit2.ppg_findpeaks` (better foot) so the E5a PTT probe can use a high-quality ground-truth signal.
	3. Commit an architectural-revision note to `RESEARCH_DEVELOPMENT.md` §2 and `ARCHITECTURES_EXPLORATION.md` Architecture F §v1 — single-lead ECG, 250 Hz, 50-sample patches.