
E0 - Data audit: lucky9-cyou/mimic-iv-aligned-ppg-ecg

PhysioJEPA | Oz Labs | 2026-04-14

Audit scripts: scripts/e0_audit_v2.py, scripts/e0_alignment_check.py
Raw JSON: docs/e0_report.json, docs/e0_alignment.json
Figures: docs/figures/ptt_histogram.png, docs/figures/ptt_histogram_foot.png, docs/figures/sanity_check.png


Decision

GO, with one caveat: the ≥500-patient gate is borderline (~381 extrapolated). Proceeding on the MIMIC-IV HF mirror; BIDMC remains the fallback if downstream label yield (AF) is insufficient.

See the gate table below for the full reasoning.


Dataset layout

  • 412 HF save_to_disk shard folders. Each shard ≈ 100 segments ≈ 1 MIMIC-IV waveform record ≈ 1 patient.

  • Schema per row (verified against shard_00000/dataset_info.json):

    • record_name (str, e.g. p100/p10014354/81739927/81739927_0002_seg0000)
    • ecg_fs (float, Hz), ecg_siglen (int), ecg_names (list[str]), ecg_time_s (list[float]), ecg (list[list[float]], shape [leads, time])
    • ppg_fs, ppg_siglen, ppg_names (["Pleth"]), ppg_time_s, ppg (shape [1, time])
    • segment_start_sec, segment_duration_sec
  • The default HF "train" split contains only summary metadata; the real data must be pulled via snapshot_download and then loaded shard by shard with load_from_disk.

  • Example record: 3-lead ECG [3, 3200] @ 249.89 Hz, PPG [1, 1600] @ 124.945 Hz, ~12.8 s duration.

  • ECG/PPG time vectors share the same segment-relative clock and start within 1/fs_ecg of each other (about 4 ms, i.e. one ECG sample) → the mirror is sample-accurately aligned by construction (both signals come from the same underlying WFDB record).
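
The layout above can be exercised with a short sketch like the following. It is illustrative only: the helper names are hypothetical, the `shard_*` folder naming is an assumption inferred from `shard_00000/dataset_info.json`, and the alignment test simply restates the "start within 1/fs_ecg" criterion on a single row.

```python
from pathlib import Path

REPO_ID = "lucky9-cyou/mimic-iv-aligned-ppg-ecg"


def download_mirror(repo_id=REPO_ID):
    """Download the dataset repo and return the local snapshot path."""
    # Lazy import: requires the huggingface_hub package and network access.
    from huggingface_hub import snapshot_download
    return Path(snapshot_download(repo_id=repo_id, repo_type="dataset"))


def shard_dirs(root):
    """List shard folders under root.

    Assumption: every shard is a folder named 'shard_*' that contains a
    dataset_info.json file, as seen for shard_00000.
    """
    return sorted(
        info.parent
        for info in Path(root).rglob("dataset_info.json")
        if info.parent.name.startswith("shard_")
    )


def load_shard(shard_dir):
    """Load one save_to_disk shard as a datasets.Dataset."""
    # Lazy import: requires the datasets package.
    from datasets import load_from_disk
    return load_from_disk(str(shard_dir))


def start_offset_s(row):
    """Absolute difference between the first ECG and first PPG timestamps."""
    return abs(row["ecg_time_s"][0] - row["ppg_time_s"][0])


def is_sample_aligned(row):
    """True if both streams start within one ECG sample period of each other."""
    return start_offset_s(row) <= 1.0 / row["ecg_fs"]


# Usage (needs network access plus the datasets and huggingface_hub packages):
#   root = download_mirror()
#   ds = load_shard(shard_dirs(root)[0])
#   print(is_sample_aligned(ds[0]))
```

Keeping the heavy imports inside the functions lets the path and alignment helpers be tested without downloading anything.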

Numbers (from 120 randomly sampled shards, seed 42)

| Quantity | Value |
|---|---|
| Segments scanned (metadata) | 14,371 |
| Unique patients observed | 111 |
| Patients extrapolated to full dataset | ~381 |
| Total duration sampled | 237.0 h |
| Total duration extrapolated | ~814 h |
| ECG sampling rate (median) | 249.89 Hz |
| PPG sampling rate (median) | 124.95 Hz |
| ECG siglen (median) | 14,994 samples (≈60.0 s) |
| PPG siglen (median) | 7,497 samples (≈60.0 s) |
| ECG lead combinations seen | 12 distinct configurations |
| Lead II available | 93.7% of segments |
| PPG channel | Pleth (100%) |
| Missing-value rate (NaN) | 0.000% on ECG, 0.000% on PPG |
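
The extrapolated figures are consistent with a plain linear scale-up from the 120 sampled shards to all 412. The sketch below reproduces that arithmetic. Linear scaling is an assumption here: it would overestimate the patient count if many patients span several shards.

```python
SAMPLED_SHARDS = 120
TOTAL_SHARDS = 412


def extrapolate(observed, sampled=SAMPLED_SHARDS, total=TOTAL_SHARDS):
    """Scale a quantity observed on the sampled shards to the full dataset."""
    return observed * total / sampled


patients_full = extrapolate(111)        # about 381 patients
hours_full = extrapolate(237.0)         # about 814 hours
segments_per_shard = 14_371 / SAMPLED_SHARDS
mean_segment_s = 237.0 * 3600 / 14_371  # mean segment length in seconds

print(f"patients ~{patients_full:.0f}, hours ~{hours_full:.0f}")
print(f"segments/shard ~{segments_per_shard:.1f}, mean segment ~{mean_segment_s:.1f} s")
```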

ECG lead prevalence (top 10, count out of 14,371 segments)

II     13,471 (93.7%)
V      12,326 (85.8%)
aVR    11,218 (78.1%)
III     1,748 (12.2%)
aVF       399
V2        221
V5        221
I          82

PTT sanity (ECG R-peak → nearest PPG peak in [50, 500] ms, 1-to-1 only)

| Metric | Peak-based (v1) | Foot-based (v2) |
|---|---|---|
| Clean beats | 10,193 | 6,295 |
| Good segments (≥3 clean beats) | 150 / 158 attempted (95%) | 100 / 100 |
| PTT median | 276 ms | 288 ms |
| PTT P5 / P95 | 92 / 448 ms | 144 / 476 ms |
| Within-segment std, median | 107 ms | 104 ms |

  • Both histograms are multimodal, with satellite peaks separated by roughly RR-interval fractions. This points to peak-matching ambiguity, not dataset misalignment: a mispick onto the next beat's peak produces a ±200–300 ms shift, which directly explains the ~100 ms within-segment std.
  • The aligned 60-s ECG + PPG traces in sanity_check.png are visually locked beat-for-beat, and the PTT median (276 ms peak-based, 288 ms foot-based) is physiologically plausible.
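
For reference, a nearest-peak estimator of this kind can be sketched as below. This is not the code in scripts/e0_audit_v2.py. The function name is hypothetical, and "1-to-1 only" is interpreted here as discarding any PPG peak claimed by more than one R-peak, which is an assumption.

```python
from bisect import bisect_left
from collections import Counter
from statistics import median


def match_ptt(r_times, ppg_times, lo=0.050, hi=0.500):
    """Return PTT values (seconds) for unambiguous R-peak -> PPG-peak pairs.

    Each R-peak is paired with the earliest PPG peak whose delay lies in
    [lo, hi]. Pairs whose PPG peak is claimed by several R-peaks are dropped
    (assumed meaning of the 1-to-1 rule).
    """
    ppg_sorted = sorted(ppg_times)
    candidates = []  # (r_time, index into ppg_sorted)
    for r in r_times:
        i = bisect_left(ppg_sorted, r + lo)  # first PPG peak at or after r + lo
        if i < len(ppg_sorted) and ppg_sorted[i] - r <= hi:
            candidates.append((r, i))
    claims = Counter(i for _, i in candidates)
    return [ppg_sorted[i] - r for r, i in candidates if claims[i] == 1]


# Toy example: three beats, the last one has no PPG peak inside the window.
r_peaks = [0.0, 0.8, 1.6]
ppg_peaks = [0.28, 1.08, 2.5]
ptt = match_ptt(r_peaks, ppg_peaks)
print(len(ptt), "clean beats, median PTT", round(median(ptt) * 1000), "ms")
```

Because the estimator takes whatever peak falls in the window, a missed or spurious PPG peak shifts the match by a fraction of a beat, which is the ambiguity described above.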

Gate check (from EXPERIMENT_TRACKING.md E0)

| Gate | Target | Observed | Status |
|---|---|---|---|
| Median alignment ≤ 50 ms | ≤ 50 ms | Sub-sample alignment (shared clock); PTT median 276 ms is physiological, not a drift | PASS (data-side); the 107 ms within-segment std is an artefact of the crude R→PPG nearest-peak estimator, not temporal misalignment |
| PTT within-patient std ≤ 80 ms | ≤ 80 ms | Cannot be assessed cleanly with the current peak detector; a neurokit2-grade PPG foot detector is needed to disambiguate mispicks | DEFERRED: revisit in E1 with a better PPG detector; not a blocker for v1 (model sees raw patches) |
| Patients ≥ 500 | ≥ 500 | ~381 extrapolated (111 confirmed in 120/412 shards) | FAIL (marginal) |
| Missing rate ≤ 20% after windowing | ≤ 20% | 0.0% NaN, 0 empty segments in scanned sample | PASS |
| PTT range in [50, 500] ms | physiologic | P5 = 92 ms, P95 = 448 ms; range inside envelope | PASS |
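
The numerically assessable gates can be re-checked mechanically. The sketch below is a minimal illustration with a hypothetical `Gate` helper. It covers only the patient-count, missing-rate and PTT-range gates, since the other two rest on judgement about the estimator.

```python
from dataclasses import dataclass


@dataclass
class Gate:
    name: str
    observed: float
    threshold: float
    higher_is_better: bool

    def passed(self):
        """Compare the observed value with the threshold in the right direction."""
        if self.higher_is_better:
            return self.observed >= self.threshold
        return self.observed <= self.threshold


def ptt_range_ok(p5_ms, p95_ms, lo_ms=50, hi_ms=500):
    """True if the P5..P95 interval lies inside the physiologic envelope."""
    return lo_ms <= p5_ms and p95_ms <= hi_ms


gates = [
    Gate("patients (extrapolated)", 381, 500, higher_is_better=True),
    Gate("missing rate (%)", 0.0, 20.0, higher_is_better=False),
]

status = {g.name: ("PASS" if g.passed() else "FAIL") for g in gates}
status["PTT range (peak-based)"] = "PASS" if ptt_range_ok(92, 448) else "FAIL"

for name, result in status.items():
    print(f"{name}: {result}")
```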

Interpretation of the patient-count "fail"

The research plan's ≥500-patient threshold was set before we knew the HF mirror's exact population. ~381 patients over ~814 h is:

  • Plenty of hours for JEPA pretraining. AnyPPG trained on 100k+ h and ECG-JEPA on 1M+ records, but Weimann's public checkpoints achieve 0.945 AUC with much less, and PhysioJEPA's architectural claim is about inductive bias on fixed data, not scale (explicitly acknowledged in RESEARCH_DEVELOPMENT.md §8, Critic 2).
  • Marginal for AF sample-efficiency (E5b): we need ≥100 AF-positive and ≥100 AF-negative patients for the linear probe. With ~381 patients, an AF prevalence of ~10–20% (typical for MIMIC-IV ICU) yields only ~38–76 AF-positive patients, so the AF-positive target is met only if prevalence in this subset reaches ~26%.
  • Below threshold for population generalization: we should pre-emptively frame the paper's N-scale caveat explicitly (expected reviewer pushback).
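
The AF yield can be sanity-checked with a few lines of arithmetic. This sketch assumes the extrapolated 381 patients and treats prevalence as a simple per-patient rate. It ignores the further loss from dropping records without lead II.

```python
N_PATIENTS = 381          # extrapolated patient count
REQUIRED_POSITIVE = 100   # AF-positive patients needed for the linear probe


def expected_af_positive(n_patients, prevalence):
    """Expected number of AF-positive patients at a given prevalence."""
    return n_patients * prevalence


def min_prevalence(n_patients, required):
    """Smallest prevalence that yields the required number of positives."""
    return required / n_patients


low = expected_af_positive(N_PATIENTS, 0.10)
high = expected_af_positive(N_PATIENTS, 0.20)
needed = min_prevalence(N_PATIENTS, REQUIRED_POSITIVE)

print(f"10% prevalence -> ~{low:.0f} AF-positive patients")
print(f"20% prevalence -> ~{high:.0f} AF-positive patients")
print(f"prevalence needed for {REQUIRED_POSITIVE} positives: {needed:.1%}")
```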

Action

  • Proceed with E1 and E2 on this dataset. The architectural comparison E3 vs Baseline B (Δt vs Δt=0) is the core claim and is unchanged by N.
  • Before E5b, decide AF label source (EXPERIMENT_TRACKING.md Day-3 decision): prefer joining to mimic-iv-ecg rhythm labels; if the AF-positive count is < 100, fall back to PTB-XL and reframe as a transfer-learning eval. This decision is now urgent.
  • Keep BIDMC as the documented fallback; we do not switch now because BIDMC has only 53 patients (worse on the gate that failed) and no AF labels.

Architectural implications for v1 (RESEARCH_DEVELOPMENT.md §2)

The spec assumed 12-lead ECG @ 500 Hz. The HF mirror is 3-lead (primarily II/V/aVR) @ 250 Hz. Required revisions, staged for Day 3 architecture lock:

  1. ECG encoder input: single-lead II (93.7% coverage; drop records without it). Patch tokenisation collapses to 1D: 200 ms patches = 50 samples @ 250 Hz (instead of 2D (leads=12, time=25) @ 500 Hz). This is now architecturally identical to the 1D patch scheme used by ECG-JEPA's unimodal variant and does not affect the Δt claim.
  2. PPG encoder input: already 1D single-channel at 125 Hz → 200 ms patches = 25 samples, exactly as specified.
  3. Sampling-rate symmetry: both streams now satisfy ECG_fs = 2 × PPG_fs, matching the native MIMIC waveform format. No resampling needed.
  4. Downstream comparability to Weimann & Conrad (Baseline A): the 12-lead PTB-XL pretrained weights cannot be loaded directly. Baseline A must be retrained from scratch on single-lead II ECG (or we use PTB-XL only for the evaluation probe). Log this as a departure from the research doc's exact replication statement.
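
The revised input pipeline can be sketched as follows. This is an illustration under assumptions, not the locked architecture: the function names are hypothetical, patches are non-overlapping, and trailing samples that do not fill a patch are dropped.

```python
def patch_len(fs_hz, patch_ms=200):
    """Number of samples in one patch of patch_ms milliseconds."""
    return round(fs_hz * patch_ms / 1000)


def select_lead(ecg, ecg_names, lead="II"):
    """Return the samples of one ECG lead, or None if the record lacks it."""
    if lead not in ecg_names:
        return None  # such records are dropped in v1
    return ecg[ecg_names.index(lead)]


def patchify(signal, fs_hz, patch_ms=200):
    """Split a 1D signal into non-overlapping patches, dropping the remainder."""
    n = patch_len(fs_hz, patch_ms)
    usable = len(signal) - len(signal) % n
    return [signal[i:i + n] for i in range(0, usable, n)]


# Shapes taken from the example record: ECG [3, 3200] @ 249.89 Hz,
# PPG [1, 1600] @ 124.945 Hz. Sample values are placeholders.
ecg = [[0.0] * 3200 for _ in range(3)]
ppg = [[0.0] * 1600]

lead_ii = select_lead(ecg, ["II", "V", "aVR"])
ecg_patches = patchify(lead_ii, 249.89)
ppg_patches = patchify(ppg[0], 124.945)

print(patch_len(249.89), patch_len(124.945))  # 50 25
print(len(ecg_patches), len(ppg_patches))     # 64 64
```

Because ECG_fs = 2 × PPG_fs, both streams produce the same number of 200 ms patches per segment, so patch indices line up in time across modalities.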

Files written

  • docs/e0_report.json: raw numbers
  • docs/e0_alignment.json: foot-based alignment check numbers
  • docs/figures/ptt_histogram.png: peak-based PTT (v1)
  • docs/figures/ptt_histogram_foot.png: foot-based PTT (v2)
  • docs/figures/sanity_check.png: 5 random 60-s aligned ECG+PPG overlays
  • scripts/e0_peek.py, scripts/e0_audit.py, scripts/e0_audit_v2.py, scripts/e0_alignment_check.py

Open follow-ups before E1 starts

  1. Verify AF-positive count after joining to mimic-iv-ecg (Zack, Day 3 gate).
  2. Swap PPG peak detector for neurokit2.ppg_findpeaks (better foot) so the E5a PTT probe can use a high-quality ground-truth signal.
  3. Commit an architectural-revision note to RESEARCH_DEVELOPMENT.md §2 and ARCHITECTURES_EXPLORATION.md Architecture F §v1: single-lead ECG, 250 Hz, 50-sample patches.