# E0: Data audit of `lucky9-cyou/mimic-iv-aligned-ppg-ecg`
*PhysioJEPA, Oz Labs, 2026-04-14*

Audit scripts: `scripts/e0_audit_v2.py`, `scripts/e0_alignment_check.py`
Raw JSON: `docs/e0_report.json`, `docs/e0_alignment.json`
Figures: `docs/figures/ptt_histogram.png`, `docs/figures/ptt_histogram_foot.png`, `docs/figures/sanity_check.png`

---

## Decision

**GO, with one caveat: the ≥500-patient gate is borderline (~381 extrapolated). Proceeding on the MIMIC-IV HF mirror; BIDMC remains the fallback if downstream label yield (AF) is insufficient.**

See the gate table below for the full reasoning.

---

## Dataset layout

- 412 HF `save_to_disk` shard folders. Each shard ≈ 100 segments ≈ 1 MIMIC-IV waveform record ≈ 1 patient.
- Schema per row (verified against `shard_00000/dataset_info.json`):
  - `record_name` (str, e.g. `p100/p10014354/81739927/81739927_0002_seg0000`)
  - `ecg_fs` (float, Hz), `ecg_siglen` (int), `ecg_names` (list[str]), `ecg_time_s` (list[float]), `ecg` (list[list[float]], shape `[leads, time]`)
  - `ppg_fs`, `ppg_siglen`, `ppg_names` (`["Pleth"]`), `ppg_time_s`, `ppg` (shape `[1, time]`)
  - `segment_start_sec`, `segment_duration_sec`
- The default HF "train" split contains only summary metadata; the real data must be pulled via `snapshot_download` + `load_from_disk` per shard.
- Example record: 3-lead ECG `[3, 3200]` @ 249.89 Hz, PPG `[1, 1600]` @ 124.945 Hz, ~12.8 s duration.
- ECG and PPG time vectors share the same segment-relative clock and start within one ECG sample period (`1/ecg_fs`, about 4 ms) of each other. The mirror is sample-aligned by construction, since both signals come from the same underlying WFDB record.

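The per-shard access pattern and two of the checks described above can be sketched as follows. This is a minimal illustration, not the audit script itself: the helper names are hypothetical, the patient folder is assumed to be the second path component of `record_name` (as in the example row), and shard folders are assumed to sit at the repository root under names like `shard_00000`.

```python
import os


def patient_id(record_name: str) -> str:
    """Return the patient folder (e.g. 'p10014354') from a record_name.

    Assumption: record_name follows the '<group>/<patient>/<record>/<segment>'
    layout seen in the example row.
    """
    parts = record_name.split("/")
    if len(parts) < 2 or not parts[1].startswith("p"):
        raise ValueError(f"unexpected record_name layout: {record_name!r}")
    return parts[1]


def start_offset_ms(row: dict) -> float:
    """Absolute difference between the ECG and PPG start times, in ms."""
    return abs(row["ecg_time_s"][0] - row["ppg_time_s"][0]) * 1000.0


def load_shard(shard_name: str,
               repo_id: str = "lucky9-cyou/mimic-iv-aligned-ppg-ecg"):
    """Download one shard folder and open it (needs network access)."""
    # Imported lazily so the pure-Python helpers above work without these packages.
    from datasets import load_from_disk
    from huggingface_hub import snapshot_download

    local_dir = snapshot_download(
        repo_id=repo_id,
        repo_type="dataset",
        allow_patterns=[f"{shard_name}/*"],  # assumed folder layout
    )
    return load_from_disk(os.path.join(local_dir, shard_name))
```

With a loaded shard, `{patient_id(r) for r in ds["record_name"]}` gives the unique patients, and `start_offset_ms(ds[0]) <= 1000.0 / ds[0]["ecg_fs"]` restates the one-sample start-alignment check.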
## Numbers (from 120 randomly sampled shards, seed 42)

| Quantity | Value |
|---|---|
| Segments scanned (metadata) | 14,371 |
| Unique patients observed | 111 |
| **Patients extrapolated to full dataset** | **~381** |
| Total duration sampled | 237.0 h |
| **Total duration extrapolated** | **~814 h** |
| ECG sampling rate (median) | **249.89 Hz** |
| PPG sampling rate (median) | **124.95 Hz** |
| ECG siglen (median) | 14,994 samples (≈60.0 s) |
| PPG siglen (median) | 7,497 samples (≈60.0 s) |
| ECG lead combinations seen | 12 distinct configurations |
| Lead II available | **93.7% of segments** |
| PPG channel | `Pleth` (100%) |
| Missing-value rate (NaN) | **0.000%** on ECG, **0.000%** on PPG |

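The extrapolated rows follow from linear scaling of the sampled totals by the shard ratio 412/120. The sketch below reproduces that arithmetic; the function name is illustrative, and the scaling assumes the sampled shards are representative and that patients do not recur across shards.

```python
def extrapolate(sampled_value: float, sampled_shards: int, total_shards: int) -> float:
    """Scale a total measured on a shard sample linearly to all shards."""
    return sampled_value * total_shards / sampled_shards


SAMPLED_SHARDS, TOTAL_SHARDS = 120, 412

patients_est = extrapolate(111, SAMPLED_SHARDS, TOTAL_SHARDS)   # patients
hours_est = extrapolate(237.0, SAMPLED_SHARDS, TOTAL_SHARDS)    # hours

# Median segment duration implied by siglen / sampling rate.
ecg_seconds = 14_994 / 249.89
ppg_seconds = 7_497 / 124.95

print(f"patients ~{patients_est:.0f}, hours ~{hours_est:.0f}")
print(f"ECG {ecg_seconds:.2f} s, PPG {ppg_seconds:.2f} s per median segment")
```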
### ECG lead prevalence (top 8, count out of 14,371 segments)

```
II     13,471  (93.7%)
V      12,326  (85.8%)
aVR    11,218  (78.1%)
III     1,748  (12.2%)
aVF       399
V2        221
V5        221
I          82
```

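A per-lead tally like the one above can be produced from the `ecg_names` field alone. The following is a small sketch under the assumption that each row exposes `ecg_names` as a list of strings; the function name is hypothetical and the audit script may compute this differently.

```python
from collections import Counter


def lead_prevalence(rows):
    """Return (lead, segment_count, percent_of_segments), most common first."""
    counts = Counter()
    n_segments = 0
    for row in rows:
        n_segments += 1
        # Count each lead at most once per segment.
        counts.update(set(row["ecg_names"]))
    return [(lead, c, 100.0 * c / n_segments) for lead, c in counts.most_common()]


# Tiny synthetic example (not real data).
demo_rows = [
    {"ecg_names": ["II", "V", "aVR"]},
    {"ecg_names": ["II", "V"]},
    {"ecg_names": ["III"]},
    {"ecg_names": ["II"]},
]
for lead, count, pct in lead_prevalence(demo_rows):
    print(f"{lead:<4} {count:>3}  ({pct:.1f}%)")
```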
### PTT sanity (ECG R-peak → nearest PPG peak in [50, 500] ms, 1-to-1 only)

| Metric | Peak-based (v1) | Foot-based (v2) |
|---|---|---|
| Clean beats | 10,193 | 6,295 |
| Good segments (≥3 clean beats) | 150 / 158 attempted (**95%**) | 100 / 100 |
| PTT median | **276 ms** | 288 ms |
| PTT P5 / P95 | 92 / 448 ms | 144 / 476 ms |
| Within-segment std, median | 107 ms | 104 ms |

- Both histograms are multimodal, with satellite peaks separated by roughly RR-interval fractions. This indicates **peak-matching ambiguity, not dataset misalignment**: picking a peak on the neighbouring beat shifts the estimate by ±200–300 ms, which is enough to explain the ~100 ms within-segment std.
- The aligned 60-s ECG + PPG traces in `sanity_check.png` are visually locked beat-for-beat, and the PTT median is physiologically plausible.

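The matching rule in the heading can be sketched as below. This is an illustrative reading, not the code in `scripts/e0_alignment_check.py`: "1-to-1 only" is interpreted here as keeping an R-peak only when exactly one PPG peak falls inside its window and that PPG peak has not already been assigned. Peak detection itself is out of scope; inputs are sorted event times in seconds.

```python
from bisect import bisect_left, bisect_right
from statistics import median


def match_ptt(r_peaks_s, ppg_peaks_s, lo_s=0.050, hi_s=0.500):
    """Return PTT values (ms) for unambiguous R-peak to PPG-peak pairs."""
    ptt_ms = []
    used = set()  # indices of PPG peaks already paired with an R-peak
    for r in r_peaks_s:
        # PPG peaks with r + lo_s <= t <= r + hi_s live in ppg_peaks_s[i:j].
        i = bisect_left(ppg_peaks_s, r + lo_s)
        j = bisect_right(ppg_peaks_s, r + hi_s)
        if j - i != 1 or i in used:
            continue  # no candidate, several candidates, or already claimed
        used.add(i)
        ptt_ms.append((ppg_peaks_s[i] - r) * 1000.0)
    return ptt_ms


# Synthetic example: 75 bpm rhythm with a constant 280 ms delay.
r_demo = [0.0, 0.8, 1.6]
ppg_demo = [0.28, 1.08, 1.88]
ptt_demo = match_ptt(r_demo, ppg_demo)
print(len(ptt_demo), "clean beats, median PTT", round(median(ptt_demo)), "ms")
```

Under this rule an R-peak with two PPG candidates in the window is dropped rather than resolved, so the ambiguity described above reduces the clean-beat count instead of inflating the spread.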
## Gate check (from `EXPERIMENT_TRACKING.md` E0)

| Gate | Target | Observed | Status |
|---|---|---|---|
| Median alignment ≤ 50 ms | ≤ 50 ms | Sub-sample alignment (shared clock); PTT median 276 ms is physiological, not a drift | **PASS** (data-side); the 107 ms within-segment std is an artefact of the crude R→PPG nearest-peak estimator, not temporal misalignment |
| PTT within-patient std ≤ 80 ms | ≤ 80 ms | Cannot be assessed cleanly with the current peak detector; needs a `neurokit2`-grade PPG foot detector to disambiguate mispicks | **DEFERRED**: revisit in E1 with a better PPG detector; not a blocker for v1 (model sees raw patches) |
| Patients ≥ 500 | ≥ 500 | **~381 extrapolated** (111 confirmed in 120/412 shards) | **FAIL (marginal)** |
| Missing rate ≤ 20% after windowing | ≤ 20% | 0.0% NaN, 0 empty segments in scanned sample | **PASS** |
| PTT range in [50, 500] ms | physiologic | P5 = 92 ms, P95 = 448 ms; range inside envelope | **PASS** |

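The three gates that reduce to a numeric comparison can be re-checked mechanically. The sketch below uses hypothetical key names and covers only those three; the alignment and within-patient std gates rest on the qualitative reasoning in the table and are not encoded here.

```python
def check_numeric_gates(obs: dict) -> dict:
    """Evaluate the purely numeric E0 gates against observed values."""
    return {
        "patients_ge_500": obs["patients_extrapolated"] >= 500,
        "missing_rate_le_20pct": obs["missing_rate_pct"] <= 20.0,
        "ptt_p5_p95_in_50_500_ms": (
            50.0 <= obs["ptt_p5_ms"] and obs["ptt_p95_ms"] <= 500.0
        ),
    }


# Observed values copied from the tables in this report.
observed = {
    "patients_extrapolated": 381,
    "missing_rate_pct": 0.0,
    "ptt_p5_ms": 92.0,
    "ptt_p95_ms": 448.0,
}
for gate, ok in check_numeric_gates(observed).items():
    print(f"{gate}: {'PASS' if ok else 'FAIL'}")
```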
## Interpretation of the patient-count "fail"

The research plan's `≥500 patients` threshold was set before we knew the HF mirror's exact population. **~381 patients over ~814 h** is:

- Plenty of **hours** for JEPA pretraining. AnyPPG trained on 100k+ h and ECG-JEPA on 1M+ records, but Weimann's public checkpoints achieve 0.945 AUC with much less, and PhysioJEPA's architectural claim is about **inductive bias on fixed data**, not scale (explicitly acknowledged in `RESEARCH_DEVELOPMENT.md` §8 Critic 2).
- **Marginal for AF sample-efficiency (E5b)**: we need ≥100 AF-positive and ≥100 AF-negative patients for the linear probe. If AF prevalence in MIMIC-IV ICU is ~10–20% (typical), 381 patients yield only about 38–76 AF-positive patients; reaching 100 would need a prevalence of roughly 26% or more, so the AF-positive target is at risk until the label join confirms the count.
- Below threshold for population generalization: we should **pre-emptively state the paper's N-scale caveat** (expected reviewer pushback).

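A back-of-the-envelope check of the AF yield, using only the extrapolated patient count and the prevalence range quoted above. The function names are illustrative, and the real count depends on the label join, not on this estimate.

```python
def expected_positives(n_patients: int, prevalence: float) -> float:
    """Expected number of positive patients at a given prevalence."""
    return n_patients * prevalence


def required_prevalence(n_patients: int, n_required: int) -> float:
    """Prevalence needed to reach n_required positives among n_patients."""
    return n_required / n_patients


N_PATIENTS = 381   # extrapolated patient count
TARGET_POS = 100   # AF-positive patients wanted for the linear probe

af_low = expected_positives(N_PATIENTS, 0.10)
af_high = expected_positives(N_PATIENTS, 0.20)
af_needed = required_prevalence(N_PATIENTS, TARGET_POS)

print(f"AF-positive at 10-20% prevalence: {af_low:.0f} to {af_high:.0f}")
print(f"prevalence needed for {TARGET_POS} positives: {af_needed:.1%}")
```

The AF-negative target is not the constraint here: at any prevalence below about 74%, 381 patients leave more than 100 negatives.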
### Action

- **Proceed with E1 and E2 on this dataset.** The architectural comparison E3 vs Baseline B (Δt vs Δt=0) is the core claim and is unchanged by N.
- Before E5b, **decide the AF label source** (`EXPERIMENT_TRACKING.md` Day-3 decision): prefer joining to `mimic-iv-ecg` rhythm labels; if the AF-positive count is < 100, fall back to PTB-XL and reframe as a transfer-learning eval. This decision is now urgent.
- Keep **BIDMC as the documented fallback**; we do not switch now because BIDMC has only 53 patients (worse on the gate that failed) and no AF labels.

## Architectural implications for v1 (RESEARCH_DEVELOPMENT.md §2)

The spec assumed **12-lead ECG @ 500 Hz**. The HF mirror is **3-lead (primarily II/V/aVR) @ 250 Hz**. Required revisions, staged for Day 3 architecture lock:

1. **ECG encoder input**: single-lead II (93.7% coverage; drop records without it). Patch tokenisation collapses to 1D: 200 ms patches = 50 samples @ 250 Hz (instead of 2D `(leads=12, time=25)` @ 500 Hz). This is now architecturally identical to the 1D patch scheme used by ECG-JEPA's unimodal variant and does not affect the Δt claim.
2. **PPG encoder input**: already 1D single-channel at 125 Hz; 200 ms patches = 25 samples, exactly as specified.
3. **Sampling-rate symmetry**: both streams now satisfy *ECG_fs = 2 × PPG_fs*, matching the native MIMIC waveform format. No resampling needed.
4. **Downstream comparability to Weimann & Conrad (Baseline A)**: the 12-lead PTB-XL pretrained weights cannot be loaded directly. Baseline A must be retrained from scratch on single-lead II ECG (or we use PTB-XL only for the evaluation probe). Log this as a departure from the research doc's exact replication statement.

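The revised tokenisation in items 1 to 3 can be sketched as follows. This is a plain-Python illustration with hypothetical helper names; dropping the incomplete trailing patch is an assumption, not something the spec states.

```python
def patch_len(fs_hz: float, patch_ms: float = 200.0) -> int:
    """Samples per patch at a given sampling rate, rounded to an integer."""
    return round(fs_hz * patch_ms / 1000.0)


def select_lead(ecg, ecg_names, lead="II"):
    """Return the 1D sample list for `lead`, or None if the lead is absent."""
    if lead not in ecg_names:
        return None  # caller drops this record
    return ecg[ecg_names.index(lead)]


def patchify(signal, size):
    """Split a 1D signal into non-overlapping patches, dropping the tail."""
    n_patches = len(signal) // size
    return [signal[k * size:(k + 1) * size] for k in range(n_patches)]


ecg_patch = patch_len(249.89)    # ECG samples per 200 ms patch
ppg_patch = patch_len(124.945)   # PPG samples per 200 ms patch

# Median-length segment: 14,994 ECG samples and 7,497 PPG samples.
ecg_patches = patchify([0.0] * 14_994, ecg_patch)
ppg_patches = patchify([0.0] * 7_497, ppg_patch)
print(ecg_patch, ppg_patch, len(ecg_patches), len(ppg_patches))
```

Because ECG_fs = 2 × PPG_fs, a median 60-s segment gives the same number of patches in both streams, so ECG and PPG patch indices refer to the same 200 ms window.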
## Files written

- `docs/e0_report.json`: raw numbers
- `docs/e0_alignment.json`: foot-based alignment check numbers
- `docs/figures/ptt_histogram.png`: peak-based PTT (v1)
- `docs/figures/ptt_histogram_foot.png`: foot-based PTT (v2)
- `docs/figures/sanity_check.png`: 5 random 60-s aligned ECG+PPG overlays
- `scripts/e0_peek.py`, `scripts/e0_audit.py`, `scripts/e0_audit_v2.py`, `scripts/e0_alignment_check.py`

## Open follow-ups before E1 starts

1. Verify AF-positive count after joining to `mimic-iv-ecg` (Zack, Day 3 gate).
2. Swap the PPG peak detector for `neurokit2.ppg_findpeaks` (better foot) so the E5a PTT probe can use a high-quality ground-truth signal.
3. Commit an architectural-revision note to `RESEARCH_DEVELOPMENT.md` §2 and `ARCHITECTURES_EXPLORATION.md` Architecture F §v1: single-lead ECG, 250 Hz, 50-sample patches.