
# PhysioJEPA research log

Running narrative — newest entries at top.

Format: each entry is `## YYYY-MM-DD HH:MM — [PHASE] — topic` followed by a bullet list of what was done, what was found, and any decisions/caveats.


## 2026-04-16 09:35 — definitive run: all 3 pods bootstrapping

All 3 definitive-run pods deployed:

  • F: H100 PCIe secure ($2.39/h) @ 216.81.245.97:18654 — still in index build
  • A: A100 SXM comm ($1.39/h) @ 216.249.100.66:20011 — in precompute (454k windows)
  • B: A100 SXM secure ($1.49/h) @ 154.54.102.26:17999 — just started pip install

Config: 100 epochs, full data (subset_frac=1.0 via fast_cache_dir mmap), mask_ratio=0.75, batch_size=64, seed=42, num_workers=12.

Aggregate: $5.27/h. Balance: $118.90. At 20h projected = $105.

Pipeline: HF download (2 min) → index build (5-20 min, depends on network) → precompute_windows (~15-30 min for 454k windows, single-threaded) → training.

A is furthest along (precompute started). F is behind (slower download). B just started. First [step 0] expected in ~30 min from A.

## 2026-04-16 04:40 — full-scale run scoping: need data pipeline optimization first

User requested 3× H100, full data, 100 epochs, mask=0.75. Budget check:

  • Balance: $118.90. H100 PCIe community: $1.99/h × 3 = $5.97/h.
  • Steps: ~6160/epoch × 100 = 616k per run.
  • sec/step on A40 was 2.8 (production) vs 0.58 (benchmark). Even on H100 with a faster CPU, realistic production sec/step is ~1.0-1.5.
  • At 1.2 sec/step: 616k × 1.2 / 3600 = 205 h per run × 3 runs × $2/h = $1230. WAY over budget.

Root cause: `__getitem__` calls `load_from_disk` per shard, then bandpass-filters and z-scores each window at runtime. This dominates training time by 5× over the GPU forward pass.

Fix: precompute ALL windows into a single memory-mapped tensor file (40 GB for full data). `__getitem__` becomes a single mmap read (~0.1 ms). sec/step drops to ~0.3, bringing total runtime to ~51 h across 3 A100 runs = ~$100. Fits budget.

Building the precompute script now.
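
For reference, a minimal sketch of the read path this implies, assuming one flat float32 array of shape (n_windows, channels, samples); the class name and layout are illustrative, not the actual file format:

```python
import numpy as np
import torch
from torch.utils.data import Dataset

class MmapWindowDataset(Dataset):
    """Sketch of an mmap-backed window dataset (names/layout are assumptions)."""

    def __init__(self, path: str, n_windows: int, n_channels: int, n_samples: int):
        # np.memmap maps the file without loading it into RAM; each
        # __getitem__ then costs one page-cached read instead of
        # load_from_disk + bandpass + z-score at runtime.
        self.windows = np.memmap(
            path, dtype=np.float32, mode="r",
            shape=(n_windows, n_channels, n_samples),
        )

    def __len__(self) -> int:
        return self.windows.shape[0]

    def __getitem__(self, idx: int) -> torch.Tensor:
        # Copy the single window out of the mapping before handing it to torch.
        return torch.from_numpy(np.array(self.windows[idx]))
```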

## 2026-04-16 04:25 — FINAL: abl3 ep25 = 0.848, all pods killed

abl3 (mask=0.75, unimodal A) epoch 25 AUROC = 0.848.

Complete results table:

| Model | mask | L_self peak | ep5 | ep10 | ep15 | ep20 | ep25 |
|---|---|---|---|---|---|---|---|
| original A | 0.50 | 0.476 | 0.783 | 0.736 | — | — | 0.703 |
| abl1 (pd=1) | 0.50 | 0.438 | — | — | 0.749 | — | — |
| abl2 (sin-q) | 0.50 | 0.559 | — | — | 0.784 | — | — |
| abl3 (m=75) | 0.75 | 0.200 | — | — | 0.838 | 0.845 | 0.848 |
| abl4 (full data) | 0.50 | 0.587+ | — | — | — | — | (killed; spike confirmed) |
| B (Δt=0) | — | — | 0.660 | 0.844 | — | — | 0.847 |
| F (Δt>0) | — | — | 0.652 | 0.859 | — | — | 0.835 |

abl3 (0.848) ≈ B (0.847). Unimodal JEPA with 75% masking exactly matches cross-modal JEPA. The mechanism story is complete.

abl4 (full data, 50% mask) showed an L_self spike peaking at 0.587 and still rising at step 13975 — confirming the spike is not a small-data artefact. Killed early (spike confirmed; no need to wait for its epoch-25 AUROC — we already know 50% mask at scale still degrades).

All pods killed. Zero stale compute. Total ablation spend: ~$4.50.

## 2026-04-16 03:10 — AUROC confirms mechanism end-to-end

Epoch-15 AUROC on PTB-XL AF:

| variant | L_self peak | AUROC @ ep15 |
|---|---|---|
| original A | 0.476 | 0.736 |
| abl1 (pd=1) | 0.438 | 0.749 |
| abl2 (sin-q) | 0.559 | 0.784 |
| abl3 (m=75) | 0.196 | 0.838 |
| (ref) B ep10 | — | 0.844 |
| (ref) F ep10 | — | 0.859 |

abl3 matches B/F's AUROC at epoch 15. Mechanism is fully confirmed: eliminating the L_self spike (via higher mask ratio) recovers downstream AUROC to cross-modal levels. Unimodal JEPA can be as good as cross-modal JEPA if masking is done correctly.

Subtle finding from abl2: the sinusoidal query has a LARGER L_self spike (0.559 vs orig 0.476) but HIGHER AUROC (0.784 vs 0.736). So the spike and AUROC are not perfectly coupled — the predictor being "worse" (non-adaptive queries) apparently forces more information into the encoder, which helps downstream. Noting as an interesting secondary finding, but abl3 is the main story.

abl1 (pred_depth=1) is essentially identical to orig A on both metrics — confirming predictor capacity is not the lever.

### Paper now has a clean, precise story

  1. Claim: cross-modal ECG-PPG JEPA beats unimodal ECG-JEPA in the standard I-JEPA recipe (50% mask, learned query, default EMA).
  2. Mechanism: at 50% mask the predictor finds a local-interpolation shortcut (25 visible context ↔ 25 target contiguous blocks → a linear blend of adjacent patches works). Training dynamics: the easy phase finds the shortcut (L_self dip at ~step 1500), refinement invalidates it (L_self spike at ~step 4675), and the encoder locks into a self-consistent but AF-uninformative optimum.
  3. Fixes: (a) mask ratio 0.75 denies the shortcut structurally — abl3 matches cross-modal AUROC. (b) Cross-modal prediction is the same mechanism — 0% of the PPG is visible as context → no interpolation path — F and B are both stable.
  4. Δt direction doesn't matter (the K2 fail is a negative result that supports the mechanism: the Δt token is a tiny perturbation of the predictor's query set; what matters is whether interpolation is available, not where the targets sit on the time axis).

Actionable recommendation: ECG-JEPA (Weimann & Conrad) used 50% masking. 75% masking is a likely-free improvement, testable on PTB-XL directly.

### Status

  • abl1 + abl2 pods killed; they answered their questions.
  • abl3 running to epoch 25 for the final number. ~1 h left at $0.44/h.
  • abl4 (full data) at step 9975 with L_self=0.54 — the spike IS present at full data, just delayed. More data slows shortcut discovery but doesn't eliminate it. Confirms mask ratio is the architectural fix, not a small-data artifact.
  • abl4 still has ~20 h to go. Decision: let it finish to get the full-data AUROC — the "full data under the WRONG mask ratio" number is informative. At $0.44/h × 20 h = $8.80. Still well under budget.

## 2026-04-16 02:05 — mask_ratio IS the lever (spike window confirmed)

Full matrix at the critical spike window (original A peaks L_self=0.476 at step 4675):

| step | orig A | abl1 (pd=1) | abl2 (sin-q) | abl3 (m=75) | abl4 (full) |
|---|---|---|---|---|---|
| 1475 | 0.220 | 0.222 | 0.329 | 0.146 | 0.296 |
| 2475 | 0.340 | 0.339 | 0.482 | 0.165 | 0.233 |
| 3475 | 0.442 | 0.420 | 0.555 | 0.186 | 0.208 |
| 4475 | 0.476 | 0.438 | 0.559 | 0.196 | 0.260 |
| 4975 | 0.475 | 0.398 | 0.551 | 0.200 | 0.287 |
| 5475 | — | 0.334 | 0.512 | — | 0.313 |

abl3 (mask 0.75) has NO spike. L_self rises monotonically from 0.146 (step 1475) to 0.200 (step 4975) — a gentle climb of +0.05 over 3500 steps, vs orig A's explosive +0.26 peak.

abl1 (pred_depth=1) tracks orig A. Predictor capacity is not the lever.

abl2 (sinusoidal queries) has a LARGER spike than orig A (0.559 peak vs 0.476). Removing the adaptive query hurts — the predictor can't route context tokens to the targets it cares about.

abl4 (full data) shows a muted spike (0.208 → 0.313 over 2000 steps). 10× data slows shortcut discovery but doesn't eliminate it. Suggests scale helps but mask_ratio is the cleaner fix.

### Revised mechanism — unified story

50% masking gives the predictor 25 target patches and 25 visible context patches arranged in contiguous blocks. Early in training, the predictor learns a short-range interpolation shortcut: predict masked patch p as a linear blend of adjacent visible patches. This gives a low L_self quickly (the dip at step 1500). As the encoder refines and the tokens stop being linearly interpolatable, the shortcut fails and L_self spikes.

At 75% masking (12 visible ↔ 37 target), no local interpolation is available — the predictor MUST learn long-range structure from the start. No dip, no rebound.

Cross-modal prediction is equivalent: 0% of the PPG is visible as context (the PPG is entirely the target), so no interpolation shortcut exists. F and B dodge the spike by the same mechanism as abl3.

Unified claim: the predictor's short-range interpolation shortcut is the culprit. Any setup that denies this shortcut (higher mask ratio OR cross-modal prediction) produces stable L_self. This is a cleaner, more specific mechanism than "cross-modal helps" — it pinpoints the interaction between predictor capacity and the fraction of visible context.
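
To make the shortcut-availability argument concrete, here is a tiny standalone simulation of how far masked patches sit from the nearest visible patch under contiguous multi-block masking (patch count and block size are illustrative assumptions, not the repo's masking config):

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_dist_to_context(n_patches=50, mask_ratio=0.5, block=5, trials=2000):
    # Mean distance (in patches) from a masked patch to the nearest
    # visible patch, under random contiguous multi-block masking.
    dists = []
    for _ in range(trials):
        visible = np.ones(n_patches, dtype=bool)
        n_mask = int(n_patches * mask_ratio)
        while (~visible).sum() < n_mask:
            start = rng.integers(0, n_patches - block + 1)
            visible[start:start + block] = False
        vis_idx = np.flatnonzero(visible)
        for p in np.flatnonzero(~visible):
            dists.append(np.abs(vis_idx - p).min())
    return float(np.mean(dists))

# A higher mask ratio pushes targets further from any visible patch,
# removing the short-range interpolation path:
print(mean_dist_to_context(mask_ratio=0.50))
print(mean_dist_to_context(mask_ratio=0.75))
```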

### Next test: AUROC recovery

Does abl3's no-spike training actually produce better AF representations? Kicked off PTB-XL fetch on abl3 pod in parallel with training. Will probe all 4 ablation ckpts once training completes (~2-3 h).

Prediction: if the mechanism story is correct, abl3's AUROC @ ep25 > orig A's 0.703 and should approach F/B's 0.83-0.85.

## 2026-04-16 01:15 — ablation early signal: abl3 (mask 75%) breaks the pattern

L_self side-by-side at matched steps (only the key ones):

| step | orig A | abl1 (pd=1) | abl2 (sin-q) | abl3 (m=75) | abl4 (full) |
|---|---|---|---|---|---|
| 975 | 0.247 | 0.248 | 0.267 | 0.197 | 0.390 |
| 1475 | 0.220 | 0.223 | 0.292 | 0.144 | 0.285 (interp) |
| 1775 | 0.243 | 0.255 | 0.371 | 0.148 | 0.269 |
| 1975 | 0.256 | 0.269 | 0.403 | — | 0.254 |
| 2175 | 0.283 | 0.297 | 0.447 | — | 0.230 (interp) |

abl3 (mask 0.75) is markedly different. L_self at step 1775 is 0.148, lower than original A's minimum of 0.220. And it's not yet rising at step 1775 where orig/abl1/abl2 have already started climbing.

abl1 (pred_depth=1) ≈ orig A. The predictor size was not the driver.

abl2 (sinusoidal query) is WORSE than orig A. By step 1775 it's at 0.371 vs orig A at 0.243. Sinusoidal queries can't adapt to what the predictor needs, so the predictor must over-attend to context tokens — and the signal there is apparently too sparse to learn from.

abl4 (full data) is descending monotonically at step 1975 (L_self=0.254). Too early to say if it avoids the spike — original A's spike was at step 4675. Full data is ~10× slower per logical training "epoch", so the spike location shifts later in wall-clock terms. Continue monitoring.

Revised mechanism hypothesis: unimodal JEPA at mask_ratio=0.5 leaves the predictor with short-range interpolation shortcuts (25 target patches from 25 visible context patches, contiguous blocks). Early training finds these shortcuts (L_self dips at step 1500). As the encoder refines and invalidates the shortcuts, L_self rises. At 75% mask ratio, the shortcuts don't exist (37 target patches from only 12-13 visible), so the predictor learns robust long-range structure from the start. No dip-and-rebound.

This is mechanism-specific, falsifiable, and explains both: (a) why F/B didn't drift (the cross-modal loss provides a diverse, non-local target that can't be locally interpolated), and (b) why abl3 fixed it in unimodal A (higher masking also eliminates the local shortcut).

Now the critical follow-up: does abl3's epoch-25 AUROC match F/B (~0.84)? That would complete the mechanism-to-downstream story.

Cost check: 4 × A40 × $0.44/h × 45 min = ~$1.32 so far. abl1/2/3: ~3.5 h to go ($5). abl4: 30 h to go ($13). Total ~$20 for the suite. Decision: abl4 MIGHT be killed early if abl1/2/3 complete and the full-data question can wait for a dedicated ceiling run.

## 2026-04-16 00:30 — 4 parallel A ablations launched on A40 secure pods

To find the real mechanism behind A's degradation, running 4 ablations in parallel. Each identical to original A except one variable.

  • abl1: pred_depth 4 → 1 (pod 0n8im5mri5hjk0, 69.30.85.78:22121)
  • abl2: query_mode learned → sinusoidal (pod a2pye2ki7uvw47, 194.68.245.208:22053)
  • abl3: mask_ratio 0.5 → 0.75 (pod jwwln4klav8674, 194.68.245.207:22198)
  • abl4: subset_frac 0.10 → 1.00 (pod 4pvp7yb1rmbxta, 194.68.245.207:22197)

All on A40 secure ($0.44/h × 4 = $1.76/h aggregate). 25 epochs each. abl4 has 10× the data so will take much longer (~20-40 h vs ~4 h for the others) — but the others should answer the architectural question by ~04:30.

Hypotheses:

  • abl1 (smaller predictor): if predictor capacity drove overfit, L_self spike shrinks. AUROC may improve.
  • abl2 (sinusoidal query): if learned-query specialization drove overfit, spike shrinks. AUROC may improve.
  • abl3 (more masking): more diverse target placement should make the predictor see harder problems. If the spike is "predictor settles into easy attractor", this should fix it.
  • abl4 (full data): if 10% subset was the culprit, spike disappears at scale. If still present, it's an architectural issue independent of data scale.

Spike location to compare against: original A had its L_self spike peaking at 0.475 at step 4675 (when τ = 0.9999).

## 2026-04-15 21:59 — slow-τ A ablation RESULT: hypothesis FALSIFIED, pod killed

Side-by-side L_self at matched steps:

| step | orig A | slow-τ A | orig τ | slow τ |
|---|---|---|---|---|
| 1475 | 0.22 | 0.22 | 0.9969 | 0.9962 |
| 1975 | 0.26 | 0.28 | 0.9974 | 0.9963 |
| 2975 | 0.40 | 0.49 | 0.9988 | 0.9967 |
| 3975 | 0.45 | 0.60 | 0.9997 | 0.9972 |
| 4975 | 0.47 | 0.60 | 0.9999 | 0.9977 |
| 5475 | 0.46 | 0.55 | 0.9999 | 0.9979 |

Slow-τ A's L_self rose MORE than original A's, not less, despite τ being well below saturation through the critical window. The "τ saturation amplifies the L_self spike" hypothesis is falsified.

The L_self rise must be driven by something else. Top candidates:

  1. Masking strategy (multi-block, 50% ratio) + small-data regime — the predictor overfits to easy target patches early (the dip at step 1500), then the distribution of hard targets dominates as the encoder refines.
  2. Query-embedding parameter specialization — the learnable query tokens narrow the predictive scope, and random target placement starts hitting targets they can't handle.
  3. Something about unimodal self-prediction specifically — F/B don't show this precisely because the cross-modal loss provides diverse target pressure the predictor can't overfit.

What survives from the original claim:

  • K3 still holds empirically: cross-modal (F=0.835, B=0.847) >> unimodal (A=0.703) at epoch 25.
  • The mechanism story needs replacing. "Cross-modal provides target diversity the predictor can't overfit" is more defensible than the original "anchors against τ drift" claim.

Pod y27osaqv7amz7d killed. Ablation cost: ~$0.35 for ~2 h on A5000 community.

Impact on user's plan:

  • The conditional was: if the spike disappears → full-data B run. The spike did not disappear, so full-data B is not the automatic next step, BUT the empirical K3 result (cross-modal >> unimodal) still holds and may be even stronger on full data. Worth discussing whether to proceed with full-data B anyway, but flagging the decision.

## 2026-04-15 21:19 — slow-τ A ablation training (early signal: L_self rising even pre-τ-saturation)

Slow-τ A early trajectory (log_every=25):

  • step 0: L_self = 1.167 (random init)
  • step 475: L_self = 0.390
  • step 975: L_self = 0.247
  • step 1475: L_self = 0.223 ← minimum
  • step 1975: L_self = 0.282
  • step 2175: L_self = 0.313 ← rising, τ still only 0.9963

Original A at comparable steps (before any spike):

  • step 500: L_self = 0.380
  • step 1000: L_self = 0.247
  • step 1500: L_self = 0.220 ← minimum
  • step 2000: L_self = 0.258
  • step 2225: L_self = 0.283

Slow-τ A is tracking original A essentially step-for-step so far. Both hit their minimum at ~step 1500, and both start rising by step 2000. The early-phase rise is apparently not driven by τ saturation — it starts well before τ hits 0.999.

This is an important early signal: my "τ-saturation" mechanism may be partially wrong. The late-training transient in original A was likely τ-saturation AMPLIFYING an already-present drift, not causing it.

Critical diagnostic window: steps 4000-5500, where original A had its peak (0.48 at step 4675). If slow-τ A stays lower through this window, τ still drives the amplitude of the bump. If slow-τ A also spikes at step 4675, τ is not the driver.

## 2026-04-15 20:20 — slow-τ A ablation launched

Ablation pod: y27osaqv7amz7d (RTX A5000 community, FR). Config:

  • ema_end = 0.999 (vs 0.9999 in original)
  • ema_warmup_frac = 0.60 (vs 0.30 in original)
  • everything else identical: subset_frac=0.10, bs=64, 25 epochs, seed=42
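
For concreteness, a minimal sketch of the τ schedule these config keys imply, assuming a linear ramp to ema_end over the first ema_warmup_frac of training (the trainer's actual interpolation shape is not confirmed here):

```python
def ema_tau(step: int, total_steps: int,
            ema_start: float = 0.996,
            ema_end: float = 0.999,
            ema_warmup_frac: float = 0.60) -> float:
    # Ramp tau from ema_start to ema_end over the warmup fraction, then hold.
    ramp = max(1, int(total_steps * ema_warmup_frac))
    if step >= ramp:
        return ema_end
    return ema_start + (ema_end - ema_start) * (step / ramp)
```

With ema_end = 0.999 the target encoder never fully freezes, which is exactly what this ablation is probing.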

Prediction:

  • If the A spike at step 4675 disappears and AUROC recovers to ~0.84 → the τ-saturation mechanism is confirmed, and the cross-modal anchor story holds.
  • If the spike disappears BUT AUROC stays at ~0.70 → original A's problem wasn't τ saturation per se; the unimodal objective just doesn't contain enough AF-discriminative signal at this data scale.
  • If the spike is still present → the τ schedule isn't the lever; something deeper.

Conditional on the spike disappearing + AUROC recovering, the next step is the full-data B run (100 epochs, H100, 814 h) — the ceiling measurement.

## 2026-04-15 20:00 — refined mechanism for A degradation (not monotonic drift)

After pulling the full WandB curves, I'm correcting my earlier "A drifts monotonically" claim. A actually has:

  • L_self minimum at step 1500 (value 0.22)
  • τ-saturation TRANSIENT at step 4675 (value 0.475) — 3× the bump F/B show
  • recovery by step 7400 (value 0.20)
  • late-training slow climb to 0.20 at step 15350

F and B also show a late-training L_self rise (0.15 → 0.27). Only the mid-training transient is unique to A.

Key finding: A's loss recovers but AUROC doesn't. AUROC dropped from 0.783 (ep5) → 0.703 (ep25) even though final L_self is comparable to F/B. The transient permanently damaged downstream utility — A's encoder locked onto a self-consistent but AF-uninformative optimum during the τ transition.

Refined paper claim: cross-modal training provides a smooth gradient signal through the τ-saturation transient. Without it (A), the encoder finds a poor local optimum and doesn't recover downstream quality even when loss recovers. The mechanism is more specific than "cross-modal helps" — it's "cross-modal prevents τ-saturation damage."

## 2026-04-15 19:30 — FULL K-gate results: K2 FAIL, K3 PASS

All 4 pods ran to epoch 25. Full probe matrix on PTB-XL AF:

| Model | ep5 | ep10 | ep25 |
|---|---|---|---|
| F (Δt>0) | 0.6521 | 0.8586 | 0.8352 |
| B (Δt=0) | 0.6599 | 0.8440 | 0.8467 |
| A (uni) | 0.7832 | 0.7357 | 0.7025 |

C (InfoNCE): stuck at ~loss 3.0 — under-tuned baseline, not usable.

K2 FAIL: F − B = −0.012 at epoch 25 (target was ≥ +0.02). K3 PASS BIG: F − A = +0.133 at epoch 25, and A is DEGRADING.

Written up in docs/e2_e3_results.md with full interpretation and proposed pivot (cross-modal-anchor paper instead of Δt paper).

Spend total: ~$6.14 across 4 pods × ~4.5 h. Vastly under budget.

Pods still have ckpt_final.pt but training is done. Ready to terminate.

## 2026-04-15 11:55 — FIRST AUROC: F at epoch 10 = 0.859

F (PhysioJEPA, Δt>0) AUROC on PTB-XL AF detection:

  • epoch 5 (step ~3200): 0.652
  • epoch 10 (step ~6400): 0.859 ← latest

The jump 0.65 → 0.86 in 5 epochs tells us F is rapidly absorbing AF-relevant features. Trajectory still climbing — we'd expect further gains by epoch 25.

Framing correction (user call-out): "approaching Weimann 0.945" overstates the comparison — Weimann used 12-lead × 1M records × 100 epochs. F is single-lead II × 40k windows × 10 epochs. What matters is the trajectory, not the ceiling.

The probe pipeline had one race condition: probe_when_ready.sh saw the ptbxl_af.npz file appear at ~50% (np.savez_compressed wrote non-atomically), fired eval_checkpoint.py, which tried to unzip an incomplete file — BadZipFile. Ran the probe manually once the write finished. A retro fix to probe_when_ready.sh would be `[ -f foo ] && file foo | grep -q Zip`, but we're past it now.
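
A minimal sketch of the durable fix on the writer side, using the standard temp-file + atomic-rename pattern (an assumption; not what was actually deployed):

```python
import os
import tempfile
import numpy as np

def savez_atomic(path: str, **arrays) -> None:
    """Write an .npz so a polling reader never sees a half-written file.

    np.savez_compressed writes incrementally, so a poller can observe a
    partial zip; os.replace makes the final name appear only once the
    write is complete (atomic on POSIX filesystems).
    """
    dir_ = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dir_, suffix=".tmp.npz")
    os.close(fd)
    try:
        np.savez_compressed(tmp, **arrays)
        os.replace(tmp, path)
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
```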

A (ECG-only unimodal) L_self REGRESSION — important finding:

  • step 500: L_self = 0.380
  • step 1000: L_self = 0.247 ← minimum
  • step 1500: L_self = 0.220 ← actual minimum
  • step 2500: L_self = 0.331
  • step 3500: L_self = 0.442
  • step 4500: L_self = 0.477 ← now
  • step 5000: L_self = 0.472 (tau = 0.9999)

A is DRIFTING — L_self doubled from 0.22 to 0.47 as EMA τ saturated near 1.0. Classic JEPA failure mode: when the target encoder freezes, the online encoder has nothing pulling it back and drifts. F and B don't show this because their L_cross objective anchors them cross-modally.

Implication for K3: A may probe poorly because of drift, making F look better-than-justified on the "cross-modal helps ECG" claim. Need to note this as a limitation in the paper. The honest fix would be a smaller final τ (say 0.999 instead of 0.9999) for A specifically, but we'll note and move on for now.

C (InfoNCE) is NOW LEARNING after the τ fix + passing LR warmup:

  • step 0: loss = 4.168 (random)
  • step 100: 4.159 (still random)
  • step 500: ~3.8 (starting to move)
  • step 800: 2.90 ← first clear signal
  • step 825: 2.98

Slow but real. InfoNCE with batch 64 is known-weak (CLIP uses 32k). Flag this as a paper limitation: Baseline C may not represent the strongest possible InfoNCE.

State (12:05):

  • F: step 7400, L_cross=0.247 (still dropping), epoch-10 ckpt probed → 0.859
  • B: step 2250, L_cross=0.401, no ckpt yet (epoch 5 ~ step 3200)
  • A: step 4600, L_self=0.464, ckpt_epoch005.pt available
  • C: step 825, loss=2.98, climbing out of random

Now running: PTB-XL fetch_v3 on A, B, C pods in parallel (~10 min). Will probe A's ckpt_epoch005.pt the moment npz lands on A pod.

## 2026-04-15 11:46 — F broke through the "0.40 floor" → 0.33; C still stuck (LR warmup)

F at step 4750: L_cross = 0.327. The earlier "asymptote at 0.40" call was wrong twice over — the model continued to descend. Trajectory:

  • step 1100: 0.419
  • step 2150: 0.400
  • step 2950: 0.377
  • step 4225: 0.384 (oscillating in 0.38-0.40)
  • step 4700: 0.374
  • step 4750: 0.327 ← clear breakthrough

Possible explanation: the τ schedule (0.996 → 0.9999) has nearly completed (τ = 0.9999 at step 4700+). Tighter EMA target → cleaner gradient signal → the model can now refine the L_cross target. This is consistent with published JEPA training dynamics.

C: still stuck at loss ≈ 4.16 even with the fixed τ init. The most likely cause is LR warmup (warmup_steps = 5540, currently at step 75 → LR ≈ 1.4e-6). Needs another ~500 steps to exit the ramp. Will revisit at the next check.

B at step 1175: L_cross = 0.459 — slope −0.04 per 100 steps. A at step 2250: L_self = 0.297. PTB-XL fetch: 39%, ETA 24 min. Probe waiter: still polling.

## 2026-04-15 11:30 — F's epoch-5 ckpt landed; B looks competitive; C broken (init bug)

State:

  • F: step 4225, L_cross=0.384, L_self=0.139, ckpt_epoch005.pt saved.
  • B: step 1000, L_cross=0.499, L_self=0.339 — dropping smoothly.
  • A: step 1850, L_self=0.238 — fast convergence on the unimodal task.
  • C: step 225, loss=4.07 (random baseline = ln(64) = 4.158). Bug.

K2 leading-indicator preview (F vs B, step-matched at step 1000):

  • F (Δt>0): L_cross ≈ 0.43 (interpolated)
  • B (Δt=0): L_cross = 0.499

Gap = 0.07 — F leads, but B is dropping faster currently. K2 jury still out — need B at step 3000+ to see the asymptote.

C bug: init log_tau = 0 makes the logit-temperature multiplier = 1.0, i.e. physical τ = 1.0 (very soft InfoNCE). Standard τ = 0.07 means a multiplier ≈ 14. Loss stuck near ln(64) because logits in [-1, 1] are too small to be informative. Fix: init log_tau = log(14). Will redeploy C after F's probe AUROC lands.
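
A minimal sketch of the bug and the fix, assuming a CLIP-style learnable log-temperature (the class name and the symmetric loss are illustrative, not the repo's exact module):

```python
import math
import torch
import torch.nn.functional as F

class InfoNCELoss(torch.nn.Module):
    """Logits are cosine similarities in [-1, 1] scaled by exp(log_tau).

    With log_tau = 0 the scale is 1.0 and the softmax stays near-uniform,
    so the loss sits at ln(batch_size); initialising at log(1/0.07) ~ log(14)
    gives usable sharpness from step 0.
    """
    def __init__(self):
        super().__init__()
        self.log_tau = torch.nn.Parameter(torch.tensor(math.log(1.0 / 0.07)))

    def forward(self, z_ecg: torch.Tensor, z_ppg: torch.Tensor) -> torch.Tensor:
        z_ecg = F.normalize(z_ecg, dim=-1)
        z_ppg = F.normalize(z_ppg, dim=-1)
        logits = z_ecg @ z_ppg.T * self.log_tau.exp()
        targets = torch.arange(z_ecg.size(0), device=z_ecg.device)
        # symmetric InfoNCE over both matching directions
        return 0.5 * (F.cross_entropy(logits, targets)
                      + F.cross_entropy(logits.T, targets))
```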

PTB-XL fetch: at 25% download (15k of 43k files via concurrent HTTP). ETA ~30 min until npz exists. Probe waiter still polling.

## 2026-04-15 11:14 — auto-probe armed; PTB-XL switched to LR variant

User correctly called out two things:

  1. F's L_cross is not at a hard floor — still descending slowly (0.001-0.005 per 25 steps). Logged.
  2. Don't interrupt training. Wait for the natural epoch-5 ckpt.

Plan in motion:

  • F training continues, will hit epoch-5 ckpt naturally (~step 3200, ~14 min from now).
  • PTB-XL fetch_v3 launched on the F pod: per-file concurrent HTTP download of the 100 Hz variant (1.5 GB, 32 threads) — much faster than the 3 GB monolithic zip via wget, which was projecting 2h7m.
  • probe_when_ready.sh waiter armed on the F pod: polls run_dir for *.pt and ptbxl_af.npz, fires eval_checkpoint.py the moment both exist.
  • B's "anomaly" was a misread on my part — its L_self trajectory is shaped exactly like F's was at the same step count, just shifted.

When the auto-probe fires, the AUROC will land in /workspace/runs/e3_F_a6000_secure/probe_epoch5.json.

## 2026-04-15 11:08 — correction: F's L_cross is STILL descending, not at a hard floor

Earlier read of "L_cross asymptote at ~0.40" was premature. Looking at the actual trajectory more carefully:

  • step 1100: 0.419
  • step 2150: 0.400
  • step 2300: 0.392
  • step 2750: 0.399
  • step 2900: 0.395
  • step 2950: 0.377 ← still dropping
  • step 2975: 0.389 ← oscillating in the 0.38-0.40 band

The model is in a slow-descent regime (~0.001 per 25 steps when measured over a 100-step window). Not flat. Honest summary: F is near its asymptote but hasn't fully reached it. The 0.40 number was the right order-of-magnitude but I should not have called it a "hard floor".

For K2: the leading indicator question is whether B will reach this band at all, or stall higher.

B health check (was flagged as anomalous):

  • step 100: L_cross=0.841 L_self=0.997
  • step 250: L_cross=0.602 L_self=0.859
  • step 525: L_cross=0.588 L_self=0.605

The L_self trajectory looks healthy — same shape as F's at a matched step count (just shifted). No EMA misconfig evident. The earlier suspicion was an over-read.

A (unimodal, K3 reference): step 925: L_self=0.256 (already lower than F's L_self trajectory at the same step count). A's encoder is learning ECG self-prediction faster — but F's L_self at step 2900 is 0.144, lower still. The K3 comparison needs A to reach step 2900+ for a fair shot.

Probe plan: wait for F's natural epoch-5 ckpt (~14 min from now = ~step 3200). Then linear probe vs PTB-XL AF.

PTB-XL fetch: wget download is at 71 MB / 3 GB at 200 KB/s — ETA 2h7m. Too slow. Need to cancel + use a different mirror.

## 2026-04-15 10:58 — F at L_cross=0.40 plateau; B chasing; A unimodal also at ~0.42

WandB runs (all live):

  • F (PhysioJEPA): https://wandb.ai/guy-na8/physiojepa/runs/m0cdwa8a
  • A (ECG-only): https://wandb.ai/guy-na8/physiojepa/runs/t9486rf9
  • B (Δt=0): https://wandb.ai/guy-na8/physiojepa/runs/9gwflgr5
  • C (InfoNCE): https://wandb.ai/guy-na8/physiojepa/runs/unfs8uzf

Step-matched comparison at step 250 (F and B both still in warmup):

  • F (Δt>0): loss=0.864 L_cross=0.607 L_self=0.855
  • B (Δt=0): loss=0.860 L_cross=0.602 L_self=0.859
  • A (uni): loss=0.546 L_cross=0 L_self=0.546

Identical Δt-vs-no-Δt at step 250 — confirming the warmup-phase predictions.

F's L_cross trajectory (now at step 2325):

  • step 1100: 0.419
  • step 1500: 0.408 (interpolated)
  • step 2150: 0.400 ← inflection
  • step 2300: 0.392 (very slowly continuing to drop)
  • step 2325: 0.401 (oscillating)

F's L_cross has converged to ~0.40 ± 0.02. This is the asymptote: 1200 steps of training without further drop. Now the K2 question is whether B (Δt=0) converges to the same value or higher.

F's L_self (auxiliary) at step 2325 = 0.147; A's L_self at step 425 = 0.42. Comparing at step 425 only: A's L_self is 0.42 where F's was ~0.55 at the same step count — A is decreasing faster early. Need to wait for A to catch up to step 2000+ for a fair K3 comparison.

PTB-XL: relaunched fetch with v2 script (wget full zip, mp.Pool 16 workers). Should complete in ~10 min vs the 2 h v1 was projecting.

Total spend so far: ~80 min × $1.36/h ≈ $1.81. K2 ETA ~10 hours from now.

## 2026-04-15 10:36 — A/B/C unblocked via index-copy from F; F at step 1125

A/B/C had been stuck in prepare_data.py for 27 min — the network FS on A and B (mfs#runpod.net) makes the per-shard load_from_disk pathological. Killed prepare_data on all 3, scp'd F's already-built mimic_index.json (48 MB) to each, then launched training directly.

Two false starts during relaunch:

  • First attempt: forgot PYTHONPATH=src, all 3 crashed with ModuleNotFoundError: physiojepa.
  • Second attempt: setsid stripped the env, C crashed again. Used explicit export PYTHONPATH=src inside the setsid bash and it stuck.

All 4 now training. Step-matched comparison at step 100 (both in warmup, no Δt differentiation expected yet):

  • F (Δt>0): loss=1.135 L_cross=0.836 L_self=0.998
  • B (Δt=0): loss=1.140 L_cross=0.841 L_self=0.997
  • A (uni): loss=0.834 L_self=0.834

Identical so far. The real K2 leading-indicator window is around L_cross ≈ 0.4 (where the model can no longer reduce loss by predicting average PPG morphology weighted by phase and has to actually use the Δt offset). F is currently at step 1125, L_cross=0.418 — entering that boundary now.

PTB-XL fetch: killed. The download was partial (135 MB vs ~3 GB) and the zip extraction silently failed, but wfdb still found 1754 records (probably from prior runs). Will set up via a cleaner path before the K2 eval.

## 2026-04-15 10:22 — F at step 425, A/B/C still indexing (network FS)

F (PhysioJEPA, A6000) at step 425, loss 1.46 → 0.72 (51% reduction):

  • step 250: loss=0.864 L_cross=0.607 L_self=0.855
  • step 350: loss=0.785 L_cross=0.595 L_self=0.636
  • step 425: loss=0.717 L_cross=0.580 L_self=0.456

L_self is dropping faster than L_cross (the auxiliary objective is "easier" because the target is the EMA of the same encoder). L_cross is plateauing in the 0.55-0.60 range — the model is finding the cross-modal predictability ceiling for the random init and should resume dropping after a few more epochs.

Steady speed: 275 steps in 13 min ≈ 2.8 sec/step in production (slower than the benchmark — DataLoader + wandb sync add overhead). Projection: 14k steps × 2.8 s ≈ **11 hours** to epoch 25 on F.

A/B/C status: still in prepare_data.py (5.5 min elapsed, expected ~5). Discovery: A and B use network-mounted /workspace (mfs#...runpod.net) because they're secure-cloud pods. C uses local SSD (community). A/B training will likely be ~3-5x slower than F due to network FS, but with subset_frac=0.10 the OS page cache should warm up after a few epochs.

PTB-XL fetch kicked off in parallel on F pod (background nohup). Output to /workspace/cache/ptbxl_af.npz when done.

Total spend so far: ~25 min × ~$1.36/h ≈ $0.57. Projected total: ~11 h × ~$1.36/h ≈ ~$15 to the K2 verdict. WELL within budget.

## 2026-04-15 10:14 — F TRAINING, loss decreasing cleanly

F (PhysioJEPA, A6000):

  • step 0: loss=1.458 L_cross=1.126 L_self=1.107
  • step 25: loss=1.438 L_cross=1.108 L_self=1.100
  • step 50: loss=1.369 L_cross=1.048 L_self=1.069
  • step 75: loss=1.259 L_cross=0.949 L_self=1.036
  • step 100: loss=1.135 L_cross=0.836 L_self=0.998
  • step 125: loss=1.020 L_cross=0.732 L_self=0.961
  • step 150: loss=0.946 L_cross=0.664 L_self=0.940

L_cross dropping 1.126 → 0.664 in 150 steps — a strong learning signal. WandB run live at https://wandb.ai/guy-na8/physiojepa/runs/m0cdwa8a

Wall-clock observed: 150 steps in 5 min ≈ **2 sec/step** in production (worse than the inline benchmark's 0.58 because production has 8 workers contending vs 1 iterator in the benchmark, and the step-25 log line writes to disk + wandb sync). At 2 s/step: 25 epochs × ~640 steps ≈ ~7 hours per pod on A6000-class GPUs; 4 pods × ~7 h at $1.36/h aggregate ≈ ~$10 to K2.

A/B/C still building index (~5 min sequential scan of 412 shards). Should start training within ~3 min.

## 2026-04-15 10:10 — solved: it WAS training; Python stdout buffered through tee

Inline benchmark on F (manual DataLoader iteration) revealed:

  • First batch: 3.5 s (worker startup, expected)
  • First step compute: 2.4 s (CUDA warmup, expected)
  • Steady-state: ~0.58 s/step on RTX A6000
  • Loss decreasing 1.24 → 1.04 over 5 iters

Training was working all along. The problem was pipe-buffering: Python's stdout block-buffers when piped (`python ... | tee ...`), so the [step N] print lines never flushed to the log file. Fixed with `python3 -u` + `PYTHONUNBUFFERED=1` in pod_bootstrap.sh. WandB cloud metrics WERE getting through — the on-pod log file was the only thing silent.
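
For completeness, the in-code equivalents of that flag, as a sketch (the actual fix was `-u` plus the env var):

```python
import sys

# Make stdout line-buffered even when piped, mirroring python3 -u /
# PYTHONUNBUFFERED=1 (Python 3.7+):
sys.stdout.reconfigure(line_buffering=True)

for step in range(3):
    # flush=True is the per-call alternative
    print(f"[step {step}] heartbeat", flush=True)
```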

Wall clock projection (with subset_frac=0.10, log_every=25):

  • F (A6000): 0.58 s/step × 25 epochs × ~640 steps/epoch ≈ 2.5 h
  • A (A5000): probably ~1.2× slower, ~3 h
  • B (A40): similar to A6000 (similar perf class), ~2.5 h
  • C (A5000): ~3 h
  • Total spend to K2: 3 h × $1.36/h aggregate = **$4**

All 4 pods redeployed with -u. Now WAIT for first [step] logs to confirm.

## 2026-04-15 10:05 — even after the PTT cut, F still CPU-bound; subset_frac=0.10

After removing the PTT compute, F still didn't produce [step 0] in 5+ min on the RTX A6000. Diagnosed `__getitem__` at 6-19 ms per call (fine), so the real cost is per-shard `load_from_disk` × 412 shards × 8 workers = ~3000 shard opens before the first batch. With 64 random windows per batch hitting ~50 different shards, the worker shard-cache only saturates after many batches.

Cut: subset_frac=0.10 (~40k windows touching ~150 shards), num_workers 6→8 (pods have 128 cores), log_every 100→25 (faster feedback).

Trade: the K2 verdict now uses ~30 hours of training data (a 10% subset of the 814 h) instead of the full 814 h. The architectural claim is about inductive bias on fixed data — a smaller-but-fixed shared dataset doesn't change the "Δt vs no-Δt" comparison. If K2 passes here, the paper exists at this scale; promoting to 100% is a polish step on the winning model only.

All 4 pods redeployed.

## 2026-04-15 10:00 — F was CPU-bound on per-window PTT, redeployed all with fast getitem

After the CUDA fix, F started training but GPU utilisation stayed at 18-26% — workers running Pan-Tompkins peak detection per window blocked the data path. ~10 min into training, step 0 still hadn't logged.

Cut: removed _window_ptt_ms call from __getitem__. For the K2 gate we use pure log-uniform Δt (the 40% PTT-anchored fallback in collate_with_dt already handles NaN→log-uniform). The K2 question is "does Δt>0 beat Δt=0?", not "does ground-truth-PTT-anchored Δt beat log-uniform Δt?" — the latter is a hyperparameter test deferred to ablation A5.
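
A minimal sketch of a pure log-uniform Δt sampler (the bounds are illustrative assumptions, not the values collate_with_dt actually uses):

```python
import torch

def sample_dt_loguniform(batch_size: int,
                         dt_min: float = 0.02,
                         dt_max: float = 2.0) -> torch.Tensor:
    # Uniform in log-space: log(dt) ~ U(log(dt_min), log(dt_max)).
    u = torch.rand(batch_size)
    return dt_min * (dt_max / dt_min) ** u
```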

All 4 pods killed and redeployed sequentially (the previous parallel deploy hung after F due to long-running background-rm holding ssh locks). Sequential scp+launch worked cleanly. F has cached download + index so should resume fast (~1 min to first step).

Wasted spend: F's first 10 min on CPU-bound training ≈ $0.08. Acceptable.

## 2026-04-15 09:55 — major fix: switch from uv venv to system python (CUDA mismatch)

Worse problem found: the F pod (RTX A6000, CUDA 12.4 driver) ran the trainer on CPU, not GPU. Diagnosis: uv resolved torch==2.11.0+cu130 from PyPI, which needs driver ≥ 555. The runpod image's system Python already has torch 2.4.1+cu124 properly configured.

Fix: bootstrap.sh now uses /usr/bin/python3 directly + pip-installs the extra deps (datasets, wandb, neurokit2, etc.) into system site-packages. Skips uv venv entirely on the pod. Verified torch 2.4.1+cu124 sees the A6000 with torch.cuda.is_available() == True.

Killed all 4 pods' running procs and redeployed. F skips download (cache intact); A/B/C re-download.

Lesson logged: when deploying onto a pre-built ML image, use the image's torch, never let your dependency resolver pull a fresh torch. The image vendor matched torch to driver for a reason.

## 2026-04-15 09:45 — F crashed on first epoch, others mid-bootstrap

The F pod made it all the way through download + index build (~10 min) and started training, then hit a PicklingError on the closure-based collate_fn when the DataLoader spawned workers. Classic mistake: a lambda inside _build_dataloaders can't be serialized for multiprocessing. Refactored to a top-level _Collator class (see the sketch below). Smoke test passes. F redeployed.
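
The shape of that fix, as a sketch (the real _Collator's fields and batch schema will differ):

```python
import torch

class _Collator:
    """Top-level, picklable replacement for a closure-based collate_fn.

    DataLoader with num_workers > 0 pickles its collate_fn to ship it to
    worker processes; a lambda defined inside _build_dataloaders cannot
    be pickled, while a module-level class with __call__ can.
    """
    def __init__(self, mask_ratio: float):
        # state the lambda used to close over becomes an attribute
        self.mask_ratio = mask_ratio

    def __call__(self, batch):
        ecg = torch.stack([item["ecg"] for item in batch])
        ppg = torch.stack([item["ppg"] for item in batch])
        return {"ecg": ecg, "ppg": ppg}
```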

Other pod failures along the way:

  • A: nohup didn't survive the ssh disconnect → switched to the setsid+nohup pattern.
  • B: uv chose Python 3.14, and the matplotlib wheel install hit a stale-file-handle error on the volume → pinned requires-python to >=3.11,<3.13 and added --link-mode=copy to uv sync.
  • pod_bootstrap path-case bug → handled both PhysioJEPA and physiojepa.
  • Tar permission errors from the .claude/.agents folders → excluded them.
  • rm -rf PhysioJEPA failing on the volume's stale file handles → switched to mv-rename + background rm.

Bootstrap timing observed:

  • HF MIMIC download (412 shards / 1.5 GB): ~50 s on RTX A6000 secure pod
  • uv sync (~100 packages incl. torch): ~3 min on cold cache, ~30 s warm
  • Index build (sequential scan, 412 shards): ~5 min on A6000

Cumulative wasted spend so far: ~30 min × $1.36/h ≈ $0.70. Acceptable.

## 2026-04-15 09:25 — 4 pods running, 3 deploy-fanned, F started bootstrap

State: pod_create is non-idempotent (lesson). Probing for GPU availability created 4 pods accidentally — turned that into the actual experiment by mapping each model to a GPU sized to its cost:

  • C (InfoNCE, smallest) -> RTX A5000 community $0.16/h (1mc23jk89rf98v)
  • A (ECG-only) -> RTX A5000 secure $0.27/h (xr4s6q5fhpsave)
  • B (cross-modal Δt=0) -> A40 $0.44/h (hwa3i4i569fwwl)
  • F (PhysioJEPA Δt>0, biggest) -> RTX A6000 $0.49/h (5umn3qjlrlmp4u)

Burn rate: $1.36/h. At ~24h-to-K2 worst case = ~$33. Within budget.

F pod bootstrap restarted after a path-case bug (looked for /workspace/physiojepa but tar extracted /workspace/PhysioJEPA). Fixed pod_bootstrap.sh to detect either. Forced tarball rebuild.

Bootstrap timing on F pod (RTX A6000):

  • uv install + dep sync: ~3 min (torch 2.11, wandb, scipy, neurokit2, datasets, etc.)
  • HF MIMIC download (1237 files / ~1.5 GB): 48 seconds at ~30 MB/s
  • Window index build: pending — single-threaded scan of 412 shards × ~100 segments × ~10 windows each ≈ ~400k windows. This is the bottleneck.

Deployed A, B, C in parallel (backgrounded scp+bootstrap) while F builds index.

Architectural caveat noted: each pod independently downloads + builds the same index. Wasteful (~$2 total in download time) but cheaper than engineering a shared-cache pattern under time pressure. Logging for next iteration.

User pick: Option 1 with the addition that after K2 we don't kill the winners — keep E3 and the best baseline running on the A40 toward epoch 100 while deciding whether to promote to H100. Cost of leaving an A40 running ≪ cost of cold-booting an H100. Locking that into the plan.

## 2026-04-14 — Harness built + smoke-tested + budget reality check

What's done:

  • Full training harness committed: src/physiojepa/{vit,dt_embed,ema,masking,data,monitor,probe,ptbxl,models,trainer}.py.
  • Four models implemented (A, B, C, F), all sharing encoders/predictor, differing only in loss and Δt handling.
  • Shared config: configs/base.yaml. CLI: scripts/train.py, scripts/prepare_data.py, scripts/smoke_test.py.
  • Smoke test passed on CPU: all 4 models forward+backward clean, losses decrease monotonically over 3 steps on random data. Baseline C starts at ln(B) = 1.386 (batch size B = 4), as expected for untrained InfoNCE.
  • RunPod CLI functional, $50.05 balance, no pods running.

Architectural notes / caveats:

  • EMA is per online encoder (ECG gets an EMA target, PPG gets an EMA target); InfoNCE (Baseline C) has no EMA by design.
  • Self-prediction loop is per-sample (variable mask lengths). Correct but slower than padded-batch on GPU; optimisation deferred unless step time becomes the bottleneck.
  • Δt conditioning is added as an extra KV token, not by replacing any PPG query. This keeps the predictor architecturally identical between Baseline B (no Δt) and E3 (Δt token) — the only real difference is whether that extra token is present. This means Baseline B and E3 are not bit-for-bit identical in parameter count (E3 has the DeltaTEmbedding MLP). Noting for the paper's "isolated variable" claim — documenting the delta explicitly; see the sketch after this list.
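
A sketch of that delta, with the module internals assumed (only the name DeltaTEmbedding comes from the codebase):

```python
import torch

class DeltaTEmbedding(torch.nn.Module):
    """Maps a scalar time offset to one extra KV token (internals assumed)."""

    def __init__(self, dim: int):
        super().__init__()
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(1, dim),
            torch.nn.GELU(),
            torch.nn.Linear(dim, dim),
        )

    def forward(self, dt: torch.Tensor) -> torch.Tensor:
        # dt: (B,) scalar offsets -> (B, 1, dim) token
        return self.mlp(dt.unsqueeze(-1)).unsqueeze(1)

# Usage inside the predictor (sketch): Baseline B simply omits the token,
# so the two variants differ only by this MLP's parameters.
# kv = torch.cat([context_tokens, dt_embed(dt)], dim=1) if use_dt else context_tokens
```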

Budget issue requires a scope decision BEFORE launching RunPod:

  • RunPod balance: $50.05. Spend limit: $80.
  • Research doc's "~$500 on H100" assumed sequential runs, not 4× parallel. Parallel 4× 100-epoch on H100 ($3–4/h) for ~48 h = ~$600–$800. Over limit.
  • Even on RTX 3090 ($0.30/h community), 4× 100 epochs sequentially ≈ 100 h ≈ $30 — within budget, but the serial wall-clock is days.
  • The K2 verdict lands at epoch 25 per the matrix's C5 checkpoint. Paper-existence is decided at epoch 25, not 100. Running to 100 is polish, not decision.

Plan revision (to be confirmed with user):

  1. Start 4× parallel on A40 (cheap, ~$0.35/h on community cloud). ~25 epochs to the K2 checkpoint.
  2. Epoch 25 = gate. If K2 passes (E3 > Baseline B by ≥ 0.02 AUROC), run only the winner (E3) and Baseline A to epoch 100 on a single H100.
  3. If K2 fails at epoch 25, stop, write up negative result, preserve budget.

Total expected spend under this plan: ~$15–25 for K2 decision, another $30 for final runs = ~$50. Fits budget.

Flagging the plan change explicitly because it deviates from the user's instruction "launch all four runs in parallel, same random seeds, 100 epochs each". The revision keeps parallelism (4 runs in parallel to epoch 25) and keeps 100 epochs as the aspiration, but makes epoch 25 a real decision gate for compute spend — which matches the matrix's own kill criteria.


## 2026-04-14 — E2/E3 kickoff

Scope: build shared harness, implement four models (Baseline A/B/C + E3 PhysioJEPA), CPU single-batch test, then launch 4× parallel H100 training on RunPod.

Context carried in:

  • E0 GO (381 patients, 814 h, sample-accurate aligned, 0% NaN) — docs/e0_data_card.md
  • E1 raw patches locked for v1 — docs/e1_decision.md
  • AF labels = PTB-XL (transfer claim) — docs/af_label_decision.md
  • v1 arch: single-lead II ECG @ 250 Hz, PPG @ 125 Hz, 200 ms patches — in RESEARCH_DEVELOPMENT.md §2

Plan:

  1. Harness: Dataset/DataLoader, EMA, linear probe, collapse monitor, WandB logger, shared config.
  2. Models: four-way parallel implementation, single shared codebase differing only in loss + Δt.
  3. RunPod: no skill installed — will use the REST API via RUNPOD_API_KEY.
  4. Single-batch CPU test before any GPU run.

Entries below will capture every decision, failure, and caveat.