# PhysioJEPA research log
*Running narrative β€” newest entries at top.*

Format: each entry is `## YYYY-MM-DD HH:MM — [PHASE] — topic` followed by a bullet list of what was done, what was found, and any decisions/caveats.

---

## 2026-04-16 09:35 β€” definitive run: all 3 pods bootstrapping

All 3 definitive-run pods deployed:

  F: H100 PCIe secure ($2.39/h) @ 216.81.245.97:18654 β€” still in index build
  A: A100 SXM comm    ($1.39/h) @ 216.249.100.66:20011 β€” in precompute (454k windows)
  B: A100 SXM secure  ($1.49/h) @ 154.54.102.26:17999 β€” just started pip install

Config: 100 epochs, full data (subset_frac=1.0 via fast_cache_dir mmap),
mask_ratio=0.75, batch_size=64, seed=42, num_workers=12.

Aggregate: $5.27/h. Balance: $118.90. At 20h projected = $105.

Pipeline: HF download (~2 min) β†’ index build (~5-20 min, depends on network) β†’
precompute_windows (~15-30 min for 454k windows, single-threaded) β†’ training.

A is furthest along (precompute started). F is behind (slower download).
B just started. First [step 0] expected in ~30 min from A.

## 2026-04-16 04:40 β€” full-scale run scoping: need data pipeline optimization first

User requested 3Γ— H100, full data, 100 epochs, mask=0.75. Budget check:
- Balance: $118.90. H100 PCIe community: $1.99/h Γ— 3 = $5.97/h.
- Steps: ~6160/epoch Γ— 100 = 616k per run.
- sec/step on A40 was 2.8 (production) vs 0.58 (benchmark). Even on H100
  with faster CPU, realistic production sec/step is ~1.0-1.5.
- At 1.2 sec/step: 616k Γ— 1.2 / 3600 = 205h per run Γ— 3 runs Γ— $2/h = $1230. WAY over budget.

Root cause: __getitem__ calls load_from_disk per shard and runs bandpass + z-score
on every window at runtime. This data path takes ~5× as long as the GPU forward,
so it dominates training time.

Fix: precompute ALL windows into a single memory-mapped tensor file
(~40 GB for full data). __getitem__ becomes a single mmap read (~0.1ms).
sec/step drops to ~0.3, bringing total runtime to ~51h across 3 A100 runs
= ~$100. Fits budget.

Building the precompute script now.
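
For reference, the intended shape of that fix, as a minimal sketch using only
numpy and torch: one precompute pass writes every already-filtered, z-scored
window into a single float32 memmap, and the training dataset's __getitem__
becomes one slice read. The file name, window lengths, and dtype here are
illustrative assumptions, not the actual precompute script.

```python
import numpy as np
import torch
from torch.utils.data import Dataset

N_WINDOWS, ECG_LEN, PPG_LEN = 454_000, 2500, 1250   # assumed: 10 s @ 250 Hz / 125 Hz

def precompute(window_iter, out_path="windows_f32.npy"):
    """Write every already-filtered, z-scored window once into one contiguous file."""
    out = np.lib.format.open_memmap(out_path, mode="w+", dtype=np.float32,
                                    shape=(N_WINDOWS, ECG_LEN + PPG_LEN))
    for i, (ecg, ppg) in enumerate(window_iter):
        out[i, :ECG_LEN] = ecg
        out[i, ECG_LEN:] = ppg
    out.flush()

class MmapWindows(Dataset):
    """__getitem__ is a single mmap slice: no load_from_disk, no runtime filtering."""
    def __init__(self, path="windows_f32.npy"):
        self.buf = np.load(path, mmap_mode="r")
    def __len__(self):
        return self.buf.shape[0]
    def __getitem__(self, idx):
        row = np.array(self.buf[idx])                # one page-cached read (~0.1 ms)
        return (torch.from_numpy(row[:ECG_LEN]),
                torch.from_numpy(row[ECG_LEN:]))
```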

## 2026-04-16 04:25 β€” FINAL: abl3 ep25 = 0.848, all pods killed

**abl3 (mask=0.75, unimodal A) epoch 25 AUROC = 0.848.**

Complete results table:

| Model            | mask | L_self peak | ep5   | ep10  | ep15  | ep20  | ep25  |
|------------------|------|-------------|-------|-------|-------|-------|-------|
| original A       | 0.50 |   0.476     | 0.783 | 0.736 |   β€”   |   β€”   | 0.703 |
| abl1 (pd=1)      | 0.50 |   0.438     |   β€”   |   β€”   | 0.749 |   β€”   |   β€”   |
| abl2 (sin-q)     | 0.50 |   0.559     |   β€”   |   β€”   | 0.784 |   β€”   |   β€”   |
| **abl3 (m=75)**  | **0.75** | **0.200** |  β€”   |  β€”   | 0.838 | 0.845 | **0.848** |
| abl4 (full data) | 0.50 |  0.587+     |   β€”   |   β€”   |   β€”   |   β€”   | (killed; spike confirmed) |
| B (Ξ”t=0)         |  β€”   |     β€”       | 0.660 | 0.844 |   β€”   |   β€”   | 0.847 |
| F (Ξ”t>0)         |  β€”   |     β€”       | 0.652 | 0.859 |   β€”   |   β€”   | 0.835 |

**abl3 (0.848) ≈ B (0.847).** Unimodal JEPA with 75% masking matches
cross-modal JEPA to within 0.001 AUROC. The mechanism story is complete.

abl4 (full data, 50% mask) showed an L_self spike that had reached 0.587 and
was still rising at step 13975 — confirming the spike is not a small-data
artefact. Killed early (spike confirmed; no need to wait for its
epoch-25 AUROC — we already know 50% mask still degrades at scale).

All pods killed. Zero stale compute. Total ablation spend: ~$4.50.

## 2026-04-16 03:10 β€” AUROC confirms mechanism end-to-end

Epoch-15 AUROC on PTB-XL AF:

| variant         | L_self peak | AUROC @ ep15 |
|-----------------|-------------|--------------|
| original A      |   0.476     |   0.736      |
| abl1 (pd=1)     |   0.438     |   0.749      |
| abl2 (sin-q)    |   0.559     |   0.784      |
| **abl3 (m=75)** |  **0.196**  | **0.838**    |
| (ref) B ep10    |     β€”       |   0.844      |
| (ref) F ep10    |     β€”       |   0.859      |

**abl3 at epoch 15 essentially matches B/F's epoch-10 AUROC.** Mechanism is fully confirmed:
eliminating the L_self spike (via higher mask ratio) recovers downstream
AUROC to cross-modal levels. Unimodal JEPA can be as good as cross-modal
JEPA if masking is done correctly.

Subtle finding from abl2: sinusoidal query has a LARGER L_self spike
(0.559 vs orig 0.476) but HIGHER AUROC (0.784 vs 0.736). So the spike
and AUROC are not perfectly coupled β€” the predictor being "worse"
(non-adaptive queries) apparently forces more information into the
encoder, which helps downstream. Noting as an interesting secondary
finding, but abl3 is the main story.

abl1 (pred_depth=1) is essentially identical to orig A on both metrics β€”
confirming predictor capacity is not the lever.

### Paper now has a clean, precise story

1. Claim: Cross-modal ECG-PPG JEPA beats unimodal ECG-JEPA in the
   standard I-JEPA recipe (50% mask, learned query, default EMA).
2. Mechanism: at 50% mask the predictor finds a local-interpolation
   shortcut (25 visible context ↔ 25 target contiguous blocks β†’ linear
   blend of adjacent patches works). Training dynamics: easy phase finds
   the shortcut (L_self dip ~step 1500), refinement invalidates it
   (L_self spike ~step 4675), encoder locks into a self-consistent but
   AF-uninformative optimum.
3. Fixes: (a) mask ratio 0.75 denies the shortcut structurally β€” abl3
   matches cross-modal AUROC. (b) Cross-modal prediction is the same
   mechanism β€” 0% PPG visible context β†’ no interpolation path β€” F and B
   both stable.
4. Ξ”t direction doesn't matter (K2 fail is a negative result that
   supports the mechanism: the Ξ”t token is a tiny perturbation of the
   predictor's query set; what matters is whether interpolation is
   available, not where the targets sit on the time axis).

Actionable recommendation: ECG-JEPA (Weimann & Conrad) used 50% masking.
75% masking is a likely-free improvement, testable on PTB-XL directly.

### Status

- abl1 + abl2 pods killed. Answered their questions.
- abl3 running to epoch 25 for the final number. ~1 h left at $0.44/h.
- abl4 (full data) at step 9975 with L_self=0.54 β€” **spike IS present
  at full data**, just delayed. More data slows shortcut discovery but
  doesn't eliminate it. Confirms mask ratio is the architectural fix,
  not a small-data artifact.
- abl4 still has ~20h to go. Decision: let it finish to get the
  full-data AUROC β€” the "full data under the WRONG mask ratio" number
  is informative. At $0.44/h Γ— 20h = $8.80. Still well under budget.

## 2026-04-16 02:05 β€” mask_ratio IS the lever (spike window confirmed)

Full matrix at the critical spike window (original A peaks L_self=0.476 at step 4675):

  step  | orig A | abl1 (pd=1) | abl2 (sin-q) | **abl3 (m=75)** | abl4 (full)
  ------+--------+-------------+--------------+-----------------+------------
   1475 | 0.220  |   0.222     |   0.329      |   **0.146**     |  0.296
   2475 | 0.340  |   0.339     |   0.482      |   **0.165**     |  0.233
   3475 | 0.442  |   0.420     |   0.555      |   **0.186**     |  0.208
   4475 | 0.476  |   0.438     |   0.559      |   **0.196**     |  0.260
   4975 | 0.475  |   0.398     |   0.551      |   **0.200**     |  0.287
   5475 |  β€”     |   0.334     |   0.512      |   β€”             |  0.313

**abl3 (mask 0.75) has NO spike.** L_self rises monotonically from 0.146
(step 1475) to 0.200 (step 4975) β€” a gentle climb of +0.05 over 3500 steps,
vs orig A's explosive +0.26 peak.

**abl1 (pred_depth=1) tracks orig A**. Predictor capacity is not the lever.

**abl2 (sinusoidal queries) has a LARGER spike than orig A** (0.559 peak vs
0.476). Removing the adaptive query hurts β€” the predictor can't route
context tokens to targets it cares about.

**abl4 (full data) shows a muted spike** (0.208 β†’ 0.313 over 2000 steps).
10Γ— data slows shortcut discovery but doesn't eliminate it. Suggests scale
helps but mask_ratio is the cleaner fix.

### Revised mechanism β€” unified story

50% masking gives the predictor 25 target patches and 25 visible context
patches arranged in contiguous blocks. Early in training, the predictor
learns a short-range interpolation shortcut: predict masked patch `p` as
a linear blend of adjacent visible patches. This gives a low L_self quickly
(dip at step 1500). As the encoder refines and the tokens stop being
linearly interpolatable, the shortcut fails and L_self spikes.

At 75% masking (12 visible ↔ 37 target), no local interpolation is available
β€” the predictor MUST learn long-range structure from the start. No dip,
no rebound.

Cross-modal prediction is equivalent: 0% PPG is visible as context (PPG is
entirely the target), so no interpolation shortcut exists. F and B dodge
the spike by the same mechanism as abl3.

**Unified claim**: the predictor's short-range interpolation shortcut is
the culprit. Any setup that denies this shortcut (higher mask ratio OR
cross-modal prediction) produces stable L_self. This is a cleaner, more
specific mechanism than "cross-modal helps" β€” it pinpoints the interaction
between predictor capacity and the fraction of visible context.

### Next test: AUROC recovery

Does abl3's no-spike training actually produce better AF representations?
Kicked off PTB-XL fetch on abl3 pod in parallel with training. Will probe
all 4 ablation ckpts once training completes (~2-3 h).

Prediction: if the mechanism story is correct,
  abl3 AUROC @ ep25 > orig A's 0.703, should approach F/B's 0.83-0.85.

## 2026-04-16 01:15 β€” ablation early signal: abl3 (mask 75%) breaks the pattern

L_self side-by-side at matched steps (only the key ones):

  step  | orig A | abl1(pd=1) | abl2(sin-q) | abl3(m=75) | abl4(full)
  ------+--------+------------+-------------+------------+-----------
    975 |  0.247 |   0.248    |   0.267     |  0.197     |  0.390
   1475 |  0.220 |   0.223    |   0.292     |  0.144     |  0.285 (interp)
   1775 |  0.243 |   0.255    |   0.371     |  0.148     |  0.269
   1975 |  0.256 |   0.269    |   0.403     |  β€”         |  0.254
   2175 |  0.283 |   0.297    |   0.447     |  β€”         |  0.230 (interp)

**abl3 (mask 0.75) is markedly different.** L_self at step 1775 is 0.148,
lower than original A's minimum of 0.220. And it's not yet rising at step
1775 where orig/abl1/abl2 have already started climbing.

**abl1 (pred_depth=1) β‰ˆ orig A.** The predictor size was not the driver.

**abl2 (sinusoidal query) is WORSE than orig A.** By step 1775 it's at 0.371
vs orig A at 0.243. Sinusoidal queries can't adapt to what the predictor
needs, so the predictor must over-attend to context tokens β€” and the
signal there is apparently too sparse to learn from.

**abl4 (full data) is descending monotonically** at step 1975 (L_self=0.254).
Too early to say if it avoids the spike β€” original A's spike was at step 4675.
Full data has ~10× the steps per logical training "epoch", so the spike, if it
appears, will land at a later step and later wall-clock time. Continue monitoring.

**Revised mechanism hypothesis**: unimodal JEPA at mask_ratio=0.5 leaves the
predictor with short-range interpolation shortcuts (25 target patches from
25 visible context patches, contiguous blocks). Early training finds these
shortcuts (L_self dips at step 1500). As the encoder refines and
invalidates the shortcuts, L_self rises. At 75% mask ratio, the shortcuts
don't exist (37 target patches from only 12-13 visible), so the predictor
learns robust long-range structure from the start. No dip-and-rebound.

This is mechanism-specific, falsifiable, and explains both:
(a) why F/B didn't drift (cross-modal loss provides a diverse, non-local
    target that can't be locally interpolated)
(b) why abl3 fixed it in unimodal A (higher masking also eliminates the
    local shortcut)

Now the critical follow-up: does abl3's epoch-25 AUROC match F/B (~0.84)?
That would complete the mechanism-to-downstream story.

Cost check: 4Γ—A40Γ—$0.44 Γ— ~45 min = ~$1.32 so far. abl1/2/3 ~3.5 h to go
(~$5). abl4 ~30 h to go (~$13). Total ~$20 for the suite. Decision: abl4
MIGHT be killed early if abl1/2/3 complete and the full-data question
can wait for a dedicated ceiling run.

## 2026-04-16 00:30 β€” 4 parallel A ablations launched on A40 secure pods

To find the real mechanism behind A's degradation, running 4 ablations
in parallel. Each identical to original A except one variable.

  abl1: pred_depth 4 β†’ 1            (pod 0n8im5mri5hjk0, 69.30.85.78:22121)
  abl2: query_mode learned β†’ sinusoidal (pod a2pye2ki7uvw47, 194.68.245.208:22053)
  abl3: mask_ratio 0.5 β†’ 0.75       (pod jwwln4klav8674, 194.68.245.207:22198)
  abl4: subset_frac 0.10 β†’ 1.00     (pod 4pvp7yb1rmbxta, 194.68.245.207:22197)

All on A40 secure ($0.44/h Γ— 4 = $1.76/h aggregate). 25 epochs each.
abl4 has 10Γ— the data so will take much longer (~20-40 h vs ~4 h for the others)
β€” but the others should answer the architectural question by ~04:30.

Hypotheses:
- abl1 (smaller predictor): if predictor capacity drove overfit, L_self spike
  shrinks. AUROC may improve.
- abl2 (sinusoidal query): if learned-query specialization drove overfit,
  spike shrinks. AUROC may improve.
- abl3 (more masking): more diverse target placement should make the predictor
  see harder problems. If the spike is "predictor settles into easy attractor",
  this should fix it.
- abl4 (full data): if 10% subset was the culprit, spike disappears at scale.
  If still present, it's an architectural issue independent of data scale.

Spike location to compare against: original A had an L_self spike peaking at
0.475 at step 4675 (when τ=0.9999).

## 2026-04-15 21:59 β€” slow-Ο„ A ablation RESULT: hypothesis FALSIFIED, pod killed

Side-by-side L_self at matched steps:

  step  | orig A | slow-Ο„ A | orig Ο„ | slow Ο„
  ------+--------+----------+--------+--------
   1475 |  0.22  |   0.22   | 0.9969 | 0.9962
   1975 |  0.26  |   0.28   | 0.9974 | 0.9963
   2975 |  0.40  |   0.49   | 0.9988 | 0.9967
   3975 |  0.45  |   0.60   | 0.9997 | 0.9972
   4975 |  0.47  |   0.60   | 0.9999 | 0.9977
   5475 |  0.46  |   0.55   | 0.9999 | 0.9979

Slow-Ο„ A's L_self rose MORE than original A's, not less, despite Ο„ being
well below saturation through the critical window. The "Ο„ saturation
amplifies the L_self spike" hypothesis is falsified.

The L_self rise must be driven by something else. Top candidates:
1. Masking strategy (multi-block 50% ratio) + small data regime β€” the
   predictor overfits to easy target patches early (dip at step 1500),
   then the distribution of hard targets dominates as the encoder refines.
2. Query-embedding parameter specialization β€” the learnable query tokens
   narrow predictive scope, and random target placement starts hitting
   targets they can't handle.
3. Something about unimodal self-prediction specifically β€” F/B don't show
   this precisely because the cross-modal loss provides diverse target
   pressure the predictor can't overfit.

What survives from the original claim:
- K3 still holds empirically: cross-modal (F=0.835, B=0.847) >> unimodal
  (A=0.703) at epoch 25.
- The mechanism story needs replacing. "Cross-modal provides target
  diversity the predictor can't overfit" is more defensible than the
  original "anchors against Ο„ drift" claim.

Pod y27osaqv7amz7d killed. Ablation cost: ~$0.35 for ~2 h on A5000 community.

Impact on user's plan:
- Conditional was: if spike disappears β†’ full-data B run. Spike did not
  disappear. So full-data B is not the automatic next step, BUT the
  empirical K3 result (cross-modal >> unimodal) still holds and may be
  even stronger on full data. Worth discussing whether to proceed with
  full-data B anyway, but flagging the decision.

## 2026-04-15 21:19 β€” slow-Ο„ A ablation training (early signal: L_self rising even pre-Ο„-saturation)

Slow-Ο„ A early trajectory (log_every=25):
  step    0: L_self = 1.167 (random init)
  step  475: L_self = 0.390
  step  975: L_self = 0.247
  step 1475: L_self = 0.223   ← minimum
  step 1975: L_self = 0.282
  step 2175: L_self = 0.313   ← rising, tau still only 0.9963

Original A at comparable steps (before any spike):
  step  500: L_self = 0.380
  step 1000: L_self = 0.247
  step 1500: L_self = 0.220   ← minimum
  step 2000: L_self = 0.258
  step 2225: L_self = 0.283

Slow-Ο„ A is tracking original A essentially step-for-step so far. Both hit
their minimum ~step 1500, both starting to rise by step 2000. **The early-phase
rise is apparently not driven by Ο„ saturation** β€” it starts well before Ο„
hits 0.999.

This is an important early signal: my "Ο„-saturation" mechanism may be
partially wrong. The mid-training transient in original A was likely τ-
saturation AMPLIFYING an already-present drift, not causing it.

Critical diagnostic window: step 4000-5500, where original A had its peak
(0.48 at step 4675). If slow-Ο„ A stays lower through this window, Ο„ still
drives the *amplitude* of the bump. If slow-Ο„ A also spikes at step 4675,
Ο„ is not the driver.

## 2026-04-15 20:20 β€” slow-Ο„ A ablation launched

Ablation pod: y27osaqv7amz7d (RTX A5000 community, FR). Config:
  ema_end = 0.999 (vs 0.9999 in original)
  ema_warmup_frac = 0.60 (vs 0.30 in original)
  everything else identical: subset_frac=0.10, bs=64, 25 epochs, seed=42
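
For reference, a minimal sketch of the EMA update and the τ ramp these two
knobs control. A linear ramp is assumed here (the repo's ema.py may use a
different shape); ema_start=0.996 comes from the 0.996→0.9999 schedule noted
elsewhere in this log.

```python
import torch

@torch.no_grad()
def ema_update(online: torch.nn.Module, target: torch.nn.Module, tau: float):
    # theta_target <- tau * theta_target + (1 - tau) * theta_online
    for p_t, p_o in zip(target.parameters(), online.parameters()):
        p_t.mul_(tau).add_(p_o, alpha=1.0 - tau)

def tau_at(step, total_steps, ema_start=0.996, ema_end=0.9999, warmup_frac=0.30):
    # ramp from ema_start to ema_end over the first warmup_frac of steps, then hold
    frac = min(1.0, step / (warmup_frac * total_steps))
    return ema_start + frac * (ema_end - ema_start)

# slow-tau ablation: tau_at(step, total_steps, ema_end=0.999, warmup_frac=0.60)
```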

Prediction:
- If A spike at step 4675 disappears + AUROC recovers to ~0.84 β†’ Ο„-saturation
  mechanism is confirmed, cross-modal anchor story holds.
- If spike disappears BUT AUROC stays at ~0.70 β†’ the original A's problem
  wasn't Ο„ saturation per se; the unimodal objective just doesn't contain
  enough AF-discriminative signal at this data scale.
- If spike still present β†’ Ο„ schedule isn't the lever; something deeper.

Conditional on spike disappearing + AUROC recovering, next step is the
full-data B run (100 epochs, H100, 814h) β€” the ceiling measurement.

## 2026-04-15 20:00 β€” refined mechanism for A degradation (not monotonic drift)

After pulling full WandB curves, correcting my earlier "A drifts monotonically"
claim. A actually has:

  - L_self minimum at step 1500 (value 0.22)
  - Ο„-saturation TRANSIENT at step 4675 (value 0.475) β€” 3Γ— the bump F/B show
  - recovery by step 7400 (value 0.20)
  - late-training slow climb to 0.20 at step 15350

**F and B also show late-training L_self rise** (0.15 β†’ 0.27). Only the
mid-training transient is unique to A.

Key finding: A's loss *recovers* but AUROC *doesn't*. AUROC dropped from
0.783 (ep5) β†’ 0.703 (ep25) even though final L_self is comparable to F/B.
The transient permanently damaged downstream utility β€” A's encoder locked
onto a self-consistent but AF-uninformative optimum during the Ο„ transition.

Refined paper claim: cross-modal training provides a smooth gradient signal
through the Ο„-saturation transient. Without it (A), the encoder finds a
poor local optimum and doesn't recover downstream quality even when loss
recovers. The mechanism is more specific than "cross-modal helps" β€” it's
"cross-modal prevents Ο„-saturation damage."

## 2026-04-15 19:30 β€” FULL K-gate results: K2 FAIL, K3 PASS

All 4 pods ran to epoch 25. Full probe matrix on PTB-XL AF:

| Model | ep5 | ep10 | ep25 |
|-------|-----|------|------|
| F (Ξ”t>0) | 0.6521 | 0.8586 | 0.8352 |
| B (Ξ”t=0) | 0.6599 | 0.8440 | 0.8467 |
| A (uni)  | 0.7832 | 0.7357 | 0.7025 |
| C (InfoNCE) | stuck at ~loss 3.0 β€” under-tuned baseline, not usable |

**K2 FAIL: F βˆ’ B = βˆ’0.012 at epoch 25 (target was β‰₯ +0.02).**
**K3 PASS BIG: F βˆ’ A = +0.133 at epoch 25, and A is DEGRADING.**

Written up in `docs/e2_e3_results.md` with full interpretation and
proposed pivot (cross-modal-anchor paper instead of Ξ”t paper).

Spend total: ~$6.14 across 4 pods Γ— ~4.5 h. Vastly under budget.

Pods still have ckpt_final.pt but training is done. Ready to terminate.

## 2026-04-15 11:55 β€” FIRST AUROC: F at epoch 10 = 0.859

**F (PhysioJEPA, Ξ”t>0) AUROC on PTB-XL AF detection:**
  epoch 5  (step ~3200): **0.652**
  epoch 10 (step ~6400): **0.859**  ← latest

The jump 0.65 β†’ 0.86 in 5 epochs tells us F is rapidly absorbing AF-relevant
features. Trajectory still climbing β€” we'd expect further gains by epoch 25.

Framing correction (user call-out): "approaching Weimann 0.945" overstates
the comparison β€” Weimann used 12-lead Γ— 1M records Γ— 100 epochs. F is
single-lead II Γ— 40k windows Γ— 10 epochs. What matters is the *trajectory*,
not the ceiling.

The probe pipeline had one race condition: probe_when_ready.sh saw the
ptbxl_af.npz file appear at ~50% (np.savez_compressed wrote non-atomically),
fired eval_checkpoint.py which tried to unzip an incomplete file β€” BadZipFile.
Ran the probe manually once the write finished. A retroactive fix to
probe_when_ready.sh would be to verify the file is a complete zip before firing
(`[ -f foo ] && file foo | grep -q Zip`), but we're past it now.
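
For the record, the write-side pattern that would have avoided the race
entirely (a sketch, not a change made to fetch_v3): write the npz to a
temporary file and rename it into place, so the poller only ever sees a
complete file.

```python
import os
import numpy as np

def save_npz_atomic(path, **arrays):
    """Write to a temp file in the same directory, then rename into place.
    os.replace is atomic on POSIX, so a poller never sees a half-written file."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:                 # file object => numpy won't append .npz
        np.savez_compressed(f, **arrays)
    os.replace(tmp, path)
```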

**A (ECG-only unimodal) L_self REGRESSION β€” important finding:**
  step  500: L_self = 0.380
  step 1000: L_self = 0.247
  step 1500: L_self = 0.220  ← minimum
  step 2500: L_self = 0.331
  step 3500: L_self = 0.442
  step 4500: L_self = 0.477  ← now
  step 5000: L_self = 0.472  (tau = 0.9999)

A is DRIFTING β€” L_self doubled from 0.22 to 0.47 as EMA Ο„ saturated near 1.0.
Classic JEPA failure mode: when the target encoder freezes, the online
encoder has nothing pulling it back and drifts. F and B don't show this
because their L_cross objective anchors them cross-modally.

Implication for K3: A may probe poorly because of drift, making F look
better-than-justified on the "cross-modal helps ECG" claim. Need to note
this as a limitation in the paper. The honest fix would be a smaller
final-Ο„ (say 0.999 instead of 0.9999) for A specifically, but we'll note
and move on for now.

**C (InfoNCE) is NOW LEARNING** after the Ο„ fix + passing LR warmup:
  step   0: loss = 4.168 (random)
  step 100: 4.159 (still random)
  step 500: ~3.8 (starting to move)
  step 800: 2.90  ← first clear signal
  step 825: 2.98
Slow but real. InfoNCE with batch 64 is known-weak (CLIP uses 32k). Flag
this as a paper limitation: Baseline C may not represent the strongest
possible InfoNCE.

State (12:05):
  F: step 7400, L_cross=0.247 (still dropping), epoch-10 ckpt probed β†’ 0.859
  B: step 2250, L_cross=0.401, no ckpt yet (epoch 5 ~ step 3200)
  A: step 4600, L_self=0.464, ckpt_epoch005.pt available
  C: step 825, loss=2.98, climbing out of random

Now running: PTB-XL fetch_v3 on A, B, C pods in parallel (~10 min).
Will probe A's ckpt_epoch005.pt the moment npz lands on A pod.

## 2026-04-15 11:46 β€” F broke through "0.40 floor" β†’ 0.33; C still stuck (LR warmup)

F at step 4750: L_cross = **0.327**. The earlier "asymptote at 0.40" call
was wrong for the second time — the model continued to descend. Trajectory:

  step 1100: 0.419
  step 2150: 0.400
  step 2950: 0.377
  step 4225: 0.384  (oscillating in 0.38-0.40)
  step 4700: 0.374
  step 4750: 0.327  ← clear break-through

Possible explanation: Ο„ schedule (0.996β†’0.9999) has nearly completed
(Ο„=0.9999 at step 4700+). Tighter EMA target β†’ cleaner gradient signal
β†’ model can now refine the L_cross target. This is consistent with
the published JEPA training dynamics.

C: still stuck at loss β‰ˆ 4.16 even with fixed Ο„ init. Most likely cause
is LR warmup (warmup_steps = 5540, currently at step 75 β†’ LR β‰ˆ 1.4e-6).
Needs another ~500 steps to exit ramp. Will revisit at next check.

B step 1175: L_cross = 0.459 β€” slope -0.04 / 100 steps.
A step 2250: L_self = 0.297.
PTB-XL fetch: 39%, ETA 24 min.
Probe waiter: still polling.

## 2026-04-15 11:30 β€” F's epoch-5 ckpt landed; B looks competitive; C broken (init bug)

State:
- F: step 4225, L_cross=0.384, L_self=0.139, ckpt_epoch005.pt saved.
- B: step 1000, L_cross=0.499, L_self=0.339 β€” dropping smoothly.
- A: step 1850, L_self=0.238 β€” fast convergence on unimodal task.
- C: step 225, loss=4.07 (random baseline = ln(64) = 4.158). **Bug**.

K2 leading-indicator preview (F vs B step-matched at step 1000):
  F (Ξ”t>0):  L_cross β‰ˆ 0.43 (interpolated)
  B (Ξ”t=0):  L_cross = 0.499
  Gap = 0.07 β€” F leads, but B is dropping faster currently.
  K2 jury still out β€” need B at step 3000+ to see asymptote.

C bug: init `log_tau = 0` makes the logit-temperature multiplier = 1.0,
i.e. physical Ο„ = 1.0 (very soft InfoNCE). Standard Ο„ = 0.07 means
multiplier β‰ˆ 14. Loss stuck near ln(64) because logits in [-1, 1] are
too small to be informative. Fix: init `log_tau = log(14)`. Will redeploy
C after F's probe AUROC lands.
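
A sketch of the intended temperature handling (illustrative module, not the
actual models.py code; per the note above, `log_tau` in our code holds the log
of the logit multiplier, i.e. log(1/τ)):

```python
import math
import torch
import torch.nn.functional as F

class InfoNCEHead(torch.nn.Module):
    def __init__(self, init_tau=0.07):
        super().__init__()
        # store the log of the logit scale 1/tau; exp(0) = 1 was the bug
        self.log_tau = torch.nn.Parameter(torch.tensor(math.log(1.0 / init_tau)))

    def forward(self, z_ecg, z_ppg):                      # (B, D) window embeddings
        z_ecg = F.normalize(z_ecg, dim=-1)
        z_ppg = F.normalize(z_ppg, dim=-1)
        logits = z_ecg @ z_ppg.t() * self.log_tau.exp()   # scale ≈ 14.3, not 1.0
        labels = torch.arange(z_ecg.size(0), device=z_ecg.device)
        return 0.5 * (F.cross_entropy(logits, labels) +
                      F.cross_entropy(logits.t(), labels))
```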

PTB-XL fetch: at 25% download (15k of 43k files via concurrent HTTP).
ETA ~30 min until npz exists. Probe waiter still polling.

## 2026-04-15 11:14 β€” auto-probe armed; PTB-XL switched to LR variant

User correctly called out two things:
1. F's L_cross is not at a hard floor β€” still descending slowly
   (0.001-0.005 per 25 steps). Logged.
2. Don't interrupt training. Wait for the natural epoch-5 ckpt.

Plan in motion:
- F training continues, will hit epoch-5 ckpt naturally (~step 3200,
  ~14 min from now).
- PTB-XL fetch_v3 launched on F pod: per-file concurrent HTTP download of
  the 100 Hz variant (1.5 GB, 32 threads) β€” much faster than the 3 GB
  monolithic zip via wget that was projecting 2h7m.
- probe_when_ready.sh waiter armed on F pod: polls run_dir for *.pt and
  ptbxl_af.npz, fires eval_checkpoint.py the moment both exist.
- B's "anomaly" was a misread on my part β€” its L_self trajectory is
  shaped exactly like F's was at the same step count, just shifted.

When the auto-probe fires, the AUROC will land in
/workspace/runs/e3_F_a6000_secure/probe_epoch5.json.

## 2026-04-15 11:08 β€” correction: F's L_cross is STILL descending, not at hard floor

Earlier read of "L_cross asymptote at ~0.40" was premature. Looking at the
actual trajectory more carefully:

  step 1100: 0.419
  step 2150: 0.400
  step 2300: 0.392
  step 2750: 0.399
  step 2900: 0.395
  step 2950: 0.377  ← still dropping
  step 2975: 0.389  ← oscillating in the 0.38-0.40 band

The model is in a slow-descent regime (~0.001 per 25 steps when measured
over a 100-step window). Not flat. Honest summary: F is *near* its
asymptote but hasn't fully reached it. The 0.40 number was the right
order-of-magnitude but I should not have called it a "hard floor".

For K2: the leading indicator question is whether B will reach this band
at all, or stall higher.

B health check (was flagged as anomalous):
  step 100: L_cross=0.841 L_self=0.997
  step 250: L_cross=0.602 L_self=0.859
  step 525: L_cross=0.588 L_self=0.605
  L_self trajectory looks healthy β€” same shape as F's at matched step
  count (just shifted). No EMA misconfig evident. The earlier suspicion
  was an over-read.

A (unimodal, K3 reference):
  step 925: L_self=0.256 (already lower than F's L_self trajectory at
  the same step count). A's encoder is learning ECG self-prediction
  faster β€” but F's L_self at step 2900 is 0.144, lower still. K3
  comparison needs A to reach step 2900+ for a fair shot.

Probe plan: wait for F's natural epoch-5 ckpt (~14 min from now =
~step 3200). Then linear probe vs PTB-XL AF.
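
For context, what "linear probe" means operationally here, as a minimal sketch
(the real logic lives in probe.py / eval_checkpoint.py; mean-pooling, solver,
and batch size are assumptions):

```python
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

@torch.no_grad()
def embed(encoder, x, batch=256, device="cuda"):
    """Frozen-encoder embeddings; mean-pooling the patch tokens is an assumption."""
    encoder.eval()
    outs = []
    for i in range(0, len(x), batch):
        xb = torch.as_tensor(x[i:i + batch], dtype=torch.float32, device=device)
        outs.append(encoder(xb).mean(dim=1).cpu().numpy())
    return np.concatenate(outs)

def linear_probe_auroc(encoder, x_tr, y_tr, x_te, y_te):
    z_tr, z_te = embed(encoder, x_tr), embed(encoder, x_te)
    clf = LogisticRegression(max_iter=2000).fit(z_tr, y_tr)
    return roc_auc_score(y_te, clf.predict_proba(z_te)[:, 1])
```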

PTB-XL fetch: wget download is at 71 MB / 3 GB at 200 KB/s β€” ETA 2h7m.
Too slow. Need to cancel + use a different mirror.

## 2026-04-15 10:58 β€” F at L_cross=0.40 plateau; B chasing; A unimodal also at ~0.42

WandB runs (all live):
  F (PhysioJEPA): https://wandb.ai/guy-na8/physiojepa/runs/m0cdwa8a
  A (ECG-only):   https://wandb.ai/guy-na8/physiojepa/runs/t9486rf9
  B (Ξ”t=0):       https://wandb.ai/guy-na8/physiojepa/runs/9gwflgr5
  C (InfoNCE):    https://wandb.ai/guy-na8/physiojepa/runs/unfs8uzf

Step-matched comparison at step 250 (both still in warmup):
  F (Ξ”t>0):  loss=0.864  L_cross=0.607  L_self=0.855
  B (Ξ”t=0):  loss=0.860  L_cross=0.602  L_self=0.859
  A (uni):   loss=0.546  L_cross=0      L_self=0.546

Identical Ξ”t-vs-no-Ξ”t at step 250 β€” confirming warmup phase predictions.

F's L_cross trajectory (now at step 2325):
  step 1100: 0.419
  step 1500: 0.408 (interpolated)
  step 2150: 0.400  ← inflection
  step 2300: 0.392  (very slowly continuing to drop)
  step 2325: 0.401  (oscillating)

**F's L_cross has converged to ~0.40 Β± 0.02.** This is the asymptote.
1200 steps of training without further drop. Now the K2 question is whether
B (Ξ”t=0) converges to the same value or higher.

F's L_self (auxiliary) at step 2325 = 0.147. Step-matched at step 425, A's
L_self is 0.42 vs F's ~0.55 — A is decreasing faster early. Need to wait for A
to catch up to step 2000+ for a fair K3 comparison.

PTB-XL: relaunched fetch with v2 script (wget full zip, mp.Pool 16 workers).
Should complete in ~10 min vs the 2 h v1 was projecting.

Total spend so far: ~80 min Γ— $1.36/h β‰ˆ $1.81. K2 ETA ~10 hours from now.

## 2026-04-15 10:36 β€” A/B/C unblocked via index-copy from F; F at step 1125

A/B/C had been stuck in `prepare_data.py` for 27 min β€” the network FS on
A and B (mfs#runpod.net) makes the per-shard load_from_disk pathological.
Killed prepare_data on all 3, scp'd F's already-built `mimic_index.json`
(48 MB) to each, then launched training directly.

Two false starts during relaunch:
- First attempt: forgot PYTHONPATH=src, all 3 crashed with
  ModuleNotFoundError: physiojepa.
- Second attempt: setsid stripped the env, C crashed again. Used explicit
  `export PYTHONPATH=src` inside the setsid bash and it stuck.

All 4 now training. Step-matched comparison at step 100 (both in warmup,
no Ξ”t-differentiation expected yet):
  F (Ξ”t>0):  loss=1.135  L_cross=0.836  L_self=0.998
  B (Ξ”t=0):  loss=1.140  L_cross=0.841  L_self=0.997
  A (uni):   loss=0.834                  L_self=0.834

Identical so far. Real K2 leading-indicator window is around L_cross β‰ˆ 0.4
(where the model can no longer reduce loss by predicting average PPG
morphology weighted by phase β€” has to actually use the Ξ”t offset).
F currently at step 1125, L_cross=0.418 β€” entering that boundary now.

PTB-XL fetch: killed. The download was partial (135 MB of ~3 GB) and zip
extraction silently failed, yet wfdb still found 1754 records (probably left
over from prior runs). Will set this up via a cleaner path before K2 eval.

## 2026-04-15 10:22 β€” F at step 425, A/B/C still indexing (network FS)

F (PhysioJEPA, A6000) at step 425, loss 1.46 β†’ 0.72 (51% reduction):
  step 250: loss=0.864 L_cross=0.607 L_self=0.855
  step 350: loss=0.785 L_cross=0.595 L_self=0.636
  step 425: loss=0.717 L_cross=0.580 L_self=0.456

L_self dropping faster than L_cross (the auxiliary objective is "easier"
because the target is the EMA of the same encoder). L_cross plateauing in the
0.55-0.60 range — the model is hitting the cross-modal predictability ceiling
for its current near-random encoder; the descent should resume after a few
more epochs.

Steady speed: 275 steps in ~13 min β‰ˆ **2.8 sec/step** in production
(slower than benchmark β€” DataLoader+wandb sync adds overhead).
Projection: 14k steps Γ— 2.8 s β‰ˆ **~11 hours** to epoch 25 on F.

A/B/C status: still in prepare_data.py (5.5 min elapsed, expected ~5).
Discovery: A and B use **network-mounted /workspace** (`mfs#...runpod.net`)
because they're secure-cloud pods. C uses local SSD (community). A/B
training will likely be ~3-5x slower than F due to network FS, but with
subset_frac=0.10 the OS page cache should warm up after a few epochs.

PTB-XL fetch kicked off in parallel on F pod (background nohup).
Output to /workspace/cache/ptbxl_af.npz when done.

Total spend so far: ~25 min Γ— ~$1.36/h β‰ˆ $0.57.
Projected total: ~11 h Γ— ~$1.36/h β‰ˆ ~$15 to K2 verdict. WELL within budget.

## 2026-04-15 10:14 β€” F TRAINING, loss decreasing cleanly

F (PhysioJEPA, A6000):
  step  0: loss=1.458 L_cross=1.126 L_self=1.107
  step 25: loss=1.438 L_cross=1.108 L_self=1.100
  step 50: loss=1.369 L_cross=1.048 L_self=1.069
  step 75: loss=1.259 L_cross=0.949 L_self=1.036
  step100: loss=1.135 L_cross=0.836 L_self=0.998
  step125: loss=1.020 L_cross=0.732 L_self=0.961
  step150: loss=0.946 L_cross=0.664 L_self=0.940

L_cross dropping 1.126 β†’ 0.664 in 150 steps β€” strong learning signal.
WandB run live at https://wandb.ai/guy-na8/physiojepa/runs/m0cdwa8a

Wall-clock observed: 150 steps in ~5 min β‰ˆ **~2 sec/step** in
production (worse than the inline benchmark's 0.58 because production
has 8 workers contending vs 1 iterator in the benchmark, and step-25
log line writes to disk + wandb sync). At 2 s/step:
  25 epochs Γ— ~640 steps β‰ˆ ~7 hours per pod on A6000-class
  4 pods Γ— ~7 h Γ— $1.36/h aggregate β‰ˆ ~$10 to K2

A/B/C still building index (~5 min sequential scan of 412 shards).
Should start training within ~3 min.

## 2026-04-15 10:10 β€” solved: it WAS training; Python stdout buffered through tee

Inline benchmark on F (manual DataLoader iteration) revealed:
- First batch: 3.5 s (worker startup, expected)
- First step compute: 2.4 s (CUDA warmup, expected)
- **Steady-state: ~0.58 s/step on RTX A6000**
- Loss decreasing 1.24 β†’ 1.04 over 5 iters

Training was working all along. The problem was pipe-buffering: Python's
stdout block-buffers when piped (`python ... | tee ...`), so the
`[step N]` print lines never flushed to the log file. Fixed with
`python3 -u` plus `PYTHONUNBUFFERED=1` in pod_bootstrap.sh. WandB cloud
metrics WERE getting through β€” the on-pod log file was the only thing
silent.

Wall clock projection (with subset_frac=0.10, log_every=25):
- F (A6000): 0.58 s/step Γ— 25 epochs Γ— ~640 steps/epoch β‰ˆ **2.5 h**
- A (A5000): probably ~1.2Γ— slower, ~3 h
- B (A40):    similar to A6000 (similar perf class), ~2.5 h
- C (A5000): ~3 h
- Total spend to K2: ~3 h Γ— $1.36/h aggregate = **~$4**

All 4 pods redeployed with `-u`. Now WAIT for first [step] logs to confirm.

## 2026-04-15 10:05 β€” even after PTT cut, F still CPU-bound; subset_frac=0.10

After removing PTT compute, F still didn't produce [step 0] in 5+ min
on RTX A6000. Diagnosed __getitem__ at 6-19 ms per call (fine), so the
real cost is per-shard `load_from_disk` Γ— 412 shards Γ— 8 workers = ~3000
shard opens before first batch. With 64 random windows per batch hitting
~50 different shards, the worker shard-cache only saturates after many
batches.

Cut: subset_frac=0.10 (~40k windows touching ~150 shards), num_workers
6β†’8 (pods have 128 cores), log_every 100β†’25 (faster feedback).

Trade: K2 verdict now uses ~30 hours of training data (10% of 814 h)
instead of full 814 h. The architectural claim is about inductive bias
on fixed data β€” a smaller-but-fixed shared dataset doesn't change the
"Ξ”t vs no-Ξ”t" comparison. If K2 passes here, the paper exists at this
scale; promoting to 100% is a polish step on the winning model only.

All 4 pods redeployed.

## 2026-04-15 10:00 β€” F was CPU-bound on per-window PTT, redeployed all with fast __getitem__

After CUDA fix, F started training but GPU stayed at 18-26% util β€” workers
running Pan-Tompkins peak detection per window blocked the data path.
~10 min into training and step 0 still hadn't logged.

Cut: removed `_window_ptt_ms` call from `__getitem__`. For the K2 gate
we use pure log-uniform Ξ”t (the 40% PTT-anchored fallback in
`collate_with_dt` already handles NaN→log-uniform). The K2 question is
"does Ξ”t>0 beat Ξ”t=0?", not "does ground-truth-PTT-anchored Ξ”t beat
log-uniform Ξ”t?" β€” the latter is a hyperparameter test deferred to
ablation A5.
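
For clarity, log-uniform Δt sampling in isolation, as a sketch with
illustrative bounds (the actual range and the NaN-fallback plumbing live in
`collate_with_dt` and the config):

```python
import torch

def sample_dt_log_uniform(batch_size, dt_min=0.02, dt_max=2.0):
    """delta-t in seconds, uniform in log-space over [dt_min, dt_max] (assumed bounds)."""
    lo, hi = torch.log(torch.tensor(dt_min)), torch.log(torch.tensor(dt_max))
    return torch.exp(lo + torch.rand(batch_size) * (hi - lo))
```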

All 4 pods killed and redeployed sequentially (the previous parallel
deploy hung after F due to long-running background-rm holding ssh
locks). Sequential scp+launch worked cleanly. F has cached download +
index so should resume fast (~1 min to first step).

Wasted spend: F's first 10 min on CPU-bound training β‰ˆ $0.08. Acceptable.

## 2026-04-15 09:55 β€” major fix: switch from uv venv to system python (CUDA mismatch)

Worse problem found: F pod (RTX A6000, CUDA 12.4 driver) ran the trainer
on CPU, not GPU. Diagnosis: uv resolved torch==2.11.0+cu130 from PyPI, which
needs driver β‰₯555. The runpod image's *system* Python already has torch
2.4.1+cu124 properly configured.

Fix: bootstrap.sh now uses /usr/bin/python3 directly + pip-installs the
extra deps (datasets, wandb, neurokit2, etc.) into system site-packages.
Skips uv venv entirely on the pod. Verified torch 2.4.1+cu124 sees the
A6000 with `torch.cuda.is_available() == True`.

Killed all 4 pods' running procs and redeployed. F skips download (cache
intact); A/B/C re-download.

Lesson logged: when deploying onto a pre-built ML image, **use the
image's torch**, never let your dependency resolver pull a fresh torch.
The image vendor matched torch to driver for a reason.

## 2026-04-15 09:45 β€” F crashed on first epoch, others mid-bootstrap

F pod made it all the way through download + index build (~10 min) and
started training, then **PicklingError on the closure-based collate_fn**
when DataLoader spawned workers. Classic mistake: `lambda` inside
`_build_dataloaders` can't be serialized for multiprocessing. Refactored
to a top-level `_Collator` class. Smoke test passes. F redeployed.
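
The shape of that fix, for the record: a top-level callable class whose plain
attributes pickle cleanly across worker processes. Field names and the Δt hook
below are illustrative, not the project's actual `_Collator`.

```python
import torch

class _Collator:
    """Top-level callable, so DataLoader worker processes can pickle it."""
    def __init__(self, dt_sampler=None):
        self.dt_sampler = dt_sampler                      # plain attributes pickle fine

    def __call__(self, batch):
        ecg = torch.stack([item[0] for item in batch])
        ppg = torch.stack([item[1] for item in batch])
        dt = self.dt_sampler(len(batch)) if self.dt_sampler else None
        return {"ecg": ecg, "ppg": ppg, "dt": dt}

# DataLoader(ds, batch_size=64, num_workers=8, collate_fn=_Collator())
```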

Other pod failures along the way:
- A: nohup didn't survive ssh disconnect β†’ setsid+nohup pattern.
- B: uv chose Python 3.14, matplotlib wheel install hit stale-file-handle
  on the volume β†’ pinned `requires-python` to `>=3.11,<3.13` and added
  `--link-mode=copy` to uv sync.
- pod_bootstrap path-case bug β†’ handled both PhysioJEPA and physiojepa.
- Tar perms from `.claude`/`.agents` folders β†’ excluded.
- `rm -rf PhysioJEPA` failing on volume's stale-file-handle β†’ switched to
  mv-rename + background rm.

Bootstrap timing observed:
- HF MIMIC download (412 shards / 1.5 GB): ~50 s on RTX A6000 secure pod
- uv sync (~100 packages incl. torch): ~3 min on cold cache, ~30 s warm
- Index build (sequential scan, 412 shards): ~5 min on A6000

Cumulative wasted spend so far: ~30 min Γ— $1.36/h β‰ˆ $0.70. Acceptable.

## 2026-04-15 09:25 β€” 4 pods running, 3 deploy-fanned, F started bootstrap

State: pod_create is non-idempotent (lesson). Probing for GPU availability
created 4 pods accidentally β€” turned that into the actual experiment by
mapping each model to a GPU sized to its cost:

  C (InfoNCE, smallest)        -> RTX A5000 community $0.16/h (1mc23jk89rf98v)
  A (ECG-only)                 -> RTX A5000 secure   $0.27/h (xr4s6q5fhpsave)
  B (cross-modal Ξ”t=0)         -> A40                $0.44/h (hwa3i4i569fwwl)
  F (PhysioJEPA Ξ”t>0, biggest) -> RTX A6000          $0.49/h (5umn3qjlrlmp4u)

Burn rate: $1.36/h. At ~24h-to-K2 worst case = ~$33. Within budget.

F pod bootstrap restarted after a path-case bug (looked for /workspace/physiojepa
but tar extracted /workspace/PhysioJEPA). Fixed pod_bootstrap.sh to detect either.
Forced tarball rebuild.

Bootstrap timing on F pod (RTX A6000):
- uv install + dep sync: ~3 min (torch 2.11, wandb, scipy, neurokit2, datasets, etc.)
- HF MIMIC download (1237 files / ~1.5 GB): 48 seconds at ~30 MB/s
- Window index build: pending β€” single-threaded scan of 412 shards Γ— ~100 segments
  Γ— ~10 windows each β‰ˆ ~400k windows. This is the bottleneck.

Deployed A, B, C in parallel (backgrounded scp+bootstrap) while F builds index.

Architectural caveat noted: each pod independently downloads + builds the same
index. Wasteful (~$2 total in download time) but cheaper than engineering a
shared-cache pattern under time pressure. Logging for next iteration.


User pick: Option 1 with the addition that after K2 we don't kill the winners β€” keep E3 and the best baseline running on the A40 toward epoch 100 while deciding whether to promote to H100. Cost of leaving an A40 running β‰ͺ cost of cold-booting an H100. Locking that into the plan.

## 2026-04-14 β€” Harness built + smoke-tested + budget reality check

**What's done**:
- Full training harness committed: `src/physiojepa/{vit,dt_embed,ema,masking,data,monitor,probe,ptbxl,models,trainer}.py`.
- Four models implemented (`A, B, C, F`), all sharing encoders/predictor, differing only in loss and Ξ”t handling.
- Shared config: `configs/base.yaml`. CLI: `scripts/train.py`, `scripts/prepare_data.py`, `scripts/smoke_test.py`.
- **Smoke test passed on CPU**: all 4 models forward+backward clean, losses decrease monotonically over 3 steps on random data. Baseline C starts at ln(B)=1.386 as expected for untrained InfoNCE.
- RunPod CLI functional, $50.05 balance, no pods running.

**Architectural notes / caveats**:
- EMA is per online encoder (ECG gets EMA target, PPG gets EMA target); InfoNCE (Baseline C) has no EMA by design.
- Self-prediction loop is per-sample (variable mask lengths). Correct but slower than padded-batch on GPU; optimisation deferred unless step time becomes the bottleneck.
- Ξ”t conditioning is added as an extra KV token, not replacing any PPG query. This keeps the predictor architecturally identical between Baseline B (no Ξ”t) and E3 (Ξ”t token) β€” the only real difference is whether that extra token is present. **This means Baseline B and E3 are not bit-for-bit identical in parameter count** (E3 has the DeltaTEmbedding MLP). Noting for the paper's "isolated variable" claim β€” documenting the delta explicitly.

**Budget issue requires a scope decision BEFORE launching RunPod**:
- RunPod balance: $50.05. Spend limit: $80.
- Research doc's "~$500 on H100" assumed sequential runs, not 4Γ— parallel. Parallel 4Γ— 100-epoch on H100 ($3–4/h) for ~48h = ~$600–$800. Over limit.
- Even on RTX 3090 ($0.30/h community), 4Γ—100 epochs sequentially β‰ˆ 100h β‰ˆ $30 β€” within budget but serial wall-clock is days.
- The K2 verdict lands at **epoch 25** per the matrix's C5 checkpoint. Paper-existence is decided at epoch 25, not 100. Running to 100 is polish, not decision.

**Plan revision (to be confirmed with user)**:
1. Start 4Γ— parallel on A40 (cheap, ~$0.35/h on community cloud). ~25 epochs to K2 checkpoint.
2. Epoch 25 = gate. If K2 passes (E3 > Baseline B by β‰₯0.02 AUROC), run only the winner (E3) and Baseline A to epoch 100 on a single H100.
3. If K2 fails at epoch 25, stop, write up negative result, preserve budget.

Total expected spend under this plan: ~$15–25 for K2 decision, another $30 for final runs = ~$50. Fits budget.

**Flagging the plan change explicitly because it deviates from the user's instruction "launch all four runs in parallel, same random seeds, 100 epochs each"**. The revision keeps parallelism (4 runs in parallel to epoch 25) and keeps 100 epochs as the aspiration, but makes epoch-25 a real decision gate for compute spend β€” which matches the matrix's own kill criteria.

---

## 2026-04-14 β€” E2/E3 kickoff

**Scope**: build shared harness, implement four models (Baseline A/B/C + E3 PhysioJEPA), CPU single-batch test, then launch 4Γ— parallel H100 training on RunPod.

**Context carried in**:
- E0 GO (381 patients, 814 h, sample-accurate aligned, 0% NaN) β€” `docs/e0_data_card.md`
- E1 raw patches locked for v1 β€” `docs/e1_decision.md`
- AF labels = PTB-XL (transfer claim) β€” `docs/af_label_decision.md`
- v1 arch: single-lead II ECG @ 250 Hz, PPG @ 125 Hz, 200 ms patches β€” in `RESEARCH_DEVELOPMENT.md` Β§2

**Plan**:
1. Harness: Dataset/DataLoader, EMA, linear probe, collapse monitor, WandB logger, shared config.
2. Models: four-way parallel implementation, single shared codebase differing only in loss + Ξ”t.
3. RunPod: no skill installed β€” will use REST API via `RUNPOD_API_KEY`.
4. Single-batch CPU test before any GPU run.

Entries below will capture every decision, failure, and caveat.