# PhysioJEPA research log

*Running narrative — newest entries at top.*

Format: each entry is `## YYYY-MM-DD HH:MM — [PHASE] — topic` followed by a
bullet list of what was done, what was found, and any decisions/caveats.

---

## 2026-04-16 09:35 — definitive run: all 3 pods bootstrapping

All 3 definitive-run pods deployed:

F: H100 PCIe secure ($2.39/h) @ 216.81.245.97:18654 — still in index build
A: A100 SXM comm ($1.39/h) @ 216.249.100.66:20011 — in precompute (454k windows)
B: A100 SXM secure ($1.49/h) @ 154.54.102.26:17999 — just started pip install

Config: 100 epochs, full data (subset_frac=1.0 via fast_cache_dir mmap),
mask_ratio=0.75, batch_size=64, seed=42, num_workers=12.

Aggregate: $5.27/h. Balance: $118.90. Projected cost at 20 h: ~$105.

Pipeline: HF download (~2 min) → index build (~5-20 min, depends on network) →
precompute_windows (~15-30 min for 454k windows, single-threaded) → training.

A is furthest along (precompute started). F is behind (slower download).
B just started. First [step 0] log line expected from A in ~30 min.
## 2026-04-16 04:40 — full-scale run scoping: need data pipeline optimization first

User requested 3× H100, full data, 100 epochs, mask=0.75. Budget check:
- Balance: $118.90. H100 PCIe community: $1.99/h × 3 = $5.97/h.
- Steps: ~6160/epoch × 100 = 616k per run.
- sec/step on A40 was 2.8 (production) vs 0.58 (benchmark). Even on H100
  with a faster CPU, realistic production sec/step is ~1.0-1.5.
- At 1.2 sec/step: 616k × 1.2 / 3600 = 205h per run × 3 runs × $2/h = $1230.
  WAY over budget.
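The arithmetic above can be checked in a few lines (all inputs are this entry's own figures; `sec_per_step=1.2` and `usd_per_hour=2.0` are the assumed round values used in the estimate, not measurements):

```python
# Back-of-envelope check of the scoping numbers above.
steps_per_epoch = 6160
epochs = 100
sec_per_step = 1.2       # assumed realistic H100 production rate
runs = 3
usd_per_hour = 2.0       # H100 PCIe community, rounded

steps_per_run = steps_per_epoch * epochs              # 616k steps
hours_per_run = steps_per_run * sec_per_step / 3600   # ~205 h
total_usd = hours_per_run * runs * usd_per_hour       # ~$1232, vs a $118.90 balance
```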

Root cause: __getitem__ calls load_from_disk per shard, then bandpasses and
z-scores each window at runtime. This dominates training time, ~5× the GPU
forward pass.

Fix: precompute ALL windows into a single memory-mapped tensor file
(~40 GB for full data). __getitem__ becomes a single mmap read (~0.1 ms).
sec/step drops to ~0.3, bringing total runtime to ~51h across 3 A100 runs
= ~$100. Fits budget.

Building the precompute script now.
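A minimal sketch of the idea (shapes, names, and the plain-numpy class are illustrative; the real script writes the full ~40 GB tensor and wraps the reader in a torch Dataset):

```python
import numpy as np

# Sketch of the precompute fix: do the bandpass + z-score ONCE per window,
# write everything into one contiguous float32 file, and let __getitem__
# become a single memory-mapped slice read.

def precompute_windows(windows, path):
    # 'windows' are assumed already filtered + normalized here,
    # instead of being processed at train time
    arr = np.stack([np.asarray(w, dtype=np.float32) for w in windows])
    np.save(path, arr)                               # one mmap-friendly file
    return arr.shape

class MmapWindows:
    """__getitem__ backend; wrap in a torch Dataset in the real pipeline."""
    def __init__(self, path):
        self.data = np.load(path, mmap_mode="r")     # lazy, near-zero RAM
    def __len__(self):
        return len(self.data)
    def __getitem__(self, i):
        return np.array(self.data[i])                # one mmap read + copy
```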

## 2026-04-16 04:25 — FINAL: abl3 ep25 = 0.848, all pods killed

**abl3 (mask=0.75, unimodal A) epoch 25 AUROC = 0.848.**

Complete results table:

| Model            | mask     | L_self peak | ep5   | ep10  | ep15  | ep20  | ep25  |
|------------------|----------|-------------|-------|-------|-------|-------|-------|
| original A       | 0.50     | 0.476       | 0.783 | 0.736 | —     | —     | 0.703 |
| abl1 (pd=1)      | 0.50     | 0.438       | —     | —     | 0.749 | —     | —     |
| abl2 (sin-q)     | 0.50     | 0.559       | —     | —     | 0.784 | —     | —     |
| **abl3 (m=75)**  | **0.75** | **0.200**   | —     | —     | 0.838 | 0.845 | **0.848** |
| abl4 (full data) | 0.50     | 0.587+      | —     | —     | —     | —     | (killed; spike confirmed) |
| B (Δt=0)         | —        | —           | 0.660 | 0.844 | —     | —     | 0.847 |
| F (Δt>0)         | —        | —           | 0.652 | 0.859 | —     | —     | 0.835 |

**abl3 (0.848) ≈ B (0.847).** Unimodal JEPA with 75% masking matches
cross-modal JEPA to within 0.001. The mechanism story is complete.

abl4 (full data, 50% mask) showed an L_self spike reaching 0.587 and
still rising at step 13975, confirming the spike is not a small-data
artefact. Killed early: the spike is confirmed, and there is no need to
wait for its epoch-25 AUROC since we already know 50% mask at scale still
degrades.

All pods killed. Zero stale compute. Total ablation spend: ~$4.50.

## 2026-04-16 03:10 — AUROC confirms mechanism end-to-end

Epoch-15 AUROC on PTB-XL AF:

| variant         | L_self peak | AUROC @ ep15 |
|-----------------|-------------|--------------|
| original A      | 0.476       | 0.736        |
| abl1 (pd=1)     | 0.438       | 0.749        |
| abl2 (sin-q)    | 0.559       | 0.784        |
| **abl3 (m=75)** | **0.196**   | **0.838**    |
| (ref) B ep10    | —           | 0.844        |
| (ref) F ep10    | —           | 0.859        |

**abl3 matches B/F's AUROC at epoch 15.** The mechanism is fully confirmed:
eliminating the L_self spike (via a higher mask ratio) recovers downstream
AUROC to cross-modal levels. Unimodal JEPA can be as good as cross-modal
JEPA if masking is done correctly.
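For reference, these AUROC numbers come from fitting a linear probe on frozen embeddings. A self-contained sketch on synthetic embeddings (the real probe runs on encoder outputs of PTB-XL windows with AF labels; the shift/dimension here are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Minimal linear-probe sketch: logistic regression on frozen embeddings,
# scored by AUROC. Synthetic stand-in data below.
rng = np.random.default_rng(42)
emb_dim = 16
z_pos = rng.normal(0.5, 1.0, size=(200, emb_dim))   # stand-in AF embeddings
z_neg = rng.normal(0.0, 1.0, size=(200, emb_dim))   # stand-in non-AF
X = np.vstack([z_pos, z_neg])
y = np.array([1] * 200 + [0] * 200)

clf = LogisticRegression(max_iter=1000).fit(X, y)
auroc = roc_auc_score(y, clf.predict_proba(X)[:, 1])
```

(In the real pipeline the probe would be fit on a train split and scored on a held-out split; the in-sample score here is only to keep the sketch short.)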

Subtle finding from abl2: the sinusoidal query has a LARGER L_self spike
(0.559 vs orig 0.476) but HIGHER AUROC (0.784 vs 0.736). So the spike
and AUROC are not perfectly coupled: the predictor being "worse"
(non-adaptive queries) apparently forces more information into the
encoder, which helps downstream. Noting as an interesting secondary
finding, but abl3 is the main story.

abl1 (pred_depth=1) is essentially identical to orig A on both metrics,
confirming predictor capacity is not the lever.

### Paper now has a clean, precise story

1. Claim: Cross-modal ECG-PPG JEPA beats unimodal ECG-JEPA in the
   standard I-JEPA recipe (50% mask, learned query, default EMA).
2. Mechanism: at 50% mask the predictor finds a local-interpolation
   shortcut (25 visible context vs 25 target patches in contiguous
   blocks → a linear blend of adjacent patches works). Training dynamics:
   the easy phase finds the shortcut (L_self dip ~step 1500), refinement
   invalidates it (L_self spike ~step 4675), and the encoder locks into a
   self-consistent but AF-uninformative optimum.
3. Fixes: (a) mask ratio 0.75 denies the shortcut structurally → abl3
   matches cross-modal AUROC. (b) Cross-modal prediction blocks the
   shortcut by the same mechanism: 0% of the PPG is visible as context →
   no interpolation path → F and B both stable.
4. Δt direction doesn't matter (the K2 fail is a negative result that
   supports the mechanism: the Δt token is a tiny perturbation of the
   predictor's query set; what matters is whether interpolation is
   available, not where the targets sit on the time axis).

Actionable recommendation: ECG-JEPA (Weimann & Conrad) used 50% masking.
75% masking is a likely-free improvement, testable on PTB-XL directly.

### Status

- abl1 + abl2 pods killed. Answered their questions.
- abl3 running to epoch 25 for the final number. ~1 h left at $0.44/h.
- abl4 (full data) at step 9975 with L_self=0.54: the **spike IS present
  at full data**, just delayed. More data slows shortcut discovery but
  doesn't eliminate it. Confirms mask ratio is the architectural fix,
  not a small-data artifact.
- abl4 still has ~20h to go. Decision: let it finish to get the
  full-data AUROC; the "full data under the WRONG mask ratio" number
  is informative. At $0.44/h × 20h = $8.80. Still well under budget.

## 2026-04-16 02:05 — mask_ratio IS the lever (spike window confirmed)

Full matrix at the critical spike window (original A peaks at L_self=0.476 at step 4675):

step  | orig A | abl1 (pd=1) | abl2 (sin-q) | **abl3 (m=75)** | abl4 (full)
------+--------+-------------+--------------+-----------------+------------
1475  | 0.220  | 0.222       | 0.329        | **0.146**       | 0.296
2475  | 0.340  | 0.339       | 0.482        | **0.165**       | 0.233
3475  | 0.442  | 0.420       | 0.555        | **0.186**       | 0.208
4475  | 0.476  | 0.438       | 0.559        | **0.196**       | 0.260
4975  | 0.475  | 0.398       | 0.551        | **0.200**       | 0.287
5475  | —      | 0.334       | 0.512        | —               | 0.313

**abl3 (mask 0.75) has NO spike.** L_self rises monotonically from 0.146
(step 1475) to 0.200 (step 4975): a gentle climb of +0.05 over 3500 steps,
vs orig A's explosive +0.26 peak.

**abl1 (pred_depth=1) tracks orig A.** Predictor capacity is not the lever.

**abl2 (sinusoidal queries) has a LARGER spike than orig A** (0.559 peak vs
0.476). Removing the adaptive query hurts: the predictor can't route
context tokens to the targets it cares about.

**abl4 (full data) shows a muted spike** (0.208 → 0.313 over 2000 steps).
10× data slows shortcut discovery but doesn't eliminate it. Scale helps,
but mask_ratio is the cleaner fix.

### Revised mechanism — unified story

50% masking gives the predictor 25 target patches and 25 visible context
patches arranged in contiguous blocks. Early in training, the predictor
learns a short-range interpolation shortcut: predict masked patch `p` as
a linear blend of the adjacent visible patches. This gives a low L_self
quickly (dip at step 1500). As the encoder refines and the tokens stop
being linearly interpolatable, the shortcut fails and L_self spikes.

At 75% masking (12 visible vs 37 target), no local interpolation is
available, so the predictor MUST learn long-range structure from the
start. No dip, no rebound.

Cross-modal prediction is equivalent: 0% of the PPG is visible as context
(the PPG is entirely the target), so no interpolation shortcut exists.
F and B dodge the spike by the same mechanism as abl3.

**Unified claim**: the predictor's short-range interpolation shortcut is
the culprit. Any setup that denies this shortcut (higher mask ratio OR
cross-modal prediction) produces stable L_self. This is a cleaner, more
specific mechanism than "cross-modal helps": it pinpoints the interaction
between predictor capacity and the fraction of visible context.
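The shortcut claim is easy to illustrate on a toy signal (pure-numpy sketch, not the real model: "patches" are blocks of samples and the "predictor" is literal linear interpolation between visible samples):

```python
import numpy as np

# Toy illustration of the interpolation shortcut: on a smooth signal,
# predicting masked samples by linearly interpolating between the nearest
# visible samples is easy at 50% masking but much harder at 75%, where the
# visible anchors sit far apart. Block size and signal are arbitrary.
t = np.linspace(0, 4 * np.pi, 400)
x = np.sin(t) + 0.3 * np.sin(3 * t)

def interp_mse(mask_ratio, block=8):
    n_blocks = len(x) // block
    keep_every = int(round(1 / (1 - mask_ratio)))   # 0.75 -> keep every 4th block
    visible = np.zeros(len(x), dtype=bool)
    for b in range(0, n_blocks, keep_every):
        visible[b * block:(b + 1) * block] = True
    pred = np.interp(t, t[visible], x[visible])     # the "shortcut" predictor
    return float(np.mean((pred[~visible] - x[~visible]) ** 2))

mse_50 = interp_mse(0.50)
mse_75 = interp_mse(0.75)
# mse_75 substantially exceeds mse_50: the shortcut degrades as mask ratio rises
```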

### Next test: AUROC recovery

Does abl3's no-spike training actually produce better AF representations?
Kicked off the PTB-XL fetch on the abl3 pod in parallel with training. Will
probe all 4 ablation ckpts once training completes (~2-3 h).

Prediction: if the mechanism story is correct, abl3's AUROC @ ep25 beats
orig A's 0.703 and approaches F/B's 0.83-0.85.

## 2026-04-16 01:15 — ablation early signal: abl3 (mask 75%) breaks the pattern

L_self side-by-side at matched steps (only the key ones):

step  | orig A | abl1(pd=1) | abl2(sin-q) | abl3(m=75) | abl4(full)
------+--------+------------+-------------+------------+-----------
 975  | 0.247  | 0.248      | 0.267       | 0.197      | 0.390
1475  | 0.220  | 0.223      | 0.292       | 0.144      | 0.285 (interp)
1775  | 0.243  | 0.255      | 0.371       | 0.148      | 0.269
1975  | 0.256  | 0.269      | 0.403       | —          | 0.254
2175  | 0.283  | 0.297      | 0.447       | —          | 0.230 (interp)

**abl3 (mask 0.75) is markedly different.** L_self at step 1775 is 0.148,
lower than original A's minimum of 0.220. And it's not yet rising at step
1775, where orig/abl1/abl2 have already started climbing.

**abl1 (pred_depth=1) ≈ orig A.** The predictor size was not the driver.

**abl2 (sinusoidal query) is WORSE than orig A.** By step 1775 it's at 0.371
vs orig A at 0.243. Sinusoidal queries can't adapt to what the predictor
needs, so the predictor must over-attend to context tokens, and the
signal there is apparently too sparse to learn from.

**abl4 (full data) is descending monotonically** at step 1975 (L_self=0.254).
Too early to say if it avoids the spike: original A's spike was at step 4675,
and full data is ~10× slower per logical training "epoch", so the spike
location shifts later in wall-clock terms. Continue monitoring.

**Revised mechanism hypothesis**: unimodal JEPA at mask_ratio=0.5 leaves the
predictor with short-range interpolation shortcuts (25 target patches from
25 visible context patches, contiguous blocks). Early training finds these
shortcuts (L_self dips at step 1500). As the encoder refines and
invalidates the shortcuts, L_self rises. At a 75% mask ratio, the shortcuts
don't exist (37 target patches from only 12-13 visible), so the predictor
learns robust long-range structure from the start. No dip-and-rebound.

This is mechanism-specific, falsifiable, and explains both:
(a) why F/B didn't drift (the cross-modal loss provides a diverse, non-local
    target that can't be locally interpolated)
(b) why abl3 fixed it in unimodal A (higher masking also eliminates the
    local shortcut)

Now the critical follow-up: does abl3's epoch-25 AUROC match F/B (~0.84)?
That would complete the mechanism-to-downstream story.

Cost check: 4 × A40 × $0.44/h × ~45 min = ~$1.32 so far. abl1/2/3 have
~3.5 h to go (~$5); abl4 ~30 h (~$13). Total ~$20 for the suite. Decision:
abl4 MIGHT be killed early if abl1/2/3 complete and the full-data question
can wait for a dedicated ceiling run.

## 2026-04-16 00:30 — 4 parallel A ablations launched on A40 secure pods

To find the real mechanism behind A's degradation, running 4 ablations
in parallel. Each is identical to original A except for one variable:

abl1: pred_depth 4 → 1 (pod 0n8im5mri5hjk0, 69.30.85.78:22121)
abl2: query_mode learned → sinusoidal (pod a2pye2ki7uvw47, 194.68.245.208:22053)
abl3: mask_ratio 0.5 → 0.75 (pod jwwln4klav8674, 194.68.245.207:22198)
abl4: subset_frac 0.10 → 1.00 (pod 4pvp7yb1rmbxta, 194.68.245.207:22197)

All on A40 secure ($0.44/h × 4 = $1.76/h aggregate). 25 epochs each.
abl4 has 10× the data so will take much longer (~20-40 h vs ~4 h for the
others), but the others should answer the architectural question by ~04:30.

Hypotheses:
- abl1 (smaller predictor): if predictor capacity drove the overfit, the
  L_self spike shrinks. AUROC may improve.
- abl2 (sinusoidal query): if learned-query specialization drove the
  overfit, the spike shrinks. AUROC may improve.
- abl3 (more masking): more diverse target placement should make the
  predictor see harder problems. If the spike is "predictor settles into
  an easy attractor", this should fix it.
- abl4 (full data): if the 10% subset was the culprit, the spike disappears
  at scale. If it is still present, it's an architectural issue independent
  of data scale.

Spike location to compare against: original A had its L_self spike peaking
at 0.475 at step 4675 (when τ=0.9999).

## 2026-04-15 21:59 — slow-τ A ablation RESULT: hypothesis FALSIFIED, pod killed

Side-by-side L_self at matched steps:

step  | orig A | slow-τ A | orig τ | slow τ
------+--------+----------+--------+--------
1475  | 0.22   | 0.22     | 0.9969 | 0.9962
1975  | 0.26   | 0.28     | 0.9974 | 0.9963
2975  | 0.40   | 0.49     | 0.9988 | 0.9967
3975  | 0.45   | 0.60     | 0.9997 | 0.9972
4975  | 0.47   | 0.60     | 0.9999 | 0.9977
5475  | 0.46   | 0.55     | 0.9999 | 0.9979

Slow-τ A's L_self rose MORE than original A's, not less, despite τ being
well below saturation through the critical window. The "τ saturation
amplifies the L_self spike" hypothesis is falsified.

The L_self rise must be driven by something else. Top candidates:
1. Masking strategy (multi-block, 50% ratio) + small-data regime: the
   predictor overfits to easy target patches early (dip at step 1500),
   then the distribution of hard targets dominates as the encoder refines.
2. Query-embedding parameter specialization: the learnable query tokens
   narrow the predictive scope, and random target placement starts hitting
   targets they can't handle.
3. Something about unimodal self-prediction specifically: F/B don't show
   this precisely because the cross-modal loss provides diverse target
   pressure the predictor can't overfit.

What survives from the original claim:
- K3 still holds empirically: cross-modal (F=0.835, B=0.847) >> unimodal
  (A=0.703) at epoch 25.
- The mechanism story needs replacing. "Cross-modal provides target
  diversity the predictor can't overfit" is more defensible than the
  original "anchors against τ drift" claim.

Pod y27osaqv7amz7d killed. Ablation cost: ~$0.35 for ~2 h on A5000 community.

Impact on user's plan:
- The conditional was: if the spike disappears → full-data B run. The spike
  did not disappear, so full-data B is not the automatic next step. BUT the
  empirical K3 result (cross-modal >> unimodal) still holds and may be
  even stronger on full data. Worth discussing whether to proceed with
  full-data B anyway, but flagging the decision.

## 2026-04-15 21:19 — slow-τ A ablation training (early signal: L_self rising even pre-τ-saturation)

Slow-τ A early trajectory (log_every=25):
step 0:    L_self = 1.167 (random init)
step 475:  L_self = 0.390
step 975:  L_self = 0.247
step 1475: L_self = 0.223 ← minimum
step 1975: L_self = 0.282
step 2175: L_self = 0.313 ← rising, tau still only 0.9963

Original A at comparable steps (before any spike):
step 500:  L_self = 0.380
step 1000: L_self = 0.247
step 1500: L_self = 0.220 ← minimum
step 2000: L_self = 0.258
step 2225: L_self = 0.283

Slow-τ A is tracking original A essentially step-for-step so far. Both hit
their minimum at ~step 1500, and both start rising by step 2000. **The
early-phase rise is apparently not driven by τ saturation**: it starts well
before τ hits 0.999.

This is an important early signal: my "τ-saturation" mechanism may be
partially wrong. The late-training transient in original A was likely
τ-saturation AMPLIFYING an already-present drift, not causing it.

Critical diagnostic window: steps 4000-5500, where original A had its peak
(0.48 at step 4675). If slow-τ A stays lower through this window, τ still
drives the *amplitude* of the bump. If slow-τ A also spikes at step 4675,
τ is not the driver.

## 2026-04-15 20:20 — slow-τ A ablation launched

Ablation pod: y27osaqv7amz7d (RTX A5000 community, FR). Config:
ema_end = 0.999 (vs 0.9999 in original)
ema_warmup_frac = 0.60 (vs 0.30 in original)
everything else identical: subset_frac=0.10, bs=64, 25 epochs, seed=42
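For orientation, a sketch of the EMA momentum schedule these configs imply. Only `ema_end` and `ema_warmup_frac` appear in the run configs; the linear-ramp shape and the `ema_start=0.996` default (taken from the F-run note "τ schedule (0.996→0.9999)") are assumptions:

```python
# Assumed EMA momentum schedule: linear ramp from ema_start to ema_end over
# the first ema_warmup_frac of training, then flat. Shape is a guess.
def tau_at(step, total_steps, ema_start=0.996, ema_end=0.9999,
           ema_warmup_frac=0.30):
    ramp_steps = max(1, int(ema_warmup_frac * total_steps))
    frac = min(step / ramp_steps, 1.0)
    return ema_start + frac * (ema_end - ema_start)

# Slow-tau ablation: tau_at(step, total, ema_end=0.999, ema_warmup_frac=0.60)
# Per-step target-encoder update (standard JEPA/BYOL form):
#   p_target = tau * p_target + (1 - tau) * p_online
```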

Prediction:
- If the A spike at step 4675 disappears + AUROC recovers to ~0.84 → the
  τ-saturation mechanism is confirmed; the cross-modal anchor story holds.
- If the spike disappears BUT AUROC stays at ~0.70 → original A's problem
  wasn't τ saturation per se; the unimodal objective just doesn't contain
  enough AF-discriminative signal at this data scale.
- If the spike is still present → the τ schedule isn't the lever; something
  deeper.

Conditional on the spike disappearing + AUROC recovering, the next step is
the full-data B run (100 epochs, H100, 814h), the ceiling measurement.

## 2026-04-15 20:00 — refined mechanism for A degradation (not monotonic drift)

After pulling the full WandB curves, correcting my earlier "A drifts
monotonically" claim. A actually has:

- L_self minimum at step 1500 (value 0.22)
- τ-saturation TRANSIENT at step 4675 (value 0.475), 3× the bump F/B show
- recovery by step 7400 (value 0.20)
- late-training slow climb to 0.20 at step 15350

**F and B also show a late-training L_self rise** (0.15 → 0.27). Only the
mid-training transient is unique to A.

Key finding: A's loss *recovers* but its AUROC *doesn't*. AUROC dropped from
0.783 (ep5) → 0.703 (ep25) even though final L_self is comparable to F/B.
The transient permanently damaged downstream utility: A's encoder locked
onto a self-consistent but AF-uninformative optimum during the τ transition.

Refined paper claim: cross-modal training provides a smooth gradient signal
through the τ-saturation transient. Without it (A), the encoder finds a
poor local optimum and doesn't recover downstream quality even when the
loss recovers. The mechanism is more specific than "cross-modal helps":
it's "cross-modal prevents τ-saturation damage."

## 2026-04-15 19:30 — FULL K-gate results: K2 FAIL, K3 PASS

All 4 pods ran to epoch 25. Full probe matrix on PTB-XL AF:

| Model       | ep5    | ep10   | ep25   |
|-------------|--------|--------|--------|
| F (Δt>0)    | 0.6521 | 0.8586 | 0.8352 |
| B (Δt=0)    | 0.6599 | 0.8440 | 0.8467 |
| A (uni)     | 0.7832 | 0.7357 | 0.7025 |
| C (InfoNCE) | —      | —      | stuck at loss ≈3.0; under-tuned baseline, not usable |

**K2 FAIL: F − B = −0.012 at epoch 25 (target was ≥ +0.02).**
**K3 PASS BIG: F − A = +0.133 at epoch 25, and A is DEGRADING.**

Written up in `docs/e2_e3_results.md` with full interpretation and a
proposed pivot (cross-modal-anchor paper instead of the Δt paper).

Spend total: ~$6.14 across 4 pods × ~4.5 h. Vastly under budget.

Pods still have ckpt_final.pt but training is done. Ready to terminate.

## 2026-04-15 11:55 — FIRST AUROC: F at epoch 10 = 0.859

**F (PhysioJEPA, Δt>0) AUROC on PTB-XL AF detection:**
epoch 5 (step ~3200): **0.652**
epoch 10 (step ~6400): **0.859** ← latest

The jump 0.65 → 0.86 in 5 epochs tells us F is rapidly absorbing
AF-relevant features. The trajectory is still climbing, so we'd expect
further gains by epoch 25.

Framing correction (user call-out): "approaching Weimann 0.945" overstates
the comparison. Weimann used 12 leads × 1M records × 100 epochs; F is
single-lead II × 40k windows × 10 epochs. What matters is the *trajectory*,
not the ceiling.

The probe pipeline had one race condition: probe_when_ready.sh saw the
ptbxl_af.npz file appear at ~50% (np.savez_compressed wrote non-atomically),
fired eval_checkpoint.py, which tried to unzip an incomplete file →
BadZipFile. Ran the probe manually once the write finished. A retro fix to
probe_when_ready.sh would be `[ -f foo ] && file foo | grep -q Zip`, but
we're past it now.

**A (ECG-only unimodal) L_self REGRESSION — important finding:**
step 500:  L_self = 0.380
step 1000: L_self = 0.247 ← minimum
step 1500: L_self = 0.220 ← actual minimum
step 2500: L_self = 0.331
step 3500: L_self = 0.442
step 4500: L_self = 0.477 ← now
step 5000: L_self = 0.472 (tau = 0.9999)

A is DRIFTING: L_self doubled from 0.22 to 0.47 as the EMA τ saturated near
1.0. Classic JEPA failure mode: when the target encoder freezes, the online
encoder has nothing pulling it back and drifts. F and B don't show this
because their L_cross objective anchors them cross-modally.

Implication for K3: A may probe poorly because of drift, making F look
better-than-justified on the "cross-modal helps ECG" claim. Need to note
this as a limitation in the paper. The honest fix would be a smaller
final τ (say 0.999 instead of 0.9999) for A specifically, but we'll note
it and move on for now.

**C (InfoNCE) is NOW LEARNING** after the τ fix + passing LR warmup:
step 0:   loss = 4.168 (random)
step 100: 4.159 (still random)
step 500: ~3.8 (starting to move)
step 800: 2.90 ← first clear signal
step 825: 2.98
Slow but real. InfoNCE with batch 64 is known-weak (CLIP uses 32k). Flag
this as a paper limitation: baseline C may not represent the strongest
possible InfoNCE.

State (12:05):
F: step 7400, L_cross=0.247 (still dropping), epoch-10 ckpt probed → 0.859
B: step 2250, L_cross=0.401, no ckpt yet (epoch 5 ≈ step 3200)
A: step 4600, L_self=0.464, ckpt_epoch005.pt available
C: step 825, loss=2.98, climbing out of random

Now running: PTB-XL fetch_v3 on the A, B, C pods in parallel (~10 min).
Will probe A's ckpt_epoch005.pt the moment the npz lands on the A pod.

## 2026-04-15 11:46 — F broke through the "0.40 floor" → 0.33; C still stuck (LR warmup)

F at step 4750: L_cross = **0.327**. The earlier "asymptote at 0.40" call
was wrong twice over: the model continued to descend. Trajectory:

step 1100: 0.419
step 2150: 0.400
step 2950: 0.377
step 4225: 0.384 (oscillating in 0.38-0.40)
step 4700: 0.374
step 4750: 0.327 ← clear breakthrough

Possible explanation: the τ schedule (0.996 → 0.9999) has nearly completed
(τ = 0.9999 at step 4700+). A tighter EMA target gives a cleaner gradient
signal, so the model can now refine against the L_cross target. This is
consistent with published JEPA training dynamics.

C: still stuck at loss ≈ 4.16 even with the fixed τ init. The most likely
cause is LR warmup (warmup_steps = 5540, currently at step 75 → LR ≈
1.4e-6). Needs another ~500 steps to exit the bottom of the ramp. Will
revisit at the next check.

B step 1175: L_cross = 0.459, slope −0.04 per 100 steps.
A step 2250: L_self = 0.297.
PTB-XL fetch: 39%, ETA 24 min.
Probe waiter: still polling.

## 2026-04-15 11:30 — F's epoch-5 ckpt landed; B looks competitive; C broken (init bug)

State:
- F: step 4225, L_cross=0.384, L_self=0.139, ckpt_epoch005.pt saved.
- B: step 1000, L_cross=0.499, L_self=0.339 — dropping smoothly.
- A: step 1850, L_self=0.238 — fast convergence on the unimodal task.
- C: step 225, loss=4.07 (random baseline = ln(64) = 4.158). **Bug**.

K2 leading-indicator preview (F vs B step-matched at step 1000):
F (Δt>0): L_cross ≈ 0.43 (interpolated)
B (Δt=0): L_cross = 0.499
Gap = 0.07 → F leads, but B is currently dropping faster.
The K2 jury is still out: need B at step 3000+ to see the asymptote.

C bug: the init `log_tau = 0` makes the logit-temperature multiplier 1.0,
i.e. a physical τ = 1.0 (very soft InfoNCE). The standard τ = 0.07 means a
multiplier of ≈14. Loss is stuck near ln(64) because logits in [-1, 1] are
too small to be informative. Fix: init `log_tau = log(14)`. Will redeploy
C after F's probe AUROC lands.
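A numpy stand-in for the bug (illustrative, not the C baseline's actual module): with `log_tau = 0` the cosine logits are unscaled and the loss hugs the ln(64) random baseline even on well-matched pairs, while `log_tau = log(1/0.07) ≈ log(14.3)` separates the same pairs immediately:

```python
import numpy as np

# Symmetric-in-shape InfoNCE on L2-normalized embeddings; logits are cosine
# similarities scaled by exp(log_tau). Batch size 64 matches the run config.
def info_nce(z_a, z_b, log_tau):
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = np.exp(log_tau) * (z_a @ z_b.T)          # (B, B) scaled cosines
    logits -= logits.max(axis=1, keepdims=True)       # stable log-softmax
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(logp)))             # diagonal = positives

rng = np.random.default_rng(0)
B, D = 64, 32
z = rng.normal(size=(B, D))
paired = z + 0.05 * rng.normal(size=(B, D))           # strongly matched pairs

random_loss = info_nce(z, rng.normal(size=(B, D)), log_tau=0.0)  # ~ln(64)
loss_soft = info_nce(z, paired, log_tau=0.0)          # buggy init: barely moves
loss_sharp = info_nce(z, paired, log_tau=np.log(1 / 0.07))  # fixed init
```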

PTB-XL fetch: at 25% download (15k of 43k files via concurrent HTTP).
ETA ~30 min until the npz exists. Probe waiter still polling.

## 2026-04-15 11:14 — auto-probe armed; PTB-XL switched to LR variant

User correctly called out two things:
1. F's L_cross is not at a hard floor — still descending slowly
   (0.001-0.005 per 25 steps). Logged.
2. Don't interrupt training. Wait for the natural epoch-5 ckpt.

Plan in motion:
- F training continues, will hit the epoch-5 ckpt naturally (~step 3200,
  ~14 min from now).
- PTB-XL fetch_v3 launched on the F pod: per-file concurrent HTTP download
  of the 100 Hz variant (1.5 GB, 32 threads) — much faster than the 3 GB
  monolithic zip via wget, which was projecting 2h7m.
- probe_when_ready.sh waiter armed on the F pod: polls the run_dir for *.pt
  and ptbxl_af.npz, fires eval_checkpoint.py the moment both exist.
- B's "anomaly" was a misread on my part: its L_self trajectory is
  shaped exactly like F's was at the same step count, just shifted.

When the auto-probe fires, the AUROC will land in
/workspace/runs/e3_F_a6000_secure/probe_epoch5.json.

## 2026-04-15 11:08 — correction: F's L_cross is STILL descending, not at a hard floor

The earlier read of "L_cross asymptote at ~0.40" was premature. Looking at
the actual trajectory more carefully:

step 1100: 0.419
step 2150: 0.400
step 2300: 0.392
step 2750: 0.399
step 2900: 0.395
step 2950: 0.377 ← still dropping
step 2975: 0.389 ← oscillating in the 0.38-0.40 band

The model is in a slow-descent regime (~0.001 per 25 steps when measured
over a 100-step window). Not flat. Honest summary: F is *near* its
asymptote but hasn't fully reached it. The 0.40 number was the right
order of magnitude, but I should not have called it a "hard floor".

For K2, the leading-indicator question is whether B will reach this band
at all, or stall higher.

B health check (was flagged as anomalous):
step 100: L_cross=0.841 L_self=0.997
step 250: L_cross=0.602 L_self=0.859
step 525: L_cross=0.588 L_self=0.605
The L_self trajectory looks healthy — same shape as F's at the matched
step count (just shifted). No EMA misconfig evident. The earlier suspicion
was an over-read.

A (unimodal, K3 reference):
step 925: L_self=0.256 (already lower than F's L_self trajectory at
the same step count). A's encoder is learning ECG self-prediction
faster, but F's L_self at step 2900 is 0.144, lower still. The K3
comparison needs A to reach step 2900+ for a fair shot.

Probe plan: wait for F's natural epoch-5 ckpt (~14 min from now ≈
step 3200). Then linear probe vs PTB-XL AF.

PTB-XL fetch: the wget download is at 71 MB / 3 GB at 200 KB/s → ETA 2h7m.
Too slow. Need to cancel + use a different mirror.

## 2026-04-15 10:58 — F at L_cross=0.40 plateau; B chasing; A unimodal also at ~0.42

WandB runs (all live):
F (PhysioJEPA): https://wandb.ai/guy-na8/physiojepa/runs/m0cdwa8a
A (ECG-only): https://wandb.ai/guy-na8/physiojepa/runs/t9486rf9
B (Δt=0): https://wandb.ai/guy-na8/physiojepa/runs/9gwflgr5
C (InfoNCE): https://wandb.ai/guy-na8/physiojepa/runs/unfs8uzf

Step-matched comparison at step 250 (both still in warmup):
F (Δt>0): loss=0.864 L_cross=0.607 L_self=0.855
B (Δt=0): loss=0.860 L_cross=0.602 L_self=0.859
A (uni):  loss=0.546 L_cross=0 L_self=0.546

Δt vs no-Δt are identical at step 250, confirming the warmup-phase
prediction.

F's L_cross trajectory (now at step 2325):
step 1100: 0.419
step 1500: 0.408 (interpolated)
step 2150: 0.400 ← inflection
step 2300: 0.392 (very slowly continuing to drop)
step 2325: 0.401 (oscillating)

**F's L_cross has converged to ~0.40 ± 0.02.** This is the asymptote:
1200 steps of training without a further drop. Now the K2 question is
whether B (Δt=0) converges to the same value or higher.

F's L_self (auxiliary) at step 2325 = 0.147; A's L_self at step 425 = 0.42.
Comparing at step 425 only: A's L_self is 0.42 where F's was ~0.55 at the
same step count, so A is decreasing faster early. Need to wait for A to
catch up to step 2000+ for a fair K3 comparison.

PTB-XL: relaunched the fetch with the v2 script (wget full zip, mp.Pool 16
workers). Should complete in ~10 min vs the 2 h v1 was projecting.

Total spend so far: ~80 min × $1.36/h ≈ $1.81. K2 ETA ~10 hours from now.

## 2026-04-15 10:36 — A/B/C unblocked via index-copy from F; F at step 1125

A/B/C had been stuck in `prepare_data.py` for 27 min: the network FS on
A and B (mfs#runpod.net) makes the per-shard load_from_disk pathological.
Killed prepare_data on all 3, scp'd F's already-built `mimic_index.json`
(48 MB) to each, then launched training directly.

Two false starts during the relaunch:
- First attempt: forgot PYTHONPATH=src; all 3 crashed with
  ModuleNotFoundError: physiojepa.
- Second attempt: setsid stripped the env, and C crashed again. Used an
  explicit `export PYTHONPATH=src` inside the setsid bash and it stuck.

All 4 now training. Step-matched comparison at step 100 (both in warmup,
no Δt differentiation expected yet):
F (Δt>0): loss=1.135 L_cross=0.836 L_self=0.998
B (Δt=0): loss=1.140 L_cross=0.841 L_self=0.997
A (uni):  loss=0.834 L_self=0.834

Identical so far. The real K2 leading-indicator window is around
L_cross ≈ 0.4 (where the model can no longer reduce loss by predicting
average PPG morphology weighted by phase and has to actually use the Δt
offset). F is currently at step 1125, L_cross=0.418 — entering that
boundary now.

PTB-XL fetch: killed. The download was partial (135 MB vs ~3 GB) and the
zip extraction silently failed, but wfdb still found some 1754 records
(probably from prior runs). Will set up via a cleaner path before the K2
eval.
| |
| ## 2026-04-15 10:22 β F at step 425, A/B/C still indexing (network FS) |
| |
| F (PhysioJEPA, A6000) at step 425, loss 1.46 β 0.72 (51% reduction): |
| step 250: loss=0.864 L_cross=0.607 L_self=0.855 |
| step 350: loss=0.785 L_cross=0.595 L_self=0.636 |
| step 425: loss=0.717 L_cross=0.580 L_self=0.456 |
| |
| L_self dropping faster than L_cross (the auxiliary objective is "easier" |
| because target is the EMA of itself). L_cross plateauing in the 0.55-0.60 |
| range β model is finding the cross-modal predictability ceiling for the |
| random init, will resume after a few more epochs. |
| |
| Steady speed: 275 steps in ~13 min β **2.8 sec/step** in production |
| (slower than benchmark β DataLoader+wandb sync adds overhead). |
| Projection: 14k steps Γ 2.8 s β **~11 hours** to epoch 25 on F. |
| |
| A/B/C status: still in prepare_data.py (5.5 min elapsed, expected ~5). |
| Discovery: A and B use **network-mounted /workspace** (`mfs#...runpod.net`) |
| because they're secure-cloud pods. C uses local SSD (community). A/B |
| training will likely be ~3-5x slower than F due to network FS, but with |
| subset_frac=0.10 the OS page cache should warm up after a few epochs. |
| |
| PTB-XL fetch kicked off in parallel on F pod (background nohup). |
| Output to /workspace/cache/ptbxl_af.npz when done. |
|
|
| Total spend so far: ~25 min × ~$1.36/h ≈ $0.57.
| Projected total: ~11 h × ~$1.36/h ≈ ~$15 to K2 verdict. WELL within budget.
|
|
| ## 2026-04-15 10:14 — F TRAINING, loss decreasing cleanly
|
|
| F (PhysioJEPA, A6000): |
| step 0: loss=1.458 L_cross=1.126 L_self=1.107 |
| step 25: loss=1.438 L_cross=1.108 L_self=1.100 |
| step 50: loss=1.369 L_cross=1.048 L_self=1.069 |
| step 75: loss=1.259 L_cross=0.949 L_self=1.036 |
| step 100: loss=1.135 L_cross=0.836 L_self=0.998
| step 125: loss=1.020 L_cross=0.732 L_self=0.961
| step 150: loss=0.946 L_cross=0.664 L_self=0.940
|
|
| L_cross dropping 1.126 → 0.664 in 150 steps: a strong learning signal.
| WandB run live at https://wandb.ai/guy-na8/physiojepa/runs/m0cdwa8a
| |
| Wall-clock observed: 150 steps in ~5 min ≈ **~2 sec/step** in
| production (worse than the inline benchmark's 0.58 because production
| has 8 workers contending vs 1 iterator in the benchmark, and each
| log line every 25 steps adds a disk write + wandb sync). At 2 s/step:
| 25 epochs × ~640 steps ≈ ~7 hours per pod on A6000-class hardware
| 4 pods in parallel: ~7 h × $1.36/h aggregate ≈ ~$10 to K2
| |
| A/B/C still building index (~5 min sequential scan of 412 shards). |
| Should start training within ~3 min. |
| |
| ## 2026-04-15 10:10 — solved: it WAS training; Python stdout buffered through tee
| |
| Inline benchmark on F (manual DataLoader iteration) revealed: |
| - First batch: 3.5 s (worker startup, expected) |
| - First step compute: 2.4 s (CUDA warmup, expected) |
| - **Steady-state: ~0.58 s/step on RTX A6000** |
| - Loss decreasing 1.24 → 1.04 over 5 iters
| |
| Training was working all along. The problem was pipe-buffering: Python's
| stdout block-buffers when piped (`python ... | tee ...`), so the
| `[step N]` print lines never flushed to the log file. Fixed with
| `python3 -u` plus `PYTHONUNBUFFERED=1` in pod_bootstrap.sh. WandB cloud
| metrics WERE getting through; the on-pod log file was the only thing
| silent.
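For the record, the same effect can be forced from inside the script (a sketch; the shipped fix is the `-u` flag in pod_bootstrap.sh, not in-code changes):

```python
import sys

# When stdout is a pipe (python train.py | tee train.log), CPython switches
# from line-buffering to block-buffering, so short "[step N]" prints can sit
# unflushed for minutes. Any one of these restores prompt output:
#   1. launch with `python3 -u` or PYTHONUNBUFFERED=1 (the fix used here)
#   2. force line-buffering from inside the script (Python 3.7+):
if hasattr(sys.stdout, "reconfigure"):
    sys.stdout.reconfigure(line_buffering=True)
#   3. flush explicitly at each log line:
print("[step 0] loss=1.458", flush=True)
```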
|
|
| Wall clock projection (with subset_frac=0.10, log_every=25):
| - F (A6000): 0.58 s/step × 25 epochs × ~640 steps/epoch ≈ **2.5 h**
| - A (A5000): probably ~1.2× slower, ~3 h
| - B (A40): similar to A6000 (similar perf class), ~2.5 h
| - C (A5000): ~3 h
| - Total spend to K2: ~3 h × $1.36/h aggregate = **~$4**
|
|
| All 4 pods redeployed with `-u`. Now WAIT for first [step] logs to confirm. |
|
|
| ## 2026-04-15 10:05 — even after PTT cut, F still CPU-bound; subset_frac=0.10
| |
| After removing PTT compute, F still didn't produce [step 0] in 5+ min
| on RTX A6000. Diagnosed `__getitem__` at 6-19 ms per call (fine), so the
| real cost is per-shard `load_from_disk` × 412 shards × 8 workers = ~3000
| shard opens before the first batch. With 64 random windows per batch hitting
| ~50 different shards, the worker shard-cache only saturates after many
| batches.
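The per-worker shard cache is just process-local memoization; a minimal sketch, with a hypothetical loader standing in for the HF `datasets.load_from_disk` call:

```python
from functools import lru_cache

def _load_shard_slow(shard_path: str) -> dict:
    # Hypothetical stand-in for datasets.load_from_disk(shard_path),
    # which pays Arrow open/mmap cost on every call.
    return {"path": shard_path}

# lru_cache is per-process, so each of the 8 DataLoader workers warms its
# own copy: worst case ~8 x 412 opens before the caches saturate.
@lru_cache(maxsize=512)
def open_shard(shard_path: str) -> dict:
    return _load_shard_slow(shard_path)
```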
| |
| Cut: subset_frac=0.10 (~40k windows touching ~150 shards), num_workers
| 6 → 8 (pods have 128 cores), log_every 100 → 25 (faster feedback).
|
|
| Trade: the K2 verdict now uses ~81 hours of training data (10% of 814 h)
| instead of the full 814 h. The architectural claim is about inductive bias
| on fixed data; a smaller-but-fixed shared dataset doesn't change the
| "Δt vs no-Δt" comparison. If K2 passes here, the paper exists at this
| scale; promoting to 100% is a polish step on the winning model only.
|
|
| All 4 pods redeployed. |
|
|
| ## 2026-04-15 10:00 — F was CPU-bound on per-window PTT, redeployed all with fast `__getitem__`
|
|
| After the CUDA fix, F started training but GPU util stayed at 18-26%:
| workers running Pan-Tompkins peak detection per window were blocking the
| data path. ~10 min into training, step 0 still hadn't logged.
|
|
| Cut: removed the `_window_ptt_ms` call from `__getitem__`. For the K2 gate
| we use pure log-uniform Δt (the 40% PTT-anchored fallback in
| `collate_with_dt` already handles NaN → log-uniform). The K2 question is
| "does Δt>0 beat Δt=0?", not "does ground-truth-PTT-anchored Δt beat
| log-uniform Δt?"; the latter is a hyperparameter test deferred to
| ablation A5.
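For reference, the log-uniform Δt draw is one line; a sketch with illustrative bounds (not the repo's configured range):

```python
import math
import random

def sample_dt_ms(lo_ms: float = 10.0, hi_ms: float = 2000.0) -> float:
    # Uniform in log-space: every decade of lag is equally likely, so small
    # physiologically plausible offsets aren't drowned out by large ones.
    u = random.random()
    return math.exp(math.log(lo_ms) + u * (math.log(hi_ms) - math.log(lo_ms)))
```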
|
|
| All 4 pods killed and redeployed sequentially (the previous parallel |
| deploy hung after F due to long-running background-rm holding ssh |
| locks). Sequential scp+launch worked cleanly. F has cached download + |
| index so should resume fast (~1 min to first step). |
|
|
| Wasted spend: F's first 10 min on CPU-bound training ≈ $0.08. Acceptable.
|
|
| ## 2026-04-15 09:55 — major fix: switch from uv venv to system python (CUDA mismatch)
|
|
| Worse problem found: the F pod (RTX A6000, CUDA 12.4 driver) ran the trainer
| on CPU, not GPU. Diagnosis: uv resolved torch==2.11.0+cu130 from PyPI, which
| needs driver ≥555. The runpod image's *system* Python already has torch
| 2.4.1+cu124 properly configured.
|
|
| Fix: bootstrap.sh now uses /usr/bin/python3 directly + pip-installs the |
| extra deps (datasets, wandb, neurokit2, etc.) into system site-packages. |
| Skips uv venv entirely on the pod. Verified torch 2.4.1+cu124 sees the |
| A6000 with `torch.cuda.is_available() == True`. |
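The pre-flight check, roughly (a sketch; exact version strings differ per image):

```python
import torch

# If the wheel's CUDA build needs a newer driver than the pod has, torch
# imports fine but silently reports no GPU -- so check before training.
ok = torch.cuda.is_available()
print(f"torch {torch.__version__} (built for CUDA {torch.version.cuda}) "
      f"cuda_available={ok}")
if ok:
    print("GPU:", torch.cuda.get_device_name(0))
```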
|
|
| Killed all 4 pods' running procs and redeployed. F skips download (cache |
| intact); A/B/C re-download. |
|
|
| Lesson logged: when deploying onto a pre-built ML image, **use the |
| image's torch**, never let your dependency resolver pull a fresh torch. |
| The image vendor matched torch to driver for a reason. |
|
|
| ## 2026-04-15 09:45 — F crashed on first epoch, others mid-bootstrap
|
|
| F pod made it all the way through download + index build (~10 min) and |
| started training, then **PicklingError on the closure-based collate_fn** |
| when DataLoader spawned workers. Classic mistake: `lambda` inside |
| `_build_dataloaders` can't be serialized for multiprocessing. Refactored |
| to a top-level `_Collator` class. Smoke test passes. F redeployed. |
| |
| Other pod failures along the way:
| - A: nohup didn't survive ssh disconnect → switched to a setsid+nohup pattern.
| - B: uv chose Python 3.14, and the matplotlib wheel install hit a
|   stale-file-handle on the volume → pinned `requires-python` to
|   `>=3.11,<3.13` and added `--link-mode=copy` to uv sync.
| - pod_bootstrap path-case bug → handled both PhysioJEPA and physiojepa.
| - Tar perms from `.claude`/`.agents` folders → excluded them.
| - `rm -rf PhysioJEPA` failing on the volume's stale file handles → switched
|   to mv-rename + background rm.
| |
| Bootstrap timing observed: |
| - HF MIMIC download (412 shards / 1.5 GB): ~50 s on RTX A6000 secure pod |
| - uv sync (~100 packages incl. torch): ~3 min on cold cache, ~30 s warm |
| - Index build (sequential scan, 412 shards): ~5 min on A6000 |
| |
| Cumulative wasted spend so far: ~30 min × $1.36/h ≈ $0.70. Acceptable.
| |
| ## 2026-04-15 09:25 — 4 pods running, 3 deploy-fanned, F started bootstrap
| |
| State: pod_create is non-idempotent (lesson). Probing for GPU availability |
| accidentally created 4 pods; turned that into the actual experiment by
| mapping each model to a GPU sized to its cost: |
| |
| C (InfoNCE, smallest) -> RTX A5000 community $0.16/h (1mc23jk89rf98v) |
| A (ECG-only) -> RTX A5000 secure $0.27/h (xr4s6q5fhpsave) |
| B (cross-modal Δt=0) -> A40 $0.44/h (hwa3i4i569fwwl)
| F (PhysioJEPA Δt>0, biggest) -> RTX A6000 $0.49/h (5umn3qjlrlmp4u)
| |
| Burn rate: $1.36/h. At ~24h-to-K2 worst case = ~$33. Within budget. |
| |
| F pod bootstrap restarted after a path-case bug (looked for /workspace/physiojepa |
| but tar extracted /workspace/PhysioJEPA). Fixed pod_bootstrap.sh to detect either. |
| Forced tarball rebuild. |
| |
| Bootstrap timing on F pod (RTX A6000): |
| - uv install + dep sync: ~3 min (torch 2.11, wandb, scipy, neurokit2, datasets, etc.) |
| - HF MIMIC download (1237 files / ~1.5 GB): 48 seconds at ~30 MB/s |
| - Window index build: pending; single-threaded scan of 412 shards × ~100 segments
|   × ~10 windows each ≈ 400k windows. This is the bottleneck.
| |
| Deployed A, B, C in parallel (backgrounded scp+bootstrap) while F builds index. |
| |
| Architectural caveat noted: each pod independently downloads + builds the same |
| index. Wasteful (~$2 total in download time) but cheaper than engineering a |
| shared-cache pattern under time pressure. Logging for next iteration. |
| |
| |
| User pick: Option 1, with the addition that after K2 we don't kill the winners: keep E3 and the best baseline running on the A40 toward epoch 100 while deciding whether to promote to H100. Cost of leaving an A40 running ≪ cost of cold-booting an H100. Locking that into the plan.
| |
| ## 2026-04-14 — Harness built + smoke-tested + budget reality check
| |
| **What's done**: |
| - Full training harness committed: `src/physiojepa/{vit,dt_embed,ema,masking,data,monitor,probe,ptbxl,models,trainer}.py`. |
| - Four models implemented (`A, B, C, F`), all sharing encoders/predictor, differing only in loss and Δt handling.
| - Shared config: `configs/base.yaml`. CLI: `scripts/train.py`, `scripts/prepare_data.py`, `scripts/smoke_test.py`. |
| - **Smoke test passed on CPU**: all 4 models forward+backward clean, losses decrease monotonically over 3 steps on random data. Baseline C starts at ln(B)=1.386 as expected for untrained InfoNCE. |
| - RunPod CLI functional, $50.05 balance, no pods running. |
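On the ln(B) figure: with uniform (uninformative) logits the softmax puts 1/B on the positive pair, so untrained InfoNCE starts at -ln(1/B) = ln(B), and 1.386 ≈ ln(4). A self-contained check (illustrative, not the repo's loss code):

```python
import math

def uniform_infonce_loss(batch_size: int) -> float:
    # Cross-entropy over batch_size candidates with uniform logits:
    # softmax assigns 1/B to the positive, so loss = -ln(1/B) = ln(B).
    return -math.log(1.0 / batch_size)

print(round(uniform_infonce_loss(4), 3))  # -> 1.386
```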
| |
| **Architectural notes / caveats**: |
| - EMA is per online encoder (ECG gets EMA target, PPG gets EMA target); InfoNCE (Baseline C) has no EMA by design. |
| - Self-prediction loop is per-sample (variable mask lengths). Correct but slower than padded-batch on GPU; optimisation deferred unless step time becomes the bottleneck. |
| - Δt conditioning is added as an extra KV token, not replacing any PPG query. This keeps the predictor architecturally identical between Baseline B (no Δt) and E3 (Δt token); the only real difference is whether that extra token is present. **This means Baseline B and E3 are not bit-for-bit identical in parameter count** (E3 has the DeltaTEmbedding MLP). Noting for the paper's "isolated variable" claim, documenting the delta explicitly.
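For the record, the per-encoder EMA target update is the standard momentum form; a sketch (the momentum `m=0.996` is illustrative, not the configured value):

```python
import torch

@torch.no_grad()
def ema_update(online: torch.nn.Module, target: torch.nn.Module, m: float = 0.996) -> None:
    # target <- m * target + (1 - m) * online, applied parameter-wise.
    # Called once per step for EACH online encoder (ECG and PPG separately);
    # the InfoNCE baseline C skips this entirely.
    for p_o, p_t in zip(online.parameters(), target.parameters()):
        p_t.mul_(m).add_(p_o, alpha=1.0 - m)
```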
|
|
| **Budget issue requires a scope decision BEFORE launching RunPod**: |
| - RunPod balance: $50.05. Spend limit: $80. |
| - Research doc's "~$500 on H100" assumed sequential runs, not 4× parallel. Parallel 4× 100-epoch on H100 ($3-4/h) for ~48h = ~$600-$800. Over limit.
| - Even on RTX 3090 ($0.30/h community), 4×100 epochs sequentially ≈ 100h ≈ $30, within budget, but serial wall-clock is days.
| - The K2 verdict lands at **epoch 25** per the matrix's C5 checkpoint. Paper-existence is decided at epoch 25, not 100. Running to 100 is polish, not decision. |
|
|
| **Plan revision (to be confirmed with user)**: |
| 1. Start 4× parallel on A40 (cheap, ~$0.35/h on community cloud). ~25 epochs to the K2 checkpoint.
| 2. Epoch 25 = gate. If K2 passes (E3 > Baseline B by ≥0.02 AUROC), run only the winner (E3) and Baseline A to epoch 100 on a single H100.
| 3. If K2 fails at epoch 25, stop, write up negative result, preserve budget. |
|
|
| Total expected spend under this plan: ~$15-25 for the K2 decision, another $30 for final runs = ~$50. Fits budget.
|
|
| **Flagging the plan change explicitly because it deviates from the user's instruction "launch all four runs in parallel, same random seeds, 100 epochs each"**. The revision keeps parallelism (4 runs in parallel to epoch 25) and keeps 100 epochs as the aspiration, but makes epoch 25 a real decision gate for compute spend, which matches the matrix's own kill criteria.
|
|
| --- |
|
|
| ## 2026-04-14 — E2/E3 kickoff
|
|
| **Scope**: build the shared harness, implement four models (Baseline A/B/C + E3 PhysioJEPA), CPU single-batch test, then launch 4× parallel H100 training on RunPod.
|
|
| **Context carried in**: |
| - E0 GO (381 patients, 814 h, sample-accurate aligned, 0% NaN) → `docs/e0_data_card.md`
| - E1 raw patches locked for v1 → `docs/e1_decision.md`
| - AF labels = PTB-XL (transfer claim) → `docs/af_label_decision.md`
| - v1 arch: single-lead II ECG @ 250 Hz, PPG @ 125 Hz, 200 ms patches → `RESEARCH_DEVELOPMENT.md` §2
|
|
| **Plan**: |
| 1. Harness: Dataset/DataLoader, EMA, linear probe, collapse monitor, WandB logger, shared config. |
| 2. Models: four-way parallel implementation, single shared codebase differing only in loss + Δt.
| 3. RunPod: no skill installed → will use the REST API via `RUNPOD_API_KEY`.
| 4. Single-batch CPU test before any GPU run. |
|
|
| Entries below will capture every decision, failure, and caveat. |
|
|