PhysioJEPA / docs /RESEARCH_LOG.md

Upload folder using huggingface_hub

31e2456 verified 25 days ago

43.3 kB

	# PhysioJEPA research log
	Running narrative — newest entries at top.

	Format: each entry is `## YYYY-MM-DD HH:MM — [PHASE] — topic` followed by bullet list of what was done, what was found, and any decisions/caveats.

	---

	## 2026-04-16 09:35 — definitive run: all 3 pods bootstrapping

	All 3 definitive-run pods deployed:

	F: H100 PCIe secure ($2.39/h) @ 216.81.245.97:18654 — still in index build
	A: A100 SXM comm ($1.39/h) @ 216.249.100.66:20011 — in precompute (454k windows)
	B: A100 SXM secure ($1.49/h) @ 154.54.102.26:17999 — just started pip install

	Config: 100 epochs, full data (subset_frac=1.0 via fast_cache_dir mmap),
	mask_ratio=0.75, batch_size=64, seed=42, num_workers=12.

	Aggregate: $5.27/h. Balance: $118.90. At 20h projected = $105.

	Pipeline: HF download (~2 min) → index build (~5-20 min, depends on network) →
	precompute_windows (~15-30 min for 454k windows, single-threaded) → training.

	A is furthest along (precompute started). F is behind (slower download).
	B just started. First [step 0] expected in ~30 min from A.

	## 2026-04-16 04:40 — full-scale run scoping: need data pipeline optimization first

	User requested 3× H100, full data, 100 epochs, mask=0.75. Budget check:
	- Balance: $118.90. H100 PCIe community: $1.99/h × 3 = $5.97/h.
	- Steps: ~6160/epoch × 100 = 616k per run.
	- sec/step on A40 was 2.8 (production) vs 0.58 (benchmark). Even on H100
	with faster CPU, realistic production sec/step is ~1.0-1.5.
	- At 1.2 sec/step: 616k × 1.2 / 3600 = 205h per run × 3 runs × $2/h = $1230. WAY over budget.

	Root cause: __getitem__ calls load_from_disk per-shard + bandpass + zscore
	per window at runtime. This dominates training time by 5× over GPU forward.

	Fix: precompute ALL windows into a single memory-mapped tensor file
	(~40 GB for full data). __getitem__ becomes a single mmap read (~0.1ms).
	sec/step drops to ~0.3, bringing total runtime to ~51h across 3 A100 runs
	= ~$100. Fits budget.

	Building the precompute script now.

	## 2026-04-16 04:25 — FINAL: abl3 ep25 = 0.848, all pods killed

	abl3 (mask=0.75, unimodal A) epoch 25 AUROC = 0.848.

	Complete results table:

	\| Model \| mask \| L_self peak \| ep5 \| ep10 \| ep15 \| ep20 \| ep25 \|
	\|------------------\|------\|-------------\|-------\|-------\|-------\|-------\|-------\|
	\| original A \| 0.50 \| 0.476 \| 0.783 \| 0.736 \| — \| — \| 0.703 \|
	\| abl1 (pd=1) \| 0.50 \| 0.438 \| — \| — \| 0.749 \| — \| — \|
	\| abl2 (sin-q) \| 0.50 \| 0.559 \| — \| — \| 0.784 \| — \| — \|
	\| abl3 (m=75) \| 0.75 \| 0.200 \| — \| — \| 0.838 \| 0.845 \| 0.848 \|
	\| abl4 (full data) \| 0.50 \| 0.587+ \| — \| — \| — \| — \| (killed; spike confirmed) \|
	\| B (Δt=0) \| — \| — \| 0.660 \| 0.844 \| — \| — \| 0.847 \|
	\| F (Δt>0) \| — \| — \| 0.652 \| 0.859 \| — \| — \| 0.835 \|

	abl3 (0.848) ≈ B (0.847). Unimodal JEPA with 75% masking exactly
	matches cross-modal JEPA. The mechanism story is complete.

	abl4 (full data, 50% mask) showed L_self spike peaking at 0.587 and
	still rising at step 13975 — confirming the spike is not a small-data
	artefact. Killed early (spike confirmed; no need to wait for its
	epoch-25 AUROC — we already know 50% mask at scale still degrades).

	All pods killed. Zero stale compute. Total ablation spend: ~$4.50.

	## 2026-04-16 03:10 — AUROC confirms mechanism end-to-end

	Epoch-15 AUROC on PTB-XL AF:

	\| variant \| L_self peak \| AUROC @ ep15 \|
	\|-----------------\|-------------\|--------------\|
	\| original A \| 0.476 \| 0.736 \|
	\| abl1 (pd=1) \| 0.438 \| 0.749 \|
	\| abl2 (sin-q) \| 0.559 \| 0.784 \|
	\| abl3 (m=75) \| 0.196 \| 0.838 \|
	\| (ref) B ep10 \| — \| 0.844 \|
	\| (ref) F ep10 \| — \| 0.859 \|

	abl3 matches B/F's AUROC at epoch 15. Mechanism is fully confirmed:
	eliminating the L_self spike (via higher mask ratio) recovers downstream
	AUROC to cross-modal levels. Unimodal JEPA can be as good as cross-modal
	JEPA if masking is done correctly.

	Subtle finding from abl2: sinusoidal query has a LARGER L_self spike
	(0.559 vs orig 0.476) but HIGHER AUROC (0.784 vs 0.736). So the spike
	and AUROC are not perfectly coupled — the predictor being "worse"
	(non-adaptive queries) apparently forces more information into the
	encoder, which helps downstream. Noting as an interesting secondary
	finding, but abl3 is the main story.

	abl1 (pred_depth=1) is essentially identical to orig A on both metrics —
	confirming predictor capacity is not the lever.

	### Paper now has a clean, precise story

	1. Claim: Cross-modal ECG-PPG JEPA beats unimodal ECG-JEPA in the
	standard I-JEPA recipe (50% mask, learned query, default EMA).
	2. Mechanism: at 50% mask the predictor finds a local-interpolation
	shortcut (25 visible context ↔ 25 target contiguous blocks → linear
	blend of adjacent patches works). Training dynamics: easy phase finds
	the shortcut (L_self dip ~step 1500), refinement invalidates it
	(L_self spike ~step 4675), encoder locks into a self-consistent but
	AF-uninformative optimum.
	3. Fixes: (a) mask ratio 0.75 denies the shortcut structurally — abl3
	matches cross-modal AUROC. (b) Cross-modal prediction is the same
	mechanism — 0% PPG visible context → no interpolation path — F and B
	both stable.
	4. Δt direction doesn't matter (K2 fail is a negative result that
	supports the mechanism: the Δt token is a tiny perturbation of the
	predictor's query set; what matters is whether interpolation is
	available, not where the targets sit on the time axis).

	Actionable recommendation: ECG-JEPA (Weimann & Conrad) used 50% masking.
	75% masking is a likely-free improvement, testable on PTB-XL directly.

	### Status

	- abl1 + abl2 pods killed. Answered their questions.
	- abl3 running to epoch 25 for the final number. ~1 h left at $0.44/h.
	- abl4 (full data) at step 9975 with L_self=0.54 — **spike IS present
	at full data**, just delayed. More data slows shortcut discovery but
	doesn't eliminate it. Confirms mask ratio is the architectural fix,
	not a small-data artifact.
	- abl4 still has ~20h to go. Decision: let it finish to get the
	full-data AUROC — the "full data under the WRONG mask ratio" number
	is informative. At $0.44/h × 20h = $8.80. Still well under budget.

	## 2026-04-16 02:05 — mask_ratio IS the lever (spike window confirmed)

	Full matrix at the critical spike window (original A peaks L_self=0.476 at step 4675):

	step \| orig A \| abl1 (pd=1) \| abl2 (sin-q) \| abl3 (m=75) \| abl4 (full)
	------+--------+-------------+--------------+-----------------+------------
	1475 \| 0.220 \| 0.222 \| 0.329 \| 0.146 \| 0.296
	2475 \| 0.340 \| 0.339 \| 0.482 \| 0.165 \| 0.233
	3475 \| 0.442 \| 0.420 \| 0.555 \| 0.186 \| 0.208
	4475 \| 0.476 \| 0.438 \| 0.559 \| 0.196 \| 0.260
	4975 \| 0.475 \| 0.398 \| 0.551 \| 0.200 \| 0.287
	5475 \| — \| 0.334 \| 0.512 \| — \| 0.313

	abl3 (mask 0.75) has NO spike. L_self rises monotonically from 0.146
	(step 1475) to 0.200 (step 4975) — a gentle climb of +0.05 over 3500 steps,
	vs orig A's explosive +0.26 peak.

	abl1 (pred_depth=1) tracks orig A. Predictor capacity is not the lever.

	abl2 (sinusoidal queries) has a LARGER spike than orig A (0.559 peak vs
	0.476). Removing the adaptive query hurts — the predictor can't route
	context tokens to targets it cares about.

	abl4 (full data) shows a muted spike (0.208 → 0.313 over 2000 steps).
	10× data slows shortcut discovery but doesn't eliminate it. Suggests scale
	helps but mask_ratio is the cleaner fix.

	### Revised mechanism — unified story

	50% masking gives the predictor 25 target patches and 25 visible context
	patches arranged in contiguous blocks. Early training, the predictor
	learns a short-range interpolation shortcut: predict masked patch `p` as
	a linear blend of adjacent visible patches. This gives a low L_self quickly
	(dip at step 1500). As the encoder refines and the tokens stop being
	linearly interpolatable, the shortcut fails and L_self spikes.

	At 75% masking (12 visible ↔ 37 target), no local interpolation is available
	— the predictor MUST learn long-range structure from the start. No dip,
	no rebound.

	Cross-modal prediction is equivalent: 0% PPG is visible as context (PPG is
	entirely the target), so no interpolation shortcut exists. F and B dodge
	the spike by the same mechanism as abl3.

	Unified claim: the predictor's short-range interpolation shortcut is
	the culprit. Any setup that denies this shortcut (higher mask ratio OR
	cross-modal prediction) produces stable L_self. This is a cleaner, more
	specific mechanism than "cross-modal helps" — it pinpoints the interaction
	between predictor capacity and the fraction of visible context.

	### Next test: AUROC recovery

	Does abl3's no-spike training actually produce better AF representations?
	Kicked off PTB-XL fetch on abl3 pod in parallel with training. Will probe
	all 4 ablation ckpts once training completes (~2-3 h).

	Prediction: if the mechanism story is correct,
	abl3 AUROC @ ep25 > orig A's 0.703, should approach F/B's 0.83-0.85.

	## 2026-04-16 01:15 — ablation early signal: abl3 (mask 75%) breaks the pattern

	L_self side-by-side at matched steps (only the key ones):

	step \| orig A \| abl1(pd=1) \| abl2(sin-q) \| abl3(m=75) \| abl4(full)
	------+--------+------------+-------------+------------+-----------
	975 \| 0.247 \| 0.248 \| 0.267 \| 0.197 \| 0.390
	1475 \| 0.220 \| 0.223 \| 0.292 \| 0.144 \| 0.285 (interp)
	1775 \| 0.243 \| 0.255 \| 0.371 \| 0.148 \| 0.269
	1975 \| 0.256 \| 0.269 \| 0.403 \| — \| 0.254
	2175 \| 0.283 \| 0.297 \| 0.447 \| — \| 0.230 (interp)

	abl3 (mask 0.75) is markedly different. L_self at step 1775 is 0.148,
	lower than original A's minimum of 0.220. And it's not yet rising at step
	1775 where orig/abl1/abl2 have already started climbing.

	abl1 (pred_depth=1) ≈ orig A. The predictor size was not the driver.

	abl2 (sinusoidal query) is WORSE than orig A. By step 1775 it's at 0.371
	vs orig A at 0.243. Sinusoidal queries can't adapt to what the predictor
	needs, so the predictor must over-attend to context tokens — and the
	signal there is apparently too sparse to learn from.

	abl4 (full data) is descending monotonically at step 1975 (L_self=0.254).
	Too early to say if it avoids the spike — original A's spike was at step 4675.
	Full data is ~10× slower per logical training "epoch" so the spike location
	in wall-clock terms shifts late. Continue monitoring.

	Revised mechanism hypothesis: unimodal JEPA at mask_ratio=0.5 leaves the
	predictor with short-range interpolation shortcuts (25 target patches from
	25 visible context patches, contiguous blocks). Early training finds these
	shortcuts (L_self dips at step 1500). As the encoder refines and
	invalidates the shortcuts, L_self rises. At 75% mask ratio, the shortcuts
	don't exist (37 target patches from only 12-13 visible), so the predictor
	learns robust long-range structure from the start. No dip-and-rebound.

	This is mechanism-specific, falsifiable, and explains both:
	(a) why F/B didn't drift (cross-modal loss provides a diverse, non-local
	target that can't be locally interpolated)
	(b) why abl3 fixed it in unimodal A (higher masking also eliminates the
	local shortcut)

	Now the critical follow-up: does abl3's epoch-25 AUROC match F/B (~0.84)?
	That would complete the mechanism-to-downstream story.

	Cost check: 4×A40×$0.44 × ~45 min = ~$1.32 so far. abl1/2/3 ~3.5 h to go
	(~$5). abl4 ~30 h to go (~$13). Total ~$20 for the suite. Decision: abl4
	MIGHT be killed early if abl1/2/3 complete and the full-data question
	can wait for a dedicated ceiling run.

	## 2026-04-16 00:30 — 4 parallel A ablations launched on A40 secure pods

	To find the real mechanism behind A's degradation, running 4 ablations
	in parallel. Each identical to original A except one variable.

	abl1: pred_depth 4 → 1 (pod 0n8im5mri5hjk0, 69.30.85.78:22121)
	abl2: query_mode learned → sinusoidal (pod a2pye2ki7uvw47, 194.68.245.208:22053)
	abl3: mask_ratio 0.5 → 0.75 (pod jwwln4klav8674, 194.68.245.207:22198)
	abl4: subset_frac 0.10 → 1.00 (pod 4pvp7yb1rmbxta, 194.68.245.207:22197)

	All on A40 secure ($0.44/h × 4 = $1.76/h aggregate). 25 epochs each.
	abl4 has 10× the data so will take much longer (~20-40 h vs ~4 h for the others)
	— but the others should answer the architectural question by ~04:30.

	Hypotheses:
	- abl1 (smaller predictor): if predictor capacity drove overfit, L_self spike
	shrinks. AUROC may improve.
	- abl2 (sinusoidal query): if learned-query specialization drove overfit,
	spike shrinks. AUROC may improve.
	- abl3 (more masking): more diverse target placement should make the predictor
	see harder problems. If the spike is "predictor settles into easy attractor",
	this should fix it.
	- abl4 (full data): if 10% subset was the culprit, spike disappears at scale.
	If still present, it's an architectural issue independent of data scale.

	Spike location to compare against: original A had L_self spike peaking 0.475
	at step 4675 (when τ=0.9999).

	## 2026-04-15 21:59 — slow-τ A ablation RESULT: hypothesis FALSIFIED, pod killed

	Side-by-side L_self at matched steps:

	step \| orig A \| slow-τ A \| orig τ \| slow τ
	------+--------+----------+--------+--------
	1475 \| 0.22 \| 0.22 \| 0.9969 \| 0.9962
	1975 \| 0.26 \| 0.28 \| 0.9974 \| 0.9963
	2975 \| 0.40 \| 0.49 \| 0.9988 \| 0.9967
	3975 \| 0.45 \| 0.60 \| 0.9997 \| 0.9972
	4975 \| 0.47 \| 0.60 \| 0.9999 \| 0.9977
	5475 \| 0.46 \| 0.55 \| 0.9999 \| 0.9979

	Slow-τ A's L_self rose MORE than original A's, not less, despite τ being
	well below saturation through the critical window. The "τ saturation
	amplifies the L_self spike" hypothesis is falsified.

	The L_self rise must be driven by something else. Top candidates:
	1. Masking strategy (multi-block 50% ratio) + small data regime — the
	predictor overfits to easy target patches early (dip at step 1500),
	then the distribution of hard targets dominates as the encoder refines.
	2. Query-embedding parameter specialization — the learnable query tokens
	narrow predictive scope, and random target placement starts hitting
	targets they can't handle.
	3. Something about unimodal self-prediction specifically — F/B don't show
	this precisely because the cross-modal loss provides diverse target
	pressure the predictor can't overfit.

	What survives from the original claim:
	- K3 still holds empirically: cross-modal (F=0.835, B=0.847) >> unimodal
	(A=0.703) at epoch 25.
	- The mechanism story needs replacing. "Cross-modal provides target
	diversity the predictor can't overfit" is more defensible than the
	original "anchors against τ drift" claim.

	Pod y27osaqv7amz7d killed. Ablation cost: ~$0.35 for ~2 h on A5000 community.

	Impact on user's plan:
	- Conditional was: if spike disappears → full-data B run. Spike did not
	disappear. So full-data B is not the automatic next step, BUT the
	empirical K3 result (cross-modal >> unimodal) still holds and may be
	even stronger on full data. Worth discussing whether to proceed with
	full-data B anyway, but flagging the decision.

	## 2026-04-15 21:19 — slow-τ A ablation training (early signal: L_self rising even pre-τ-saturation)

	Slow-τ A early trajectory (log_every=25):
	step 0: L_self = 1.167 (random init)
	step 475: L_self = 0.390
	step 975: L_self = 0.247
	step 1475: L_self = 0.223 ← minimum
	step 1975: L_self = 0.282
	step 2175: L_self = 0.313 ← rising, tau still only 0.9963

	Original A at comparable steps (before any spike):
	step 500: L_self = 0.380
	step 1000: L_self = 0.247
	step 1500: L_self = 0.220 ← minimum
	step 2000: L_self = 0.258
	step 2225: L_self = 0.283

	Slow-τ A is tracking original A essentially step-for-step so far. Both hit
	their minimum ~step 1500, both starting to rise by step 2000. **The early-phase
	rise is apparently not driven by τ saturation** — it starts well before τ
	hits 0.999.

	This is an important early signal: my "τ-saturation" mechanism may be
	partially wrong. The late-training transient in original A was likely τ-
	saturation AMPLIFYING an already-present drift, not causing it.

	Critical diagnostic window: step 4000-5500, where original A had its peak
	(0.48 at step 4675). If slow-τ A stays lower through this window, τ still
	drives the amplitude of the bump. If slow-τ A also spikes at step 4675,
	τ is not the driver.

	## 2026-04-15 20:20 — slow-τ A ablation launched

	Ablation pod: y27osaqv7amz7d (RTX A5000 community, FR). Config:
	ema_end = 0.999 (vs 0.9999 in original)
	ema_warmup_frac = 0.60 (vs 0.30 in original)
	everything else identical: subset_frac=0.10, bs=64, 25 epochs, seed=42

	Prediction:
	- If A spike at step 4675 disappears + AUROC recovers to ~0.84 → τ-saturation
	mechanism is confirmed, cross-modal anchor story holds.
	- If spike disappears BUT AUROC stays at ~0.70 → the original A's problem
	wasn't τ saturation per se; the unimodal objective just doesn't contain
	enough AF-discriminative signal at this data scale.
	- If spike still present → τ schedule isn't the lever; something deeper.

	Conditional on spike disappearing + AUROC recovering, next step is the
	full-data B run (100 epochs, H100, 814h) — the ceiling measurement.

	## 2026-04-15 20:00 — refined mechanism for A degradation (not monotonic drift)

	After pulling full WandB curves, correcting my earlier "A drifts monotonically"
	claim. A actually has:

	- L_self minimum at step 1500 (value 0.22)
	- τ-saturation TRANSIENT at step 4675 (value 0.475) — 3× the bump F/B show
	- recovery by step 7400 (value 0.20)
	- late-training slow climb to 0.20 at step 15350

	F and B also show late-training L_self rise (0.15 → 0.27). Only the
	mid-training transient is unique to A.

	Key finding: A's loss recovers but AUROC doesn't. AUROC dropped from
	0.783 (ep5) → 0.703 (ep25) even though final L_self is comparable to F/B.
	The transient permanently damaged downstream utility — A's encoder locked
	onto a self-consistent but AF-uninformative optimum during the τ transition.

	Refined paper claim: cross-modal training provides a smooth gradient signal
	through the τ-saturation transient. Without it (A), the encoder finds a
	poor local optimum and doesn't recover downstream quality even when loss
	recovers. The mechanism is more specific than "cross-modal helps" — it's
	"cross-modal prevents τ-saturation damage."

	## 2026-04-15 19:30 — FULL K-gate results: K2 FAIL, K3 PASS

	All 4 pods ran to epoch 25. Full probe matrix on PTB-XL AF:

	\| Model \| ep5 \| ep10 \| ep25 \|
	\|-------\|-----\|------\|------\|
	\| F (Δt>0) \| 0.6521 \| 0.8586 \| 0.8352 \|
	\| B (Δt=0) \| 0.6599 \| 0.8440 \| 0.8467 \|
	\| A (uni) \| 0.7832 \| 0.7357 \| 0.7025 \|
	\| C (InfoNCE) \| stuck at ~loss 3.0 — under-tuned baseline, not usable \|

	K2 FAIL: F − B = −0.012 at epoch 25 (target was ≥ +0.02).
	K3 PASS BIG: F − A = +0.133 at epoch 25, and A is DEGRADING.

	Written up in `docs/e2_e3_results.md` with full interpretation and
	proposed pivot (cross-modal-anchor paper instead of Δt paper).

	Spend total: ~$6.14 across 4 pods × ~4.5 h. Vastly under budget.

	Pods still have ckpt_final.pt but training is done. Ready to terminate.

	## 2026-04-15 11:55 — FIRST AUROC: F at epoch 10 = 0.859

	F (PhysioJEPA, Δt>0) AUROC on PTB-XL AF detection:
	epoch 5 (step ~3200): 0.652
	epoch 10 (step ~6400): 0.859 ← latest

	The jump 0.65 → 0.86 in 5 epochs tells us F is rapidly absorbing AF-relevant
	features. Trajectory still climbing — we'd expect further gains by epoch 25.

	Framing correction (user call-out): "approaching Weimann 0.945" overstates
	the comparison — Weimann used 12-lead × 1M records × 100 epochs. F is
	single-lead II × 40k windows × 10 epochs. What matters is the trajectory,
	not the ceiling.

	The probe pipeline had one race condition: probe_when_ready.sh saw the
	ptbxl_af.npz file appear at ~50% (np.savez_compressed wrote non-atomically),
	fired eval_checkpoint.py which tried to unzip an incomplete file — BadZipFile.
	Ran the probe manually once the write finished. Retro fix to
	probe_when_ready.sh would be `[ -f foo ] && file foo \| grep -q Zip` but
	we're past it now.

	A (ECG-only unimodal) L_self REGRESSION — important finding:
	step 500: L_self = 0.380
	step 1000: L_self = 0.247 ← minimum
	step 1500: L_self = 0.220 ← actual minimum
	step 2500: L_self = 0.331
	step 3500: L_self = 0.442
	step 4500: L_self = 0.477 ← now
	step 5000: L_self = 0.472 (tau = 0.9999)

	A is DRIFTING — L_self doubled from 0.22 to 0.47 as EMA τ saturated near 1.0.
	Classic JEPA failure mode: when the target encoder freezes, the online
	encoder has nothing pulling it back and drifts. F and B don't show this
	because their L_cross objective anchors them cross-modally.

	Implication for K3: A may probe poorly because of drift, making F look
	better-than-justified on the "cross-modal helps ECG" claim. Need to note
	this as a limitation in the paper. The honest fix would be a smaller
	final-τ (say 0.999 instead of 0.9999) for A specifically, but we'll note
	and move on for now.

	C (InfoNCE) is NOW LEARNING after the τ fix + passing LR warmup:
	step 0: loss = 4.168 (random)
	step 100: 4.159 (still random)
	step 500: ~3.8 (starting to move)
	step 800: 2.90 ← first clear signal
	step 825: 2.98
	Slow but real. InfoNCE with batch 64 is known-weak (CLIP uses 32k). Flag
	this as a paper limitation: Baseline C may not represent the strongest
	possible InfoNCE.

	State (12:05):
	F: step 7400, L_cross=0.247 (still dropping), epoch-10 ckpt probed → 0.859
	B: step 2250, L_cross=0.401, no ckpt yet (epoch 5 ~ step 3200)
	A: step 4600, L_self=0.464, ckpt_epoch005.pt available
	C: step 825, loss=2.98, climbing out of random

	Now running: PTB-XL fetch_v3 on A, B, C pods in parallel (~10 min).
	Will probe A's ckpt_epoch005.pt the moment npz lands on A pod.

	## 2026-04-15 11:46 — F broke through "0.40 floor" → 0.33; C still stuck (LR warmup)

	F at step 4750: L_cross = 0.327. The earlier "asymptote at 0.40" call
	was wrong twice over — model continued to descend. Trajectory:

	step 1100: 0.419
	step 2150: 0.400
	step 2950: 0.377
	step 4225: 0.384 (oscillating in 0.38-0.40)
	step 4700: 0.374
	step 4750: 0.327 ← clear break-through

	Possible explanation: τ schedule (0.996→0.9999) has nearly completed
	(τ=0.9999 at step 4700+). Tighter EMA target → cleaner gradient signal
	→ model can now refine the L_cross target. This is consistent with
	the published JEPA training dynamics.

	C: still stuck at loss ≈ 4.16 even with fixed τ init. Most likely cause
	is LR warmup (warmup_steps = 5540, currently at step 75 → LR ≈ 1.4e-6).
	Needs another ~500 steps to exit ramp. Will revisit at next check.

	B step 1175: L_cross = 0.459 — slope -0.04 / 100 steps.
	A step 2250: L_self = 0.297.
	PTB-XL fetch: 39%, ETA 24 min.
	Probe waiter: still polling.

	## 2026-04-15 11:30 — F's epoch-5 ckpt landed; B looks competitive; C broken (init bug)

	State:
	- F: step 4225, L_cross=0.384, L_self=0.139, ckpt_epoch005.pt saved.
	- B: step 1000, L_cross=0.499, L_self=0.339 — dropping smoothly.
	- A: step 1850, L_self=0.238 — fast convergence on unimodal task.
	- C: step 225, loss=4.07 (random baseline = ln(64) = 4.158). Bug.

	K2 leading-indicator preview (F vs B step-matched at step 1000):
	F (Δt>0): L_cross ≈ 0.43 (interpolated)
	B (Δt=0): L_cross = 0.499
	Gap = 0.07 — F leads, but B is dropping faster currently.
	K2 jury still out — need B at step 3000+ to see asymptote.

	C bug: init `log_tau = 0` makes the logit-temperature multiplier = 1.0,
	i.e. physical τ = 1.0 (very soft InfoNCE). Standard τ = 0.07 means
	multiplier ≈ 14. Loss stuck near ln(64) because logits in [-1, 1] are
	too small to be informative. Fix: init `log_tau = log(14)`. Will redeploy
	C after F's probe AUROC lands.

	PTB-XL fetch: at 25% download (15k of 43k files via concurrent HTTP).
	ETA ~30 min until npz exists. Probe waiter still polling.

	## 2026-04-15 11:14 — auto-probe armed; PTB-XL switched to LR variant

	User correctly called out two things:
	1. F's L_cross is not at a hard floor — still descending slowly
	(0.001-0.005 per 25 steps). Logged.
	2. Don't interrupt training. Wait for the natural epoch-5 ckpt.

	Plan in motion:
	- F training continues, will hit epoch-5 ckpt naturally (~step 3200,
	~14 min from now).
	- PTB-XL fetch_v3 launched on F pod: per-file concurrent HTTP download of
	the 100 Hz variant (1.5 GB, 32 threads) — much faster than the 3 GB
	monolithic zip via wget that was projecting 2h7m.
	- probe_when_ready.sh waiter armed on F pod: polls run_dir for *.pt and
	ptbxl_af.npz, fires eval_checkpoint.py the moment both exist.
	- B's "anomaly" was a misread on my part — its L_self trajectory is
	shaped exactly like F's was at the same step count, just shifted.

	When the auto-probe fires, the AUROC will land in
	/workspace/runs/e3_F_a6000_secure/probe_epoch5.json.

	## 2026-04-15 11:08 — correction: F's L_cross is STILL descending, not at hard floor

	Earlier read of "L_cross asymptote at ~0.40" was premature. Looking at the
	actual trajectory more carefully:

	step 1100: 0.419
	step 2150: 0.400
	step 2300: 0.392
	step 2750: 0.399
	step 2900: 0.395
	step 2950: 0.377 ← still dropping
	step 2975: 0.389 ← oscillating in the 0.38-0.40 band

	The model is in a slow-descent regime (~0.001 per 25 steps when measured
	over a 100-step window). Not flat. Honest summary: F is near its
	asymptote but hasn't fully reached it. The 0.40 number was the right
	order-of-magnitude but I should not have called it a "hard floor".

	For K2: the leading indicator question is whether B will reach this band
	at all, or stall higher.

	B health check (was flagged as anomalous):
	step 100: L_cross=0.841 L_self=0.997
	step 250: L_cross=0.602 L_self=0.859
	step 525: L_cross=0.588 L_self=0.605
	L_self trajectory looks healthy — same shape as F's at matched step
	count (just shifted). No EMA misconfig evident. The earlier suspicion
	was an over-read.

	A (unimodal, K3 reference):
	step 925: L_self=0.256 (already lower than F's L_self trajectory at
	the same step count). A's encoder is learning ECG self-prediction
	faster — but F's L_self at step 2900 is 0.144, lower still. K3
	comparison needs A to reach step 2900+ for a fair shot.

	Probe plan: wait for F's natural epoch-5 ckpt (~14 min from now =
	~step 3200). Then linear probe vs PTB-XL AF.

	PTB-XL fetch: wget download is at 71 MB / 3 GB at 200 KB/s — ETA 2h7m.
	Too slow. Need to cancel + use a different mirror.

	## 2026-04-15 10:58 — F at L_cross=0.40 plateau; B chasing; A unimodal also at ~0.42

	WandB runs (all live):
	F (PhysioJEPA): https://wandb.ai/guy-na8/physiojepa/runs/m0cdwa8a
	A (ECG-only): https://wandb.ai/guy-na8/physiojepa/runs/t9486rf9
	B (Δt=0): https://wandb.ai/guy-na8/physiojepa/runs/9gwflgr5
	C (InfoNCE): https://wandb.ai/guy-na8/physiojepa/runs/unfs8uzf

	Step-matched comparison at step 250 (both still in warmup):
	F (Δt>0): loss=0.864 L_cross=0.607 L_self=0.855
	B (Δt=0): loss=0.860 L_cross=0.602 L_self=0.859
	A (uni): loss=0.546 L_cross=0 L_self=0.546

	Identical Δt-vs-no-Δt at step 250 — confirming warmup phase predictions.

	F's L_cross trajectory (now at step 2325):
	step 1100: 0.419
	step 1500: 0.408 (interpolated)
	step 2150: 0.400 ← inflection
	step 2300: 0.392 (very slowly continuing to drop)
	step 2325: 0.401 (oscillating)

	F's L_cross has converged to ~0.40 ± 0.02. This is the asymptote.
	1200 steps of training without further drop. Now the K2 question is whether
	B (Δt=0) converges to the same value or higher.

	F's L_self (auxiliary) at step 2325 = 0.147; A's L_self at step 425 = 0.42.
	Comparing at step 425 only: A's L_self is 0.42, F's was ~0.55 at the same
	step count — A is decreasing faster early. Need to wait for A to catch up
	to step 2000+ for fair K3 comparison.

	PTB-XL: relaunched fetch with v2 script (wget full zip, mp.Pool 16 workers).
	Should complete in ~10 min vs the 2 h v1 was projecting.

	Total spend so far: ~80 min × $1.36/h ≈ $1.81. K2 ETA ~10 hours from now.

	## 2026-04-15 10:36 — A/B/C unblocked via index-copy from F; F at step 1125

	A/B/C had been stuck in `prepare_data.py` for 27 min — the network FS on
	A and B (mfs#runpod.net) makes the per-shard load_from_disk pathological.
	Killed prepare_data on all 3, scp'd F's already-built `mimic_index.json`
	(48 MB) to each, then launched training directly.

	Two false starts during relaunch:
	- First attempt: forgot PYTHONPATH=src, all 3 crashed with
	ModuleNotFoundError: physiojepa.
	- Second attempt: setsid stripped the env, C crashed again. Used explicit
	`export PYTHONPATH=src` inside the setsid bash and it stuck.

	All 4 now training. Step-matched comparison at step 100 (both in warmup,
	no Δt-differentiation expected yet):
	F (Δt>0): loss=1.135 L_cross=0.836 L_self=0.998
	B (Δt=0): loss=1.140 L_cross=0.841 L_self=0.997
	A (uni): loss=0.834 L_self=0.834

	Identical so far. Real K2 leading-indicator window is around L_cross ≈ 0.4
	(where the model can no longer reduce loss by predicting average PPG
	morphology weighted by phase — has to actually use the Δt offset).
	F currently at step 1125, L_cross=0.418 — entering that boundary now.

	PTB-XL fetch: killed. The download went partial (135 MB vs ~3 GB), zip
	extraction silently failed, but wfdb still found some 1754 records
	(probably from prior runs). Will set up via cleaner path before K2 eval.

	## 2026-04-15 10:22 — F at step 425, A/B/C still indexing (network FS)

	F (PhysioJEPA, A6000) at step 425, loss 1.46 → 0.72 (51% reduction):
	step 250: loss=0.864 L_cross=0.607 L_self=0.855
	step 350: loss=0.785 L_cross=0.595 L_self=0.636
	step 425: loss=0.717 L_cross=0.580 L_self=0.456

	L_self dropping faster than L_cross (the auxiliary objective is "easier"
	because target is the EMA of itself). L_cross plateauing in the 0.55-0.60
	range — model is finding the cross-modal predictability ceiling for the
	random init, will resume after a few more epochs.

	Steady speed: 275 steps in ~13 min ≈ 2.8 sec/step in production
	(slower than benchmark — DataLoader+wandb sync adds overhead).
	Projection: 14k steps × 2.8 s ≈ ~11 hours to epoch 25 on F.

	A/B/C status: still in prepare_data.py (5.5 min elapsed, expected ~5).
	Discovery: A and B use network-mounted /workspace (`mfs#...runpod.net`)
	because they're secure-cloud pods. C uses local SSD (community). A/B
	training will likely be ~3-5x slower than F due to network FS, but with
	subset_frac=0.10 the OS page cache should warm up after a few epochs.

	PTB-XL fetch kicked off in parallel on F pod (background nohup).
	Output to /workspace/cache/ptbxl_af.npz when done.

	Total spend so far: ~25 min × ~$1.36/h ≈ $0.57.
	Projected total: ~11 h × ~$1.36/h ≈ ~$15 to K2 verdict. WELL within budget.

	## 2026-04-15 10:14 — F TRAINING, loss decreasing cleanly

	F (PhysioJEPA, A6000):
	step 0: loss=1.458 L_cross=1.126 L_self=1.107
	step 25: loss=1.438 L_cross=1.108 L_self=1.100
	step 50: loss=1.369 L_cross=1.048 L_self=1.069
	step 75: loss=1.259 L_cross=0.949 L_self=1.036
	step100: loss=1.135 L_cross=0.836 L_self=0.998
	step125: loss=1.020 L_cross=0.732 L_self=0.961
	step150: loss=0.946 L_cross=0.664 L_self=0.940

	L_cross dropping 1.126 → 0.664 in 150 steps — strong learning signal.
	WandB run live at https://wandb.ai/guy-na8/physiojepa/runs/m0cdwa8a

	Wall-clock observed: 150 steps in ~5 min ≈ ~2 sec/step in
	production (worse than the inline benchmark's 0.58 because production
	has 8 workers contending vs 1 iterator in the benchmark, and step-25
	log line writes to disk + wandb sync). At 2 s/step:
	25 epochs × ~640 steps ≈ ~7 hours per pod on A6000-class
	4 pods × ~7 h × $1.36/h aggregate ≈ ~$10 to K2

	A/B/C still building index (~5 min sequential scan of 412 shards).
	Should start training within ~3 min.

	## 2026-04-15 10:10 — solved: it WAS training; Python stdout buffered through tee

	Inline benchmark on F (manual DataLoader iteration) revealed:
	- First batch: 3.5 s (worker startup, expected)
	- First step compute: 2.4 s (CUDA warmup, expected)
	- Steady-state: ~0.58 s/step on RTX A6000
	- Loss decreasing 1.24 → 1.04 over 5 iters

	Training was working all along. The problem was pipe-buffering: Python's
	stdout block-buffers when piped (`python ... \| tee ...`), so the
	`[step N]` print lines never flushed to the log file. Fixed with
	`python3 -u + PYTHONUNBUFFERED=1` in pod_bootstrap.sh. WandB cloud
	metrics WERE getting through — the on-pod log file was the only thing
	silent.

	Wall clock projection (with subset_frac=0.10, log_every=25):
	- F (A6000): 0.58 s/step × 25 epochs × ~640 steps/epoch ≈ 2.5 h
	- A (A5000): probably ~1.2× slower, ~3 h
	- B (A40): similar to A6000 (similar perf class), ~2.5 h
	- C (A5000): ~3 h
	- Total spend to K2: ~3 h × $1.36/h aggregate = ~$4

	All 4 pods redeployed with `-u`. Now WAIT for first [step] logs to confirm.

	## 2026-04-15 10:05 — even after PTT cut, F still CPU-bound; subset_frac=0.10

	After removing PTT compute, F still didn't produce [step 0] in 5+ min
	on RTX A6000. Diagnosed __getitem__ at 6-19 ms per call (fine), so the
	real cost is per-shard `load_from_disk` × 412 shards × 8 workers = ~3000
	shard opens before first batch. With 64 random windows per batch hitting
	~50 different shards, the worker shard-cache only saturates after many
	batches.

	Cut: subset_frac=0.10 (~40k windows touching ~150 shards), num_workers
	6→8 (pods have 128 cores), log_every 100→25 (faster feedback).

	Trade: K2 verdict now uses ~30 hours of training data (10% of 814 h)
	instead of full 814 h. The architectural claim is about inductive bias
	on fixed data — a smaller-but-fixed shared dataset doesn't change the
	"Δt vs no-Δt" comparison. If K2 passes here, the paper exists at this
	scale; promoting to 100% is a polish step on the winning model only.

	All 4 pods redeployed.

	## 2026-04-15 10:00 — F was CPU-bound on per-window PTT, redeployed all with fast __getitem__

	After CUDA fix, F started training but GPU stayed at 18-26% util — workers
	running Pan-Tompkins peak detection per window blocked the data path.
	~10 min into training and step 0 still hadn't logged.

	Cut: removed `_window_ptt_ms` call from `__getitem__`. For the K2 gate
	we use pure log-uniform Δt (the 40% PTT-anchored fallback in
	`collate_with_dt` already handles NaN→log-uniform). The K2 question is
	"does Δt>0 beat Δt=0?", not "does ground-truth-PTT-anchored Δt beat
	log-uniform Δt?" — the latter is a hyperparameter test deferred to
	ablation A5.

	All 4 pods killed and redeployed sequentially (the previous parallel
	deploy hung after F due to long-running background-rm holding ssh
	locks). Sequential scp+launch worked cleanly. F has cached download +
	index so should resume fast (~1 min to first step).

	Wasted spend: F's first 10 min on CPU-bound training ≈ $0.08. Acceptable.

	## 2026-04-15 09:55 — major fix: switch from uv venv to system python (CUDA mismatch)

	Worse problem found: F pod (RTX A6000, CUDA 12.4 driver) ran the trainer
	on CPU, not GPU. Diagnosis: uv resolved torch==2.11.0+cu130 from PyPI, which
	needs driver ≥555. The runpod image's system Python already has torch
	2.4.1+cu124 properly configured.

	Fix: bootstrap.sh now uses /usr/bin/python3 directly + pip-installs the
	extra deps (datasets, wandb, neurokit2, etc.) into system site-packages.
	Skips uv venv entirely on the pod. Verified torch 2.4.1+cu124 sees the
	A6000 with `torch.cuda.is_available() == True`.

	Killed all 4 pods' running procs and redeployed. F skips download (cache
	intact); A/B/C re-download.

	Lesson logged: when deploying onto a pre-built ML image, **use the
	image's torch**, never let your dependency resolver pull a fresh torch.
	The image vendor matched torch to driver for a reason.

	## 2026-04-15 09:45 — F crashed on first epoch, others mid-bootstrap

	F pod made it all the way through download + index build (~10 min) and
	started training, then PicklingError on the closure-based collate_fn
	when DataLoader spawned workers. Classic mistake: `lambda` inside
	`_build_dataloaders` can't be serialized for multiprocessing. Refactored
	to a top-level `_Collator` class. Smoke test passes. F redeployed.

	Other pod failures along the way:
	- A: nohup didn't survive ssh disconnect → setsid+nohup pattern.
	- B: uv chose Python 3.14, matplotlib wheel install hit stale-file-handle
	on the volume → pinned `requires-python` to `>=3.11,<3.13` and added
	`--link-mode=copy` to uv sync.
	- pod_bootstrap path-case bug → handled both PhysioJEPA and physiojepa.
	- Tar perms from `.claude`/`.agents` folders → excluded.
	- `rm -rf PhysioJEPA` failing on volume's stale-file-handle → switched to
	mv-rename + background rm.

	Bootstrap timing observed:
	- HF MIMIC download (412 shards / 1.5 GB): ~50 s on RTX A6000 secure pod
	- uv sync (~100 packages incl. torch): ~3 min on cold cache, ~30 s warm
	- Index build (sequential scan, 412 shards): ~5 min on A6000

	Cumulative wasted spend so far: ~30 min × $1.36/h ≈ $0.70. Acceptable.

	## 2026-04-15 09:25 — 4 pods running, 3 deploy-fanned, F started bootstrap

	State: pod_create is non-idempotent (lesson). Probing for GPU availability
	created 4 pods accidentally — turned that into the actual experiment by
	mapping each model to a GPU sized to its cost:

	C (InfoNCE, smallest) -> RTX A5000 community $0.16/h (1mc23jk89rf98v)
	A (ECG-only) -> RTX A5000 secure $0.27/h (xr4s6q5fhpsave)
	B (cross-modal Δt=0) -> A40 $0.44/h (hwa3i4i569fwwl)
	F (PhysioJEPA Δt>0, biggest) -> RTX A6000 $0.49/h (5umn3qjlrlmp4u)

	Burn rate: $1.36/h. At ~24h-to-K2 worst case = ~$33. Within budget.

	F pod bootstrap restarted after a path-case bug (looked for /workspace/physiojepa
	but tar extracted /workspace/PhysioJEPA). Fixed pod_bootstrap.sh to detect either.
	Forced tarball rebuild.

	Bootstrap timing on F pod (RTX A6000):
	- uv install + dep sync: ~3 min (torch 2.11, wandb, scipy, neurokit2, datasets, etc.)
	- HF MIMIC download (1237 files / ~1.5 GB): 48 seconds at ~30 MB/s
	- Window index build: pending — single-threaded scan of 412 shards × ~100 segments
	× ~10 windows each ≈ ~400k windows. This is the bottleneck.

	Deployed A, B, C in parallel (backgrounded scp+bootstrap) while F builds index.

	Architectural caveat noted: each pod independently downloads + builds the same
	index. Wasteful (~$2 total in download time) but cheaper than engineering a
	shared-cache pattern under time pressure. Logging for next iteration.


	User pick: Option 1 with the addition that after K2 we don't kill the winners — keep E3 and the best baseline running on the A40 toward epoch 100 while deciding whether to promote to H100. Cost of leaving an A40 running ≪ cost of cold-booting an H100. Locking that into the plan.

	## 2026-04-14 — Harness built + smoke-tested + budget reality check

	What's done:
	- Full training harness committed: `src/physiojepa/{vit,dt_embed,ema,masking,data,monitor,probe,ptbxl,models,trainer}.py`.
	- Four models implemented (`A, B, C, F`), all sharing encoders/predictor, differing only in loss and Δt handling.
	- Shared config: `configs/base.yaml`. CLI: `scripts/train.py`, `scripts/prepare_data.py`, `scripts/smoke_test.py`.
	- Smoke test passed on CPU: all 4 models forward+backward clean, losses decrease monotonically over 3 steps on random data. Baseline C starts at ln(B)=1.386 as expected for untrained InfoNCE.
	- RunPod CLI functional, $50.05 balance, no pods running.

	Architectural notes / caveats:
	- EMA is per online encoder (ECG gets EMA target, PPG gets EMA target); InfoNCE (Baseline C) has no EMA by design.
	- Self-prediction loop is per-sample (variable mask lengths). Correct but slower than padded-batch on GPU; optimisation deferred unless step time becomes the bottleneck.
	- Δt conditioning is added as an extra KV token, not replacing any PPG query. This keeps the predictor architecturally identical between Baseline B (no Δt) and E3 (Δt token) — the only real difference is whether that extra token is present. This means Baseline B and E3 are not bit-for-bit identical in parameter count (E3 has the DeltaTEmbedding MLP). Noting for the paper's "isolated variable" claim — documenting the delta explicitly.

	Budget issue requires a scope decision BEFORE launching RunPod:
	- RunPod balance: $50.05. Spend limit: $80.
	- Research doc's "~$500 on H100" assumed sequential runs, not 4× parallel. Parallel 4× 100-epoch on H100 ($3–4/h) for ~48h = ~$600–$800. Over limit.
	- Even on RTX 3090 ($0.30/h community), 4×100 epochs sequentially ≈ 100h ≈ $30 — within budget but serial wall-clock is days.
	- The K2 verdict lands at epoch 25 per the matrix's C5 checkpoint. Paper-existence is decided at epoch 25, not 100. Running to 100 is polish, not decision.

	Plan revision (to be confirmed with user):
	1. Start 4× parallel on A40 (cheap, ~$0.35/h on community cloud). ~25 epochs to K2 checkpoint.
	2. Epoch 25 = gate. If K2 passes (E3 > Baseline B by ≥0.02 AUROC), run only the winner (E3) and Baseline A to epoch 100 on a single H100.
	3. If K2 fails at epoch 25, stop, write up negative result, preserve budget.

	Total expected spend under this plan: ~$15–25 for K2 decision, another $30 for final runs = ~$50. Fits budget.

	Flagging the plan change explicitly because it deviates from the user's instruction "launch all four runs in parallel, same random seeds, 100 epochs each". The revision keeps parallelism (4 runs in parallel to epoch 25) and keeps 100 epochs as the aspiration, but makes epoch-25 a real decision gate for compute spend — which matches the matrix's own kill criteria.

	---

	## 2026-04-14 — E2/E3 kickoff

	Scope: build shared harness, implement four models (Baseline A/B/C + E3 PhysioJEPA), CPU single-batch test, then launch 4× parallel H100 training on RunPod.

	Context carried in:
	- E0 GO (381 patients, 814 h, sample-accurate aligned, 0% NaN) — `docs/e0_data_card.md`
	- E1 raw patches locked for v1 — `docs/e1_decision.md`
	- AF labels = PTB-XL (transfer claim) — `docs/af_label_decision.md`
	- v1 arch: single-lead II ECG @ 250 Hz, PPG @ 125 Hz, 200 ms patches — in `RESEARCH_DEVELOPMENT.md` §2

	Plan:
	1. Harness: Dataset/DataLoader, EMA, linear probe, collapse monitor, WandB logger, shared config.
	2. Models: four-way parallel implementation, single shared codebase differing only in loss + Δt.
	3. RunPod: no skill installed — will use REST API via `RUNPOD_API_KEY`.
	4. Single-batch CPU test before any GPU run.

	Entries below will capture every decision, failure, and caveat.