	# Reply to lab — 2026-04-27 ~05:40 UTC

	Status absorbed. The split looks right to me — **H100 and lab are not
	duplicating** if we lock in this division of labour:

	## H100 (single GPU)

	Locked in to the headline LLaVA-mode chain + everything that
	needs the multi-turn RFT / reasoning expansion / sanitiser commits
	(all on `mllm-integrate-server2`):

	* T3 zs_raw bench (in flight, 36% done)
	* T3 zs_enriched bench (queued)
	* Stages 1–4 of `post_bench_pipeline.sh`: T1 / T2 / T3 / joint-multitask
	fusion-SFT in LLaVA mode (the control for your arch ablation)
	* Stage 3b T3 reasoning-only ablation
	* Stage 3c T3 multi-turn RFT + retrain
	* Stage 3d T3 reasoning-trace expansion on post-RFT JSONL (gated)
	* Stage 3e T2 reasoning-trace expansion (gated on your T2 regen marker)
	* Stage 3f T1 reasoning-trace expansion (already accumulating;
	333/333 today, will keep accumulating daily)
	* Reaper `slurm/genqual_reaper_v5.sh` auto-fires `eval_t3_oracle.py` +
	`run_generation_eval.py` on every predictions.jsonl as it lands —
	no wait for the post-bench pipeline.
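The reaper's watch-and-score behaviour can be sketched in Python as below. This is a minimal illustration, not the contents of `slurm/genqual_reaper_v5.sh`: the `.scored` marker file and the names `find_unscored` / `mark_scored` are hypothetical, and the real script may track completed runs differently.

```python
from pathlib import Path


def find_unscored(root, marker_name=".scored"):
    """Return predictions.jsonl files under root with no marker beside them.

    Assumption: a run counts as scored once a marker file (hypothetical
    name ".scored") exists in the same directory as predictions.jsonl.
    """
    pending = []
    for pred in sorted(Path(root).rglob("predictions.jsonl")):
        if not (pred.parent / marker_name).exists():
            pending.append(pred)
    return pending


def mark_scored(pred, marker_name=".scored"):
    """Drop the marker so the next poll skips this run."""
    (Path(pred).parent / marker_name).touch()
```

A polling loop would call `find_unscored` on each tick, launch `eval_t3_oracle.py` and `run_generation_eval.py` for every returned path, then call `mark_scored` so the same run is not scored twice.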

	So Table 1 row 4 ("Fusion SFT, LLaVA, per task") + the joint
	multitask headline + T3-specific RFT/reasoning rows all come from H100.

	## Lab cluster (multi-GPU)

	Locked in to ablations + RL + the lab-only data pipeline:

	* 226075/076/077 arch ablation (LLaVA / unified+ntp / unified+mdlm)
	→ Table 3 Phase 2 row.
	* 226057 sv_gspo_v5 → Table 1 row 5/6 (RL on top of Loop-SFT).
	* 226086 NTv3-8m encoder → multi-encoder grid for Table 3 §4c.
	* 226090 T1+T3 full enriched JSONL HF upload → unblocks H100's
	full-N benches once landed.
	* 226095 T2 regen v4 → unblocks H100's Stage 3e (T2 reasoning
	expansion) + a clean T2 re-bench with promoter+enhancer TFBS scans.

	When 226090 + 226095 land on HF I'll grab them, re-fire T2 zero-shot
	on the new enriched data (need the proper enhancer-side TFBS evidence
	for the T2 paper row), and pivot Stage 3e to fire 333 T2 reasonings
	immediately.

	## On 226075-077 loss curves

	Loss 6→8 with bf16 NaN at step 1000 eval — I see why you're worried,
	but two notes:

	1. The 92edaf7 fix (saving `final/pytorch_model.bin` always) means
	even if eval keeps NaN-ing, the final adapter is recoverable from
	the latest periodic-save checkpoint. Good.
	2. The 17.6 → 0.06 collapse on the earlier unified+ntp run was almost
	certainly the collator bug pre-`bda9ee0`. Now that the SFT collator
	sanitises before tokenisation, the model can no longer cheat on
	`peak_name=…` / `Observed dataset row …` / `label_source=…`. Loss
	floors should be closer to "real" (~3–5 at 1 epoch on 105k).
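To make the leak concrete, here is a minimal sketch of the kind of pre-tokenisation scrub described in note 2. It is not the code from `bda9ee0`: the function name `sanitise_prompt`, the regexes, and the assumption that the leaking fields appear as `key=value` tokens or as whole lines starting with `Observed dataset row` are illustrative guesses about the prompt format.

```python
import re

# Hypothetical patterns for the three leak sources named above.
_KEY_VALUE_LEAKS = re.compile(r"\b(?:peak_name|label_source)=\S+")
_ROW_LINE_LEAK = re.compile(r"^Observed dataset row.*$", re.MULTILINE)


def sanitise_prompt(text):
    """Strip label-revealing metadata before the text reaches the tokenizer."""
    text = _ROW_LINE_LEAK.sub("", text)
    text = _KEY_VALUE_LEAKS.sub("", text)
    # Collapse the whitespace left behind by the removals and drop empty lines.
    lines = (" ".join(line.split()) for line in text.splitlines())
    return "\n".join(line for line in lines if line)
```

If the scrub runs inside the collator, before tokenisation, every batch is cleaned no matter which JSONL it came from, which is why the loss floor should move up to a more believable range.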

	If the unified+mdlm run is still tracking similar to LLaVA at step
	2000+, that itself is a paper-worthy result (architecture-mode
	ablation says: same data + same encoder + LoRA → DNA head doesn't
	matter much, the LLM head is good enough). Worth committing to even
	if curves stay noisy.

	## 🚨 Enformer oracle (225956) — investigate

	53h zero-progress is almost certainly a stuck dataloader or NFS hang,
	not "almost done". Suggested debug, in order of effort:

	```bash
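	# 1. Confirm the process is alive on its compute node (if it's a zombie, just kill it).
	#    ps has to run on the node that owns the PID, so do the whole check over ssh.
	#    The [r] in the pattern stops pgrep from matching the ssh-spawned shell itself.
	node=$(squeue -h -j 225956 -o %N)
	ssh "$node" 'pids=$(pgrep -d, -f "[r]un_oracle_enformer") && ps -o pid,stat,etime,cmd -p "$pids" || echo MISSING'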

	# 2. py-spy dump to see exactly where it's stuck (no GPU
	# interference — read-only stack inspection):
	sudo py-spy dump --pid <pid>
	# Common: stuck in DataLoader workers waiting on a NFS file open.

	# 3. If py-spy points at a DataLoader worker:
	# - Kill the job. Re-launch with --num-workers=0 (single-process
	# dataloader; bypasses fork-NFS bug entirely) and --logging-steps 1
	# so we see step ticks immediately.
	# - That gets you 50% throughput vs 8 workers but at least
	# finishes.

	# 4. If py-spy points at flash_attn or model.forward:
	# Probably a bf16 NaN in the warmup; restart with fp32 master
	# weights or --bf16=false for the first 500 steps.

	# 5. Check the actual log for the ABSOLUTE LATEST line (not just
	# "Loading weights 100%"):
	tail -1 /path/to/225956/log
	ls -la /path/to/225956/log # last-modified timestamp
	# If timestamp is 53h ago, dataloader hang.
	# If timestamp is < 5 min ago but logging is sparse, --logging-steps
	# is mis-set (default 500 means 500 backward passes before first
	# step log, which on Enformer is ~50 min).
	```
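The timestamp check in step 5 can be scripted. This is a minimal sketch, assuming the thresholds quoted above: a log untouched for many hours suggests a hang, and one touched within the last 5 minutes suggests sparse logging. The function names, the one-hour "stalled" cut-off (chosen to sit above the ~50 min first-log delay mentioned above) and the returned labels are my own choices, not part of any existing tool.

```python
import os
import time


def classify_log_age(age_seconds, fresh_s=5 * 60, stale_s=60 * 60):
    """Map seconds since the log's last write to a triage label.

    fresh_s follows the "< 5 min" rule above; stale_s (1 h) is an
    assumed cut-off, picked to exceed the ~50 min first-log delay.
    """
    if age_seconds < fresh_s:
        return "alive: check --logging-steps"
    if age_seconds >= stale_s:
        return "stalled: suspect dataloader/NFS hang"
    return "unclear: re-check in a few minutes"


def triage_log(path, now=None):
    """Classify a log file by its last-modified time."""
    now = time.time() if now is None else now
    return classify_log_age(now - os.path.getmtime(path))
```

Calling `triage_log` on the 225956 log path gives a one-word verdict to act on before anyone needs `sudo` for py-spy.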

	If you'd rather have H100 take over Enformer training, I can fire it
	once T3 bench finishes (~5h). H100 is fast enough that 20-epoch
	Enformer should take ~12–15h. Just tell me the input-data path
	(your /extra/zhanglab0 path) + epoch count and I'll launch
	`slurm/run_oracle_enformer.sh`-equivalent locally. But —
	DeepSTARR-7cell at val_pearson=0.136 is what every T3 paper row
	in the post-bench pipeline scores against, and it's "good enough" in
	the sense that we use deltas (activity_delta_src,
	activity_relative_shift) where weak oracles still rank correctly.
	Enformer is a Table 4 cross-oracle robustness check, not the headline.
	So skipping it for v1 is fine if it costs another 50h of GPU.

	## Innovative-component work prioritisation

	Per your note "make the whole pipeline with all innovative component
	work is important", the pipeline today has every component wired
	except Phase 3 diffusion:

	| Component | Path | Status |
	|---|---|---|
	| LLaVA-mode fusion-SFT | `train_fusion_sft.py --architecture-mode llava` | ✅ default; lab 226075 |
	| Unified-mode (NTP / MDLM) | `--architecture-mode unified --dna-loss-kind {ntp,mdlm}` | ✅ wired; lab 226076/077 |
	| Diffusion (LLaDA) | `--architecture-mode diffusion` | NOT YET WIRED (`train_fusion_sft.py:88` says "Phase 3 = diffusion, not yet wired") |
	| RFT multi-turn | `scripts/rft_t3.py --rounds 4 --candidates 4` | ✅; H100 Stage 3c |
	| Reasoning expansion (OpenRouter Ling-1T) | `scripts/build_reasoning_traces.py` | ✅; H100 333 T1 done, T2/T3 gated |
	| Loop-SFT | `scripts/train_loop_sft.py` | ✅ wired; needs trajectory dataset |
	| SV-GSPO | `scripts/train_sv_gspo.py` | ✅; lab 226057 |

	The single missing innovative component is Phase 3 diffusion.
	Roughly 1–2 days of work (need to add a `DNAOutputHead` for LLaDA
	+ collator changes + reverse-process sampler). I can take this on H100
	once Stage 4 finishes, or drop into the Phase 4 ("after submission")
	list. Your call.

	## What I'm doing in the meantime (next 5 hours)

	1. Reaper auto-scoring T1 zs (in flight, ~25 min remaining).
	2. Watching for your full-enriched HF push (~5h ETA per your message);
	when it lands I'll pull and re-fire T2 zs bench on the new
	enriched data.
	3. T1 reasoning loop will keep accumulating (333/day — ~10k rows in
	~30 days at single-key, faster if more keys).
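The accumulation estimate in item 3 is simple arithmetic. A quick sketch, assuming throughput scales linearly with the number of API keys (an assumption; rate limits may not stack that cleanly):

```python
import math


def days_to_target(target_rows, rows_per_key_per_day=333, n_keys=1):
    """Whole days needed to accumulate target_rows reasoning traces."""
    return math.ceil(target_rows / (rows_per_key_per_day * n_keys))


print(days_to_target(10_000))            # 31
print(days_to_target(10_000, n_keys=3))  # 11
```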

	After T3 bench finishes:

	4. Stages 1/2/3/3b/3c/3d/3f/4 fire automatically.
	5. I might also pivot to Phase 3 diffusion implementation if you're
	OK with H100 spending ~2 days on it after Stage 4 lands.

	## Ask

	* Confirm split: H100 owns LLaVA-mode + RFT + reasoning + headline
	joint multitask. Lab owns arch ablation (226075-077) + SV-GSPO +
	multi-encoder + T2 regen + Enformer.
	* On Enformer hang: are you OK with H100 taking it over after T3
	bench finishes? Or should we stick with DeepSTARR-7cell only and
	defer Enformer to the extended-paper review pass?
	* On Phase 3 diffusion: drop now, or implement after H100's Stage 4
	finishes? (~2 days.)

	— H100 side