Reply to lab, 2026-04-27 ~05:40 UTC
Status absorbed. The split looks right to me; H100 and lab will not duplicate work if we lock in this division of labour:
H100 (single GPU)
Locked in to the headline LLaVA-mode chain + everything that
needs the multi-turn RFT / reasoning expansion / sanitiser commits
(all on mllm-integrate-server2):
- T3 zs_raw bench (36% in flight)
- T3 zs_enriched bench (queued)
- Stages 1-4 of `post_bench_pipeline.sh`: T1 / T2 / T3 / joint-multitask fusion-SFT in LLaVA mode (the control for your arch ablation)
- Stage 3b T3 reasoning-only ablation
- Stage 3c T3 multi-turn RFT + retrain
- Stage 3d T3 reasoning-trace expansion on post-RFT JSONL (gated)
- Stage 3e T2 reasoning-trace expansion (gated on your T2 regen marker)
- Stage 3f T1 reasoning-trace expansion (already accumulating; 333/333 today, will keep accumulating daily)
- Reaper `slurm/genqual_reaper_v5.sh` auto-fires `eval_t3_oracle.py` + `run_generation_eval.py` on every predictions.jsonl as it lands, with no wait for the post-bench pipeline.
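As a rough illustration of the reaper idea (not the actual `slurm/genqual_reaper_v5.sh` logic), the sketch below finds `predictions.jsonl` files that have no scoring marker beside them. The marker name `.scored`, the function names, and the one-directory-per-run layout are assumptions made for this example.

```python
from pathlib import Path

MARKER = ".scored"  # hypothetical marker file; the real reaper may track state differently


def find_unscored(root: Path) -> list[Path]:
    """Return predictions.jsonl files under root that have no marker beside them."""
    return sorted(
        p for p in root.rglob("predictions.jsonl")
        if not (p.parent / MARKER).exists()
    )


def mark_scored(pred: Path) -> None:
    """Record that a predictions file has been handed to the eval scripts."""
    (pred.parent / MARKER).touch()
```

A polling loop would call `find_unscored`, launch the two eval scripts on each hit with whatever arguments they take, and then call `mark_scored` so the same file is not scored twice.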
So Table 1 row 4 ("Fusion SFT, LLaVA, per task") + the joint multitask headline + T3-specific RFT/reasoning rows all come from H100.
Lab cluster (multi-GPU)
Locked in to ablations + RL + the lab-only data pipeline:
- 226075/076/077 arch ablation (LLaVA / unified+ntp / unified+mdlm) → Table 3 Phase 2 row.
- 226057 sv_gspo_v5 → Table 1 row 5/6 (RL on top of Loop-SFT).
- 226086 NTv3-8m encoder → multi-encoder grid for Table 3 §4c.
- 226090 T1+T3 full enriched JSONL HF upload → unblocks H100's full-N benches once landed.
- 226095 T2 regen v4 → unblocks H100's Stage 3e (T2 reasoning expansion) + a clean T2 re-bench with promoter+enhancer TFBS scans.
When 226090 + 226095 land on HF I'll grab them, re-fire T2 zero-shot on the new enriched data (need the proper enhancer-side TFBS evidence for the T2 paper row), and pivot Stage 3e to fire 333 T2 reasonings immediately.
On 226075-077 loss curves
Loss 6-8 with bf16 NaN at step 1000 eval: I see why you're worried, but two notes:
- The `92edaf7` fix (always saving `final/pytorch_model.bin`) means even if eval keeps NaN-ing, the final adapter is recoverable from the latest periodic-save checkpoint. Good.
- The 17.6 → 0.06 collapse on the earlier unified+ntp run was almost certainly the collator bug pre-`bda9ee0`. Now that the SFT collator sanitises before tokenisation, the model can no longer cheat on `peak_name=…` / `Observed dataset row …` / `label_source=…`. Loss floors should be closer to "real" (~3-5 at 1 epoch on 105k).
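To make the leak concrete, here is a minimal sketch of stripping the three metadata fields named above from a prompt before tokenisation. The regexes and the `sanitise_prompt` name are assumptions for illustration; the actual patterns in the collator as of `bda9ee0` may differ.

```python
import re

# Hypothetical patterns for the three label-revealing fields.
_LEAK_PATTERNS = [
    re.compile(r"peak_name=\S+"),
    re.compile(r"label_source=\S+"),
    re.compile(r"Observed dataset row[^\n]*"),
]


def sanitise_prompt(text: str) -> str:
    """Remove label-revealing metadata, then tidy leftover whitespace."""
    for pat in _LEAK_PATTERNS:
        text = pat.sub("", text)
    # Collapse runs of spaces/tabs left behind by the removals.
    text = re.sub(r"[ \t]{2,}", " ", text)
    # Drop lines that became empty and trim the rest.
    return "\n".join(line.strip() for line in text.splitlines() if line.strip())
```

Running this before the tokenizer (rather than masking tokens afterwards) means the leaked strings never reach the model's input at all.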
If the unified+mdlm run is still tracking similarly to LLaVA at step 2000+, that itself is a paper-worthy result (the architecture-mode ablation says: same data + same encoder + LoRA → the DNA head doesn't matter much, the LLM head is good enough). Worth committing to even if curves stay noisy.
🚨 Enformer oracle (225956): investigate
53h zero-progress is almost certainly a stuck dataloader or NFS hang, not "almost done". Suggested debug, in order of effort:
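# 1. Confirm the process is alive on its compute node (if it's a zombie just kill it).
#    ps must run on the same node as pgrep; the [r] keeps pgrep from matching the ssh shell itself.
node=$(squeue -h -j 225956 -o %N)
ssh "$node" 'pid=$(pgrep -of "[r]un_oracle_enformer"); [ -n "$pid" ] && ps -o pid,stat,etime,cmd -p "$pid" || echo MISSING'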
# 2. py-spy dump to see exactly where it's stuck (no GPU
# interference; read-only stack inspection):
sudo py-spy dump --pid <pid>
# Common: stuck in DataLoader workers waiting on a NFS file open.
# 3. If py-spy points at a DataLoader worker:
# - Kill the job. Re-launch with --num-workers=0 (single-process
# dataloader; bypasses fork-NFS bug entirely) and --logging-steps 1
# so we see step ticks immediately.
# - That gets you 50% throughput vs 8 workers but at least
# finishes.
# 4. If py-spy points at flash_attn or model.forward:
# Probably a bf16 NaN in the warmup; restart with fp32 master
# weights or --bf16=false for the first 500 steps.
# 5. Check the actual log for the ABSOLUTE LATEST line (not just
# "Loading weights 100%"):
tail -1 /path/to/225956/log
ls -la /path/to/225956/log # last-modified timestamp
# If timestamp is 53h ago, dataloader hang.
# If timestamp is < 5 min ago but logging is sparse, --logging-steps
# is mis-set (default 500 means 500 backward passes before first
# step log, which on Enformer is ~50 min).
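The timestamp check in step 5 can also be scripted. The sketch below classifies a log by the age of its last write; the thresholds (5 minutes, 1 hour) and the function name are assumptions chosen to mirror the comments above, not values taken from the job config.

```python
import os
import time
from typing import Optional


def classify_log(path: str, now: Optional[float] = None) -> str:
    """Classify a training log by how long ago it was last modified."""
    now = time.time() if now is None else now
    age_s = now - os.path.getmtime(path)
    if age_s < 5 * 60:
        return "active"   # still being written; sparse step lines point at --logging-steps
    if age_s < 60 * 60:
        return "slow"     # could be a long eval or a ~50 min logging interval
    return "stalled"      # hours of silence: suspect a dataloader or NFS hang
```

On an NFS-mounted log the modification time can lag slightly behind the writer, so a result near a threshold is worth re-checking before killing the job.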
If you'd rather have H100 take over Enformer training, I can fire it
once T3 bench finishes (~5h). H100 is fast enough that 20-epoch
Enformer should take ~12-15h. Just tell me the input-data path
(your /extra/zhanglab0 path) + epoch count and I'll launch a
`slurm/run_oracle_enformer.sh`-equivalent locally.

But note: DeepSTARR-7cell at val_pearson=0.136 is what every T3 paper row
in the post-bench pipeline scores against, and it's "good enough" in
the sense that we use deltas (`activity_delta_src`,
`activity_relative_shift`) where weak oracles still rank correctly.
Enformer is a Table 4 cross-oracle robustness check, not the headline.
So skipping it for v1 is fine if it costs another 50h of GPU.
Innovative-component work prioritisation
Per your note "make the whole pipeline with all innovative component work is important", the pipeline today has every component wired except Phase 3 diffusion:
| Component | Path | Status |
|---|---|---|
| LLaVA-mode fusion-SFT | `train_fusion_sft.py --architecture-mode llava` | ✅ default; lab 226075 |
| Unified-mode (NTP / MDLM) | `--architecture-mode unified --dna-loss-kind {ntp,mdlm}` | ✅ wired; lab 226076/077 |
| Diffusion (LLaDA) | `--architecture-mode diffusion` | NOT YET WIRED (`train_fusion_sft.py:88` says "Phase 3 = diffusion, not yet wired") |
| RFT multi-turn | `scripts/rft_t3.py --rounds 4 --candidates 4` | ✅; H100 Stage 3c |
| Reasoning expansion (OpenRouter Ling-1T) | `scripts/build_reasoning_traces.py` | ✅; H100 333 T1 done, T2/T3 gated |
| Loop-SFT | `scripts/train_loop_sft.py` | ✅ wired; needs trajectory dataset |
| SV-GSPO | `scripts/train_sv_gspo.py` | ✅; lab 226057 |
The single missing innovative component is Phase 3 diffusion.
Roughly 1-2 days of work (need to add a DNAOutputHead for LLaDA + collator changes + reverse-process sampler). I can take this on H100 once Stage 4 finishes, or drop it into the Phase 4 ("after submission") list. Your call.
What I'm doing in the meantime (next 5 hours)
1. Reaper auto-scoring T1 zs (in flight, ~25 min remaining).
2. Watching for your full-enriched HF push (~5h ETA per your message); when it lands I'll pull and re-fire T2 zs bench on the new enriched data.
3. T1 reasoning loop will keep accumulating (333/day → ~10k rows in ~30 days at single-key, faster if more keys).

After T3 bench finishes:

4. Stages 1/2/3/3b/3c/3d/3f/4 fire automatically.
5. I might also pivot to Phase 3 diffusion implementation if you're OK with H100 spending ~2 days on it after Stage 4 lands.
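The T1 accumulation estimate above is simple arithmetic; a small sketch makes the key-count scaling explicit. It assumes throughput scales linearly with the number of API keys at 333 traces per key per day, which is an assumption rather than a measured rate, and `days_to_target` is a name invented for this example.

```python
import math


def days_to_target(target_rows: int, rows_per_key_per_day: int = 333, n_keys: int = 1) -> int:
    """Whole days needed to accumulate target_rows reasoning traces."""
    return math.ceil(target_rows / (rows_per_key_per_day * n_keys))
```

At a single key, 30 days yields 333 × 30 = 9,990 rows, so `days_to_target(10_000)` returns 31; under the linear-scaling assumption, three keys bring that down to 11 days.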
Ask
- Confirm split: H100 owns LLaVA-mode + RFT + reasoning + headline joint multitask. Lab owns arch ablation (226075-077) + SV-GSPO + multi-encoder + T2 regen + Enformer.
- On Enformer hang: are you OK with H100 taking it over after T3 bench finishes? Or should we stick with DeepSTARR-7cell only and defer Enformer to the extended-paper review pass?
- On Phase 3 diffusion: drop now, or implement after H100's Stage 4 finishes? (~2 days.)
-- H100 side