| # Reply to lab, 2026-04-27 ~05:40 UTC |
|
|
| Status absorbed. The split looks right to me: **H100 and lab are not |
| duplicating** if we lock in this division of labour: |
|
|
| ## H100 (single GPU) |
|
|
| Locked in to the **headline LLaVA-mode chain** + everything that |
| needs the multi-turn RFT / reasoning expansion / sanitiser commits |
| (all on `mllm-integrate-server2`): |
|
|
| * T3 zs_raw bench (36% in flight) |
| * T3 zs_enriched bench (queued) |
| * Stages 1–4 of `post_bench_pipeline.sh`: T1 / T2 / T3 / joint-multitask |
| fusion-SFT in LLaVA mode (the **control** for your arch ablation) |
| * Stage 3b T3 reasoning-only ablation |
| * Stage 3c T3 multi-turn RFT + retrain |
| * Stage 3d T3 reasoning-trace expansion on post-RFT JSONL (gated) |
| * Stage 3e T2 reasoning-trace expansion (gated on your T2 regen marker) |
| * Stage 3f T1 reasoning-trace expansion (already accumulating; |
| 333/333 today, will keep accumulating daily) |
| * Reaper `slurm/genqual_reaper_v5.sh` auto-fires `eval_t3_oracle.py` + |
| `run_generation_eval.py` on every `predictions.jsonl` as it lands, |
| with no wait for the post-bench pipeline. |
|
|
| So Table 1 row 4 ("Fusion SFT, LLaVA, per task") + the joint |
| multitask headline + T3-specific RFT/reasoning rows all come from H100. |
|
|
| ## Lab cluster (multi-GPU) |
|
|
| Locked in to ablations + RL + the lab-only data pipeline: |
|
|
| * **226075/076/077** arch ablation (LLaVA / unified+ntp / unified+mdlm) |
| → **Table 3 Phase 2** row. |
| * **226057** sv_gspo_v5 → **Table 1 row 5/6** (RL on top of Loop-SFT). |
| * **226086** NTv3-8m encoder → multi-encoder grid for **Table 3 §4c**. |
| * **226090** T1+T3 full enriched JSONL HF upload → unblocks H100's |
| full-N benches once landed. |
| * **226095** T2 regen v4 → unblocks H100's Stage 3e (T2 reasoning |
| expansion) + a clean T2 re-bench with promoter+enhancer TFBS scans. |
|
|
| When 226090 + 226095 land on HF I'll grab them, re-fire T2 zero-shot |
| on the new enriched data (need the proper enhancer-side TFBS evidence |
| for the T2 paper row), and pivot Stage 3e to fire 333 T2 reasonings |
| immediately. |
|
|
| ## On 226075-077 loss curves |
|
|
| Loss 6–8 with bf16 NaN at step 1000 eval: I see why you're worried, |
| but two notes: |
|
|
| 1. The 92edaf7 fix (saving `final/pytorch_model.bin` always) means |
| even if eval keeps NaN-ing, the final adapter is recoverable from |
| the latest periodic-save checkpoint. Good. |
| 2. The 17.6 → 0.06 collapse on the earlier unified+ntp run was almost |
| certainly the collator bug pre-`bda9ee0`. Now that the SFT collator |
| sanitises before tokenisation, the model can no longer cheat on |
| `peak_name=…` / `Observed dataset row …` / `label_source=…`. Loss |
| floors should be closer to "real" (~3–5 at 1 epoch on 105k). |
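To make the leak concrete, here is a minimal sketch of the kind of pre-tokenisation scrub described in item 2. It is not the `bda9ee0` collator itself: the function name `sanitise_prompt`, the regexes, and the assumed shape of each field (a `key=value` token, or a sentence starting with `Observed dataset row`) are illustrative assumptions, since the real field formats are not shown in this thread.

```python
import re

# Hypothetical patterns for the three leaky fields named above.
# The actual formats in the SFT data may differ; adjust before use.
_LEAK_PATTERNS = [
    re.compile(r"\bpeak_name=\S+"),
    re.compile(r"\blabel_source=\S+"),
    re.compile(r"Observed dataset row[^\n.]*\.?"),
]


def sanitise_prompt(text: str) -> str:
    """Strip label-leaking metadata, then collapse leftover whitespace."""
    for pattern in _LEAK_PATTERNS:
        text = pattern.sub("", text)
    # Removals leave runs of spaces behind; squeeze them to one space.
    text = re.sub(r"[ \t]{2,}", " ", text)
    return text.strip()


if __name__ == "__main__":
    raw = (
        "Sequence ACGT peak_name=chr1_peak7 label_source=starr "
        "Observed dataset row 12. Predict activity."
    )
    print(sanitise_prompt(raw))
```

The key design point is ordering: the scrub has to run on the raw string before the tokeniser sees it, otherwise the leaked tokens are already in `input_ids` and the loss can still collapse onto them.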
|
|
| If the unified+mdlm run is still tracking similar to LLaVA at step |
| 2000+, that itself is a paper-worthy result (architecture-mode |
| ablation says: same data + same encoder + LoRA → DNA head doesn't |
| matter much, the LLM head is good enough). Worth committing to even |
| if curves stay noisy. |
|
|
| ## 🚨 Enformer oracle (225956): investigate |
|
|
| 53h zero-progress is almost certainly a stuck dataloader or NFS hang, |
| not "almost done". Suggested debug, in order of effort: |
|
|
| ```bash |
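| # 1. Confirm the process is alive (if it's a zombie just kill it). |
| # ps must run on the compute node, since the PIDs live there. |
| # Assumes a single-node job, so %N expands to one hostname. |
| # The [e] keeps pgrep from matching the remote shell's own command line: |
| node=$(squeue -h -j 225956 -o %N) |
| ssh "$node" 'ps -o pid,stat,etime,cmd -p "$(pgrep -d, -f "run_oracle_[e]nformer")"' || echo MISSING |
| |
| # 2. py-spy dump to see exactly where it's stuck (no GPU |
| # interference; read-only stack inspection). Run it on the node: |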
| sudo py-spy dump --pid <pid> |
| # Common: stuck in DataLoader workers waiting on a NFS file open. |
| |
| # 3. If py-spy points at a DataLoader worker: |
| # - Kill the job. Re-launch with --num-workers=0 (single-process |
| # dataloader; bypasses fork-NFS bug entirely) and --logging-steps 1 |
| # so we see step ticks immediately. |
| # - That gets you 50% throughput vs 8 workers but at least |
| # finishes. |
| |
| # 4. If py-spy points at flash_attn or model.forward: |
| # Probably a bf16 NaN in the warmup; restart with fp32 master |
| # weights or --bf16=false for the first 500 steps. |
| |
| # 5. Check the actual log for the ABSOLUTE LATEST line (not just |
| # "Loading weights 100%"): |
| tail -1 /path/to/225956/log |
| ls -la /path/to/225956/log # last-modified timestamp |
| # If timestamp is 53h ago, dataloader hang. |
| # If timestamp is < 5 min ago but logging is sparse, --logging-steps |
| # is mis-set (default 500 means 500 backward passes before first |
| # step log, which on Enformer is ~50 min). |
| ``` |
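Step 5 can also be scripted so the check is repeatable. The sketch below is an assumption-laden helper, not part of the repo: `classify_log` and `minutes_to_first_log` are hypothetical names, the 5-minute staleness threshold mirrors the comment above, and the 6 s/step figure is simply what "500 steps ≈ 50 min" implies.

```python
import os
import tempfile
import time
from typing import Optional


def classify_log(path: str, stale_after_s: float = 300.0,
                 now: Optional[float] = None) -> str:
    """Return 'stale' if the log's mtime is older than stale_after_s, else 'active'."""
    now = time.time() if now is None else now
    age_s = now - os.path.getmtime(path)
    return "stale" if age_s > stale_after_s else "active"


def minutes_to_first_log(logging_steps: int, seconds_per_step: float) -> float:
    """Wall-clock minutes before the first step line appears in the log."""
    return logging_steps * seconds_per_step / 60.0


if __name__ == "__main__":
    # Demo on a throwaway file whose mtime is pushed 53 hours into the past.
    with tempfile.NamedTemporaryFile(delete=False) as handle:
        demo_path = handle.name
    old = time.time() - 53 * 3600
    os.utime(demo_path, (old, old))
    print(classify_log(demo_path))          # log untouched for 53h
    print(minutes_to_first_log(500, 6.0))   # default logging interval
    print(minutes_to_first_log(1, 6.0))     # with --logging-steps 1
    os.remove(demo_path)
```

A `stale` result points at the dataloader/NFS hypothesis; an `active` result with no step lines points at the logging interval instead.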
|
|
| If you'd rather have H100 take over Enformer training, I can fire it |
| once T3 bench finishes (~5h). H100 is fast enough that 20-epoch |
| Enformer should take ~12–15h. Just tell me the input-data path |
| (your /extra/zhanglab0 path) + epoch count and I'll launch a |
| `slurm/run_oracle_enformer.sh` equivalent locally. **But** note that |
| **DeepSTARR-7cell at val_pearson=0.136** is what every T3 paper row |
| in the post-bench pipeline scores against, and it's "good enough" in |
| the sense that we use *deltas* (activity_delta_src, |
| activity_relative_shift) where weak oracles still rank correctly. |
| Enformer is a Table 4 cross-oracle robustness check, not the headline. |
| So skipping it for v1 is fine if it costs another 50h of GPU. |
| |
| ## Innovative-component work prioritisation |
| |
| Per your note "make the whole pipeline with all innovative component |
| work is important", the pipeline today has every component wired: |
| |
| | Component | Path | Status | |
| |---|---|---| |
| | LLaVA-mode fusion-SFT | `train_fusion_sft.py --architecture-mode llava` | ✅ default; lab 226075 | |
| | Unified-mode (NTP / MDLM) | `--architecture-mode unified --dna-loss-kind {ntp,mdlm}` | ✅ wired; lab 226076/077 | |
| | Diffusion (LLaDA) | `--architecture-mode diffusion` | **NOT YET WIRED** (`train_fusion_sft.py:88` says "Phase 3 = diffusion, not yet wired") | |
| | RFT multi-turn | `scripts/rft_t3.py --rounds 4 --candidates 4` | ✅ H100 Stage 3c | |
| | Reasoning expansion (OpenRouter Ling-1T) | `scripts/build_reasoning_traces.py` | ✅ H100 333 T1 done, T2/T3 gated | |
| | Loop-SFT | `scripts/train_loop_sft.py` | ✅ wired; needs trajectory dataset | |
| | SV-GSPO | `scripts/train_sv_gspo.py` | ✅ lab 226057 | |
| |
| **The single missing innovative component is Phase 3 diffusion**. |
| Roughly 1–2 days of work (need to add a `DNAOutputHead` for LLaDA |
| + collator changes + reverse-process sampler). I can take this on H100 |
| once Stage 4 finishes, or drop into the Phase 4 ("after submission") |
| list. Your call. |
| |
| ## What I'm doing in the meantime (next 5 hours) |
| |
| 1. Reaper auto-scoring T1 zs (in flight, ~25 min remaining). |
| 2. Watching for your full-enriched HF push (~5h ETA per your message); |
| when it lands I'll pull and re-fire T2 zs bench on the new |
| enriched data. |
| 3. T1 reasoning loop will keep accumulating (333/day → ~10k rows in |
| ~30 days at single-key, faster if more keys). |
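The accumulation estimate in item 3 is simple arithmetic, sketched here for planning how many keys are worth adding. The function name `days_to_target` is hypothetical, and the linear scaling with key count is an assumption (the memo only says "faster if more keys").

```python
import math


def days_to_target(target_rows: int, rows_per_key_per_day: int = 333,
                   n_keys: int = 1) -> int:
    """Whole days needed to reach target_rows, assuming throughput scales linearly with keys."""
    return math.ceil(target_rows / (rows_per_key_per_day * n_keys))


if __name__ == "__main__":
    for keys in (1, 2, 4):
        print(keys, "key(s):", days_to_target(10_000, n_keys=keys), "days")
```

At a single key, 30 days yields 333 × 30 = 9,990 rows, so the 10k mark falls on day 31, which matches the "~30 days" figure above.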
| |
| After T3 bench finishes: |
| 4. Stages 1/2/3/3b/3c/3d/3f/4 fire automatically. |
| 5. I might also pivot to Phase 3 diffusion implementation if you're |
| OK with H100 spending ~2 days on it after Stage 4 lands. |
| |
| ## Ask |
| |
| * Confirm split: H100 owns LLaVA-mode + RFT + reasoning + headline |
| joint multitask. Lab owns arch ablation (226075-077) + SV-GSPO + |
| multi-encoder + T2 regen + Enformer. |
| * On Enformer hang: are you OK with H100 taking it over after T3 |
| bench finishes? Or stick with DeepSTARR-7cell only and defer |
| Enformer to extended-paper review pass? |
| * On Phase 3 diffusion: drop now, or implement after H100's Stage 4 |
| finishes? (~2 days.) |
| |
| -- H100 side |
| |