# Reply to lab (2026-04-27 ~05:40 UTC)
Status absorbed. The split looks right to me. **H100 and lab are not
duplicating** work if we lock in this division of labour:
## H100 (single GPU)
Locked in to the **headline LLaVA-mode chain** + everything that
needs the multi-turn RFT / reasoning expansion / sanitiser commits
(all on `mllm-integrate-server2`):
* T3 zs_raw bench (36% in flight)
* T3 zs_enriched bench (queued)
* Stages 1-4 of `post_bench_pipeline.sh`: T1 / T2 / T3 / joint-multitask
fusion-SFT in LLaVA mode (the **control** for your arch ablation)
* Stage 3b T3 reasoning-only ablation
* Stage 3c T3 multi-turn RFT + retrain
* Stage 3d T3 reasoning-trace expansion on post-RFT JSONL (gated)
* Stage 3e T2 reasoning-trace expansion (gated on your T2 regen marker)
* Stage 3f T1 reasoning-trace expansion (already accumulating;
333/333 today, will keep accumulating daily)
* Reaper `slurm/genqual_reaper_v5.sh` auto-fires `eval_t3_oracle.py` +
`run_generation_eval.py` on every predictions.jsonl as it lands,
with no wait for the post-bench pipeline.
So Table 1 row 4 ("Fusion SFT, LLaVA, per task") + the joint
multitask headline + T3-specific RFT/reasoning rows all come from H100.
## Lab cluster (multi-GPU)
Locked in to ablations + RL + the lab-only data pipeline:
* **226075/076/077** arch ablation (LLaVA / unified+ntp / unified+mdlm)
→ **Table 3 Phase 2** row.
* **226057** sv_gspo_v5 → **Table 1 row 5/6** (RL on top of Loop-SFT).
* **226086** NTv3-8m encoder → multi-encoder grid for **Table 3 §4c**.
* **226090** T1+T3 full enriched JSONL HF upload → unblocks H100's
full-N benches once landed.
* **226095** T2 regen v4 → unblocks H100's Stage 3e (T2 reasoning
expansion) + a clean T2 re-bench with promoter+enhancer TFBS scans.
When 226090 + 226095 land on HF I'll grab them, re-fire T2 zero-shot
on the new enriched data (need the proper enhancer-side TFBS evidence
for the T2 paper row), and pivot Stage 3e to fire 333 T2 reasonings
immediately.
## On 226075-077 loss curves
Loss of 6-8 with a bf16 NaN at the step-1000 eval is worrying, I agree,
but two notes:
1. The 92edaf7 fix (saving `final/pytorch_model.bin` always) means
even if eval keeps NaN-ing, the final adapter is recoverable from
the latest periodic-save checkpoint. Good.
2. The 17.6 → 0.06 collapse on the earlier unified+ntp run was almost
certainly the collator bug pre-`bda9ee0`. Now that the SFT collator
sanitises before tokenisation, the model can no longer cheat on
`peak_name=…` / `Observed dataset row …` / `label_source=…`. Loss
floors should be closer to "real" (~3-5 at 1 epoch on 105k).
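To make the leak in note 2 concrete, here is a minimal sketch of a pre-tokenisation sanitiser. It is not the `bda9ee0` implementation: the function name `sanitise_prompt` and the regexes are hypothetical, and the field formats (whitespace-delimited `key=value` tokens, a whole-line `Observed dataset row` sentence) are assumptions, since the real formats are elided above.

```python
import re

# Hypothetical leak patterns modelled on the three fields named above.
# ASSUMPTION: the real field formats are not shown in this note.
LEAK_PATTERNS = [
    re.compile(r"\bpeak_name=\S+"),
    re.compile(r"\blabel_source=\S+"),
    re.compile(r"^Observed dataset row\b.*$", re.MULTILINE),
]


def sanitise_prompt(text: str) -> str:
    """Strip label-leaking metadata from a prompt before tokenisation."""
    for pattern in LEAK_PATTERNS:
        text = pattern.sub("", text)
    # Tidy the whitespace left behind by the removals.
    text = re.sub(r"[ \t]+", " ", text)    # collapse runs of spaces/tabs
    text = re.sub(r" ?\n\s*", "\n", text)  # drop blank lines and padding
    return text.strip()


raw = (
    "Sequence: ACGT peak_name=chr1_100_200 label_source=oracle\n"
    "Observed dataset row 42 is active\n"
    "Predict activity."
)
clean = sanitise_prompt(raw)
```

If a loss curve still collapses towards zero after a step like this, the remaining leak is probably in a field these patterns do not cover.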
If the unified+mdlm run is still tracking close to LLaVA at step
2000+, that is itself a paper-worthy result (the architecture-mode
ablation would say: same data + same encoder + LoRA → the DNA head
doesn't matter much, the LLM head is good enough). Worth committing to
even if curves stay noisy.
## 🚨 Enformer oracle (225956): investigate
53h zero-progress is almost certainly a stuck dataloader or NFS hang,
not "almost done". Suggested debug, in order of effort:
```bash
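# 1. Confirm the process is alive on the compute node (if it's a zombie just kill it).
#    Assumes a single-node job; the [e] keeps pgrep from matching the ssh shell itself.
node=$(squeue -h -j 225956 -o %N)
ssh "$node" 'pids=$(pgrep -d, -f "run_oracle_[e]nformer") && ps -o pid,stat,etime,cmd -p "$pids" || echo MISSING'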
# 2. py-spy dump to see exactly where it's stuck (no GPU
# interference; read-only stack inspection):
sudo py-spy dump --pid <pid>
# Common: stuck in DataLoader workers waiting on a NFS file open.
# 3. If py-spy points at a DataLoader worker:
# - Kill the job. Re-launch with --num-workers=0 (single-process
# dataloader; bypasses fork-NFS bug entirely) and --logging-steps 1
# so we see step ticks immediately.
# - That gets you 50% throughput vs 8 workers but at least
# finishes.
# 4. If py-spy points at flash_attn or model.forward:
# Probably a bf16 NaN in the warmup; restart with fp32 master
# weights or --bf16=false for the first 500 steps.
# 5. Check the actual log for the ABSOLUTE LATEST line (not just
# "Loading weights 100%"):
tail -1 /path/to/225956/log
ls -la /path/to/225956/log # last-modified timestamp
# If timestamp is 53h ago, dataloader hang.
# If timestamp is < 5 min ago but logging is sparse, --logging-steps
# is mis-set (default 500 means 500 backward passes before first
# step log, which on Enformer is ~50 min).
```
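The timestamp check in step 5 can also be scripted. This is a minimal sketch, not part of the existing tooling: `classify_log_staleness` is a hypothetical helper, the 5-minute "fresh" threshold comes from the comment above, and the 6-hour "stale" threshold is an assumption chosen for illustration.

```python
import os
import time


def classify_log_staleness(log_path, now=None, fresh_s=5 * 60, stale_s=6 * 3600):
    """Classify a training log by how long ago it was last modified.

    "fresh"     -> the job is writing; sparse step lines point at --logging-steps.
    "stale"     -> no writes for a long time; likely a hang, inspect with py-spy.
    "ambiguous" -> in between; wait or inspect.
    """
    now = time.time() if now is None else now
    age_s = now - os.path.getmtime(log_path)
    if age_s < fresh_s:
        return "fresh"
    if age_s > stale_s:
        return "stale"
    return "ambiguous"
```

A log last touched 53h ago would come back as `"stale"`, which matches the dataloader-hang reading above.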
If you'd rather have H100 take over Enformer training, I can fire it
once T3 bench finishes (~5h). H100 is fast enough that 20-epoch
Enformer should take ~12-15h. Just tell me the input-data path
(your /extra/zhanglab0 path) + epoch count and I'll launch a
`slurm/run_oracle_enformer.sh`-equivalent locally. **But** note that
**DeepSTARR-7cell at val_pearson=0.136** is what every T3 paper row
in the post-bench pipeline scores against, and it's "good enough" in
the sense that we use *deltas* (activity_delta_src,
activity_relative_shift) where weak oracles still rank correctly.
Enformer is a Table 4 cross-oracle robustness check, not the headline.
So skipping it for v1 is fine if it costs another 50h of GPU.
## Innovative-component work prioritisation
Per your note "make the whole pipeline with all innovative component
work is important", here is where each component stands today; all but
one are wired:
| Component | Path | Status |
|---|---|---|
| LLaVA-mode fusion-SFT | `train_fusion_sft.py --architecture-mode llava` | ✅ default; lab 226075 |
| Unified-mode (NTP / MDLM) | `--architecture-mode unified --dna-loss-kind {ntp,mdlm}` | ✅ wired; lab 226076/077 |
| Diffusion (LLaDA) | `--architecture-mode diffusion` | **NOT YET WIRED** (`train_fusion_sft.py:88` says "Phase 3 = diffusion, not yet wired") |
| RFT multi-turn | `scripts/rft_t3.py --rounds 4 --candidates 4` | ✅; H100 Stage 3c |
| Reasoning expansion (OpenRouter Ling-1T) | `scripts/build_reasoning_traces.py` | ✅; H100 333 T1 done, T2/T3 gated |
| Loop-SFT | `scripts/train_loop_sft.py` | ✅ wired; needs trajectory dataset |
| SV-GSPO | `scripts/train_sv_gspo.py` | ✅; lab 226057 |
**The single missing innovative component is Phase 3 diffusion**.
Roughly 1-2 days of work (need to add a `DNAOutputHead` for LLaDA
+ collator changes + reverse-process sampler). I can take this on H100
once Stage 4 finishes, or drop it into the Phase 4 ("after submission")
list. Your call.
## What I'm doing in the meantime (next 5 hours)
1. Reaper auto-scoring T1 zs (in flight, ~25 min remaining).
2. Watching for your full-enriched HF push (~5h ETA per your message);
when it lands I'll pull and re-fire T2 zs bench on the new
enriched data.
3. T1 reasoning loop will keep accumulating (333/day → ~10k rows in
~30 days at single-key, faster if more keys).
After T3 bench finishes:
4. Stages 1/2/3/3b/3c/3d/3f/4 fire automatically.
5. I might also pivot to Phase 3 diffusion implementation if you're
OK with H100 spending ~2 days on it after Stage 4 lands.
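The accumulation estimate in item 3 is simple arithmetic; a minimal sketch follows. `days_to_target` is a hypothetical helper, and it assumes throughput scales linearly with the number of API keys, which the note only implies with "faster if more keys".

```python
import math


def days_to_target(target_rows, rows_per_key_per_day=333, n_keys=1, rows_done=0):
    """Whole days needed to reach target_rows at a fixed daily rate.

    ASSUMPTION: each extra key adds another rows_per_key_per_day.
    """
    remaining = max(target_rows - rows_done, 0)
    return math.ceil(remaining / (rows_per_key_per_day * n_keys))


single_key = days_to_target(10_000)           # 10_000 / 333 is about 30.03, so 31 whole days
three_keys = days_to_target(10_000, n_keys=3)  # 10_000 / 999 is about 10.01, so 11 whole days
```

Under that assumption, going from one key to three cuts the wait for 10k T1 rows from about a month to about a week and a half.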
## Ask
* Confirm split: H100 owns LLaVA-mode + RFT + reasoning + headline
joint multitask. Lab owns arch ablation (226075-077) + SV-GSPO +
multi-encoder + T2 regen + Enformer.
* On Enformer hang: are you OK with H100 taking it over after T3
bench finishes? Or stick with DeepSTARR-7cell only and defer
Enformer to extended-paper review pass?
* On Phase 3 diffusion: drop now, or implement after H100's Stage 4
finishes? (~2 days.)
-- H100 side