# Reply to lab – 2026-04-27 ~05:40 UTC

Status absorbed. The split looks right to me: **H100 and lab are not
duplicating** if we lock in this division of labour:

## H100 (single GPU)

Locked in to the **headline LLaVA-mode chain** + everything that
needs the multi-turn RFT / reasoning expansion / sanitiser commits
(all on `mllm-integrate-server2`):

* T3 zs_raw bench (in flight, 36% done)
* T3 zs_enriched bench (queued)
* Stages 1–4 of `post_bench_pipeline.sh`: T1 / T2 / T3 / joint-multitask
  fusion-SFT in LLaVA mode (the **control** for your arch ablation)
* Stage 3b: T3 reasoning-only ablation
* Stage 3c: T3 multi-turn RFT + retrain
* Stage 3d: T3 reasoning-trace expansion on post-RFT JSONL (gated)
* Stage 3e: T2 reasoning-trace expansion (gated on your T2 regen marker)
* Stage 3f: T1 reasoning-trace expansion (already accumulating;
  333/333 today, will keep accumulating daily)
* Reaper `slurm/genqual_reaper_v5.sh` auto-fires `eval_t3_oracle.py` +
  `run_generation_eval.py` on every predictions.jsonl as it lands, so
  nothing waits on the post-bench pipeline (a minimal sketch of the
  watch loop is below).

So Table 1 row 4 ("Fusion SFT, LLaVA, per task") + the joint multitask
headline + the T3-specific RFT/reasoning rows all come from H100.
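
For concreteness, a minimal sketch of the reaper's watch-and-fire loop.
The real logic is whatever `slurm/genqual_reaper_v5.sh` does; the
directory root, the 60 s poll, the `.scored` marker, and the
`--predictions` flags here are assumptions for illustration only.

```bash
# Minimal sketch, not the real genqual_reaper_v5.sh: poll for new
# predictions.jsonl files and score each one exactly once.
# RUN_ROOT, the .scored marker, and the --predictions flags are
# placeholders; use whatever the eval scripts actually expect.
RUN_ROOT=outputs/benches
while true; do
  find "$RUN_ROOT" -name predictions.jsonl | while read -r preds; do
    marker="${preds%.jsonl}.scored"
    [ -e "$marker" ] && continue          # already scored on a prior pass
    python eval_t3_oracle.py --predictions "$preds" && \
      python run_generation_eval.py --predictions "$preds" && \
      touch "$marker"
  done
  sleep 60
done
```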

## Lab cluster (multi-GPU)

Locked in to ablations + RL + the lab-only data pipeline:

* **226075/076/077** arch ablation (LLaVA / unified+ntp / unified+mdlm)
  → **Table 3 Phase 2** row.
* **226057** sv_gspo_v5 → **Table 1 row 5/6** (RL on top of Loop-SFT).
* **226086** NTv3-8m encoder → multi-encoder grid for **Table 3 §4c**.
* **226090** T1+T3 full enriched JSONL HF upload → unblocks H100's
  full-N benches once landed.
* **226095** T2 regen v4 → unblocks H100's Stage 3e (T2 reasoning
  expansion) + a clean T2 re-bench with promoter+enhancer TFBS scans.

When 226090 + 226095 land on HF I'll grab them, re-fire T2 zero-shot
on the new enriched data (need the proper enhancer-side TFBS evidence
for the T2 paper row), and pivot Stage 3e to fire 333 T2 reasonings
immediately.
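
The pull-and-refire I have in mind looks roughly like this; the dataset
repo id, file name, and bench launcher are placeholders until your
upload message gives the real ones.

```bash
# Sketch of the pull-and-refire once 226090/226095 hit HF.
# <lab-org/enriched-dataset>, the JSONL name, and launch_t2_zs_bench.sh
# are placeholders, not the real identifiers.
huggingface-cli download <lab-org/enriched-dataset> \
  --repo-type dataset --local-dir data/enriched
bash slurm/launch_t2_zs_bench.sh data/enriched/t2_enriched.jsonl
```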

## On 226075-077 loss curves

Loss at 6–8 with a bf16 NaN at the step-1000 eval: I see why you're
worried, but two notes:

1. The 92edaf7 fix (always saving `final/pytorch_model.bin`) means that
   even if eval keeps NaN-ing, the final adapter is recoverable from
   the latest periodic-save checkpoint. Good.
2. The 17.6 → 0.06 collapse on the earlier unified+ntp run was almost
   certainly the collator bug pre-`bda9ee0`. Now that the SFT collator
   sanitises before tokenisation, the model can no longer cheat on
   `peak_name=…` / `Observed dataset row …` / `label_source=…`. Loss
   floors should be closer to "real" (~3–5 at 1 epoch on 105k); a rough
   sketch of the scrub is below.
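
For note 2, an illustration of the kind of scrub involved; the real
sanitiser is the collator change in `bda9ee0`, so treat the patterns
and filenames below as illustrative only.

```bash
# Illustrative only: the real sanitiser lives in the SFT collator
# (commit bda9ee0) and runs before tokenisation. This just shows the
# kind of label-leakage text being stripped; filenames are hypothetical.
sed -E \
  -e 's/peak_name=[^[:space:]"]*//g' \
  -e 's/label_source=[^[:space:]"]*//g' \
  -e 's/Observed dataset row[^"]*//g' \
  t3_train_raw.jsonl > t3_train_scrubbed.jsonl
```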

If the unified+mdlm run is still tracking close to LLaVA at step
2000+, that is itself a paper-worthy result (the architecture-mode
ablation says: same data + same encoder + LoRA → the DNA head doesn't
matter much, the LLM head is good enough). Worth committing to even
if the curves stay noisy.

## 🚨 Enformer oracle (225956): investigate

53h of zero progress is almost certainly a stuck dataloader or NFS hang,
not "almost done". Suggested debug, in order of effort:

```bash
# 1. Confirm the process is alive on the compute node (if it's a zombie,
#    just kill it). pgrep/ps have to run on the node itself:
NODE=$(squeue -h -j 225956 -o %N)
ssh "$NODE" 'pid=$(pgrep -of run_oracle_enformer) && ps -o pid,stat,etime,cmd -p "$pid" || echo MISSING'

# 2. py-spy dump to see exactly where it's stuck (no GPU interference;
#    read-only stack inspection):
sudo py-spy dump --pid <pid>
# Common culprit: stuck in DataLoader workers waiting on an NFS file open.

# 3. If py-spy points at a DataLoader worker:
#    - Kill the job. Re-launch with --num-workers=0 (single-process
#      dataloader; bypasses the fork-over-NFS bug entirely) and
#      --logging-steps 1 so we see step ticks immediately.
#    - That gives roughly 50% of the 8-worker throughput, but it
#      at least finishes.

# 4. If py-spy points at flash_attn or model.forward:
#    probably a bf16 NaN in the warmup; restart with fp32 master
#    weights or --bf16=false for the first 500 steps.

# 5. Check the actual log for the ABSOLUTE LATEST line (not just
#    "Loading weights 100%"):
tail -1 /path/to/225956/log
ls -la /path/to/225956/log   # last-modified timestamp
# If the timestamp is 53h old: dataloader hang.
# If the timestamp is < 5 min old but logging is sparse: --logging-steps
# is mis-set (the default of 500 means 500 backward passes before the
# first step log, which on Enformer is ~50 min).
```

If you'd rather have H100 take over Enformer training, I can fire it
once the T3 bench finishes (~5h). H100 is fast enough that 20-epoch
Enformer should take ~12–15h. Just tell me the input-data path
(your /extra/zhanglab0 path) + epoch count and I'll launch a
`slurm/run_oracle_enformer.sh` equivalent locally (rough sketch below).
**But**: **DeepSTARR-7cell at val_pearson=0.136** is what every T3 paper
row in the post-bench pipeline scores against, and it's "good enough" in
the sense that we use *deltas* (activity_delta_src,
activity_relative_shift), where weak oracles still rank correctly.
Enformer is a Table 4 cross-oracle robustness check, not the headline,
so skipping it for v1 is fine if it would cost another 50h of GPU.
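
A minimal sketch of what the local takeover launch could look like,
assuming `slurm/run_oracle_enformer.sh` can run outside sbatch and takes
environment overrides; `DATA_DIR`, `EPOCHS`, and `NUM_WORKERS` are
assumed names, not the script's confirmed interface.

```bash
# Sketch only: assumes the slurm wrapper runs outside sbatch and honours
# these env overrides. DATA_DIR / EPOCHS / NUM_WORKERS are hypothetical
# names; <enformer_input_path> is whatever path you send me.
DATA_DIR=/extra/zhanglab0/<enformer_input_path> \
EPOCHS=20 \
NUM_WORKERS=0 \
bash slurm/run_oracle_enformer.sh 2>&1 | tee logs/enformer_oracle_h100.log
```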

## Innovative-component work prioritisation

Per your note that making the whole pipeline work with every innovative
component is important: the pipeline today has every component wired:

| Component | Path | Status |
|---|---|---|
| LLaVA-mode fusion-SFT | `train_fusion_sft.py --architecture-mode llava` | ✅ default; lab 226075 |
| Unified-mode (NTP / MDLM) | `--architecture-mode unified --dna-loss-kind {ntp,mdlm}` | ✅ wired; lab 226076/077 |
| Diffusion (LLaDA) | `--architecture-mode diffusion` | **NOT YET WIRED** (`train_fusion_sft.py:88` says "Phase 3 = diffusion, not yet wired") |
| RFT multi-turn | `scripts/rft_t3.py --rounds 4 --candidates 4` | ✅; H100 Stage 3c |
| Reasoning expansion (OpenRouter Ling-1T) | `scripts/build_reasoning_traces.py` | ✅; H100 333 T1 done, T2/T3 gated |
| Loop-SFT | `scripts/train_loop_sft.py` | ✅ wired; needs trajectory dataset |
| SV-GSPO | `scripts/train_sv_gspo.py` | ✅; lab 226057 |

**The single missing innovative component is Phase 3 diffusion.**
Roughly 1–2 days of work (needs a `DNAOutputHead` for LLaDA, collator
changes, and a reverse-process sampler). I can take this on H100 once
Stage 4 finishes, or drop it into the Phase 4 ("after submission")
list. Your call.

## What I'm doing in the meantime (next 5 hours)

1. Reaper auto-scoring the T1 zs bench (in flight, ~25 min remaining).
2. Watching for your full-enriched HF push (~5h ETA per your message);
   when it lands I'll pull it and re-fire the T2 zs bench on the new
   enriched data.
3. T1 reasoning loop keeps accumulating (333/day → ~10k rows in
   ~30 days on a single key, faster with more keys).

After the T3 bench finishes:

4. Stages 1/2/3/3b/3c/3d/3f/4 fire automatically.
5. I might also pivot to the Phase 3 diffusion implementation if you're
   OK with H100 spending ~2 days on it after Stage 4 lands.

## Ask

* Confirm the split: H100 owns LLaVA-mode + RFT + reasoning + the
  headline joint multitask. Lab owns the arch ablation (226075-077) +
  SV-GSPO + multi-encoder + T2 regen + Enformer.
* On the Enformer hang: are you OK with H100 taking it over after the
  T3 bench finishes? Or stick with DeepSTARR-7cell only and defer
  Enformer to the extended-paper review pass?
* On Phase 3 diffusion: drop it now, or implement it after H100's
  Stage 4 finishes? (~2 days.)

– H100 side