# Reply to lab, 2026-04-27 ~05:40 UTC

Status absorbed. The split looks right to me; **H100 and lab are not
duplicating** as long as we lock in this division of labour:

## H100 (single GPU)

Locked in to the **headline LLaVA-mode chain** plus everything that
needs the multi-turn RFT / reasoning-expansion / sanitiser commits
(all on `mllm-integrate-server2`):

* T3 zs_raw bench (in flight, 36% done)
* T3 zs_enriched bench (queued)
* Stages 1–4 of `post_bench_pipeline.sh`: T1 / T2 / T3 / joint-multitask
  fusion-SFT in LLaVA mode (the **control** for your arch ablation)
* Stage 3b: T3 reasoning-only ablation
* Stage 3c: T3 multi-turn RFT + retrain
* Stage 3d: T3 reasoning-trace expansion on post-RFT JSONL (gated)
* Stage 3e: T2 reasoning-trace expansion (gated on your T2 regen marker)
* Stage 3f: T1 reasoning-trace expansion (already accumulating;
  333/333 today, and it will keep accumulating daily)
* Reaper `slurm/genqual_reaper_v5.sh` auto-fires `eval_t3_oracle.py` and
  `run_generation_eval.py` on every predictions.jsonl as it lands, with
  no need to wait for the post-bench pipeline (a minimal sketch of the
  watch-and-fire pattern follows this list).
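
For context, the reaper is just a poll-and-dispatch loop. The Python
sketch below is illustrative only: `WATCH_ROOT` and the `--predictions`
flags are assumptions, and the real logic lives in
`slurm/genqual_reaper_v5.sh`.

```python
# Illustrative poll-and-dispatch loop; the real reaper lives in
# slurm/genqual_reaper_v5.sh. WATCH_ROOT and the --predictions flags
# are assumptions, not the actual interface.
import subprocess
import time
from pathlib import Path

WATCH_ROOT = Path("outputs/benches")  # hypothetical bench output root
seen = set()

while True:
    for pred in WATCH_ROOT.rglob("predictions.jsonl"):
        if pred in seen:
            continue
        seen.add(pred)
        # Fire both scorers on each predictions.jsonl as it lands.
        for script in ("eval_t3_oracle.py", "run_generation_eval.py"):
            subprocess.run(
                ["python", script, "--predictions", str(pred)],
                check=False,  # a failed score should not kill the reaper
            )
    time.sleep(60)  # polling is cheap relative to bench runtimes
```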

So Table 1 row 4 ("Fusion SFT, LLaVA, per task"), the joint-multitask
headline, and the T3-specific RFT/reasoning rows all come from H100.

## Lab cluster (multi-GPU)

Locked in to ablations, RL, and the lab-only data pipeline:

* **226075/076/077** arch ablation (LLaVA / unified+ntp / unified+mdlm)
  → the **Table 3 Phase 2** row.
* **226057** sv_gspo_v5 → **Table 1 rows 5/6** (RL on top of Loop-SFT).
* **226086** NTv3-8m encoder → multi-encoder grid for **Table 3 §4c**.
* **226090** T1+T3 full enriched JSONL HF upload → unblocks H100's
  full-N benches once landed.
* **226095** T2 regen v4 → unblocks H100's Stage 3e (T2 reasoning
  expansion) plus a clean T2 re-bench with promoter+enhancer TFBS scans.

When 226090 + 226095 land on HF I'll grab them, re-fire T2 zero-shot
on the new enriched data (we need the proper enhancer-side TFBS
evidence for the T2 paper row), and pivot Stage 3e to fire the 333 T2
reasoning traces immediately.

## On 226075–077 loss curves

Loss climbing from 6 to 8 with a bf16 NaN at the step-1000 eval: I see
why you're worried, but two notes:

1. The 92edaf7 fix (always saving `final/pytorch_model.bin`) means
   that even if eval keeps NaN-ing, the final adapter is recoverable
   from the latest periodic-save checkpoint. Good.
2. The 17.6 → 0.06 collapse on the earlier unified+ntp run was almost
   certainly the collator bug pre-`bda9ee0`. Now that the SFT collator
   sanitises before tokenisation, the model can no longer cheat on
   `peak_name=…` / `Observed dataset row …` / `label_source=…`, so loss
   floors should be closer to "real" (~3–5 at 1 epoch on 105k rows). A
   sketch of the leak-stripping idea follows below.
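
For reference, a minimal sketch of the leak-stripping idea. The field
names come from the note above, but the regexes and the function are
illustrative, not the actual `bda9ee0` collator code.

```python
import re

# Label-leaking metadata fields named above; the patterns here are
# illustrative, not the real bda9ee0 collator code.
_LEAK_PATTERNS = [
    re.compile(r"peak_name=\S+"),
    re.compile(r"Observed dataset row[^\n]*"),
    re.compile(r"label_source=\S+"),
]

def sanitise_text(text: str) -> str:
    """Strip label-leaking metadata before tokenisation so the loss
    reflects the task rather than memorised identifiers."""
    for pat in _LEAK_PATTERNS:
        text = pat.sub("", text)
    return text
```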

If the unified+mdlm run is still tracking similarly to LLaVA at step
2000+, that is itself a paper-worthy result (the architecture-mode
ablation would say: with the same data, the same encoder, and LoRA,
the DNA head doesn't matter much; the LLM head is good enough). Worth
committing to even if the curves stay noisy.

## 🚨 Enformer oracle (225956): investigate

53 h of zero progress is almost certainly a stuck dataloader or an NFS
hang, not "almost done". Suggested debugging, in order of effort:

```bash
# 1. Confirm the process is alive on the compute node. A pgrep PID from
#    the node is only meaningful there, so run ps on the node too; if
#    the process is a zombie, just kill the job:
node=$(squeue -h -j 225956 -o %N)
ssh "$node" 'pid=$(pgrep -f run_oracle_enformer) \
  && ps -o pid,stat,etime,cmd -p "$pid" || echo MISSING'

# 2. py-spy dump to see exactly where it's stuck (read-only stack
#    inspection, no GPU interference):
sudo py-spy dump --pid <pid>
# Common outcome: stuck in DataLoader workers waiting on an NFS file open.

# 3. If py-spy points at a DataLoader worker:
#    - Kill the job. Re-launch with --num-workers=0 (single-process
#      dataloader; bypasses the fork-NFS bug entirely) and
#      --logging-steps 1 so we see step ticks immediately.
#    - That gets roughly half the throughput of 8 workers, but it at
#      least finishes.

# 4. If py-spy points at flash_attn or model.forward:
#    probably a bf16 NaN in the warmup; restart with fp32 master
#    weights or --bf16=false for the first 500 steps.

# 5. Check the actual log for the ABSOLUTE LATEST line (not just
#    "Loading weights 100%"):
tail -n 1 /path/to/225956/log
ls -la /path/to/225956/log   # last-modified timestamp
# If the timestamp is 53 h old: dataloader hang.
# If the timestamp is < 5 min old but logging is sparse: --logging-steps
# is mis-set (the default of 500 means 500 backward passes before the
# first step log, which on Enformer is ~50 min).
```

If you'd rather have H100 take over Enformer training, I can fire it
once the T3 bench finishes (~5 h). H100 is fast enough that a 20-epoch
Enformer run should take ~12–15 h. Just tell me the input-data path
(your /extra/zhanglab0 path) and the epoch count, and I'll launch a
local equivalent of `slurm/run_oracle_enformer.sh`. **But** note that
**DeepSTARR-7cell at val_pearson=0.136** is what every T3 paper row in
the post-bench pipeline scores against, and it is "good enough" in the
sense that we use *deltas* (activity_delta_src,
activity_relative_shift), where even a weak oracle still ranks edits
correctly (a toy illustration follows below). Enformer is a Table 4
cross-oracle robustness check, not the headline, so skipping it for v1
is fine if keeping it would cost another 50 h of GPU.
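
To make the delta argument concrete, here is a synthetic toy (all
numbers made up, not DeepSTARR data): if the oracle is monotone in true
activity but dominated by a per-sequence bias shared between the source
and edited sequence, the bias cancels in the subtraction, so delta
rankings survive even when the absolute correlation is weak.

```python
# Toy illustration: a weak oracle (low Pearson on absolute activity)
# can still rank edits correctly when we score *differences* between
# source and edited sequences, because a shared per-sequence bias
# cancels in the subtraction. All numbers are synthetic.
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(0)
n = 1000
true_src = rng.normal(size=n)
true_edit = true_src + rng.normal(scale=0.5, size=n)  # true edit effects

def weak_oracle(x, bias):
    # Monotone in x but dominated by a shared per-sequence bias term.
    return 0.3 * x + bias

bias = rng.normal(scale=2.0, size=n)  # shared between src and edit
pred_src = weak_oracle(true_src, bias)
pred_edit = weak_oracle(true_edit, bias)

print(pearsonr(true_edit, pred_edit))   # weak on absolute activity
print(spearmanr(true_edit - true_src,   # strong on deltas: bias cancels
                pred_edit - pred_src))
```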

## Innovative-component work prioritisation

Per your note that making the whole pipeline work with every innovative
component is important: the pipeline today has every component wired.

| Component | Path | Status |
|---|---|---|
| LLaVA-mode fusion-SFT | `train_fusion_sft.py --architecture-mode llava` | ✅ default; lab 226075 |
| Unified-mode (NTP / MDLM) | `--architecture-mode unified --dna-loss-kind {ntp,mdlm}` | ✅ wired; lab 226076/077 |
| Diffusion (LLaDA) | `--architecture-mode diffusion` | **NOT YET WIRED** (`train_fusion_sft.py:88` says "Phase 3 = diffusion, not yet wired") |
| RFT multi-turn | `scripts/rft_t3.py --rounds 4 --candidates 4` | ✅; H100 Stage 3c |
| Reasoning expansion (OpenRouter Ling-1T) | `scripts/build_reasoning_traces.py` | ✅; H100 333 T1 done, T2/T3 gated |
| Loop-SFT | `scripts/train_loop_sft.py` | ✅ wired; needs trajectory dataset |
| SV-GSPO | `scripts/train_sv_gspo.py` | ✅; lab 226057 |

**The single missing innovative component is Phase 3 diffusion.**
Roughly 1–2 days of work: we'd need to add a `DNAOutputHead` for
LLaDA, collator changes, and a reverse-process sampler (a rough sketch
of the sampler's shape follows below). I can take this on H100 once
Stage 4 finishes, or drop it into the Phase 4 ("after submission")
list. Your call.
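
For scoping, here's the rough shape I have in mind for the
reverse-process sampler (LLaDA-style confidence-based unmasking).
Everything below is a placeholder sketch: `model`, `mask_id`, and the
unmasking schedule are assumptions, nothing here is wired code.

```python
import torch

@torch.no_grad()
def reverse_process_sample(model, prompt_ids, dna_len, mask_id, steps=16):
    """LLaDA-style masked-diffusion sampling sketch for a DNA span.

    `model` is assumed to return HF-style `.logits`; nothing here is
    wired into train_fusion_sft.py yet.
    """
    # Start from an all-masked DNA span appended to the prompt.
    dna = torch.full((1, dna_len), mask_id, device=prompt_ids.device)
    for step in range(steps):
        logits = model(torch.cat([prompt_ids, dna], dim=1)).logits
        probs = logits[:, -dna_len:, :].softmax(-1)
        conf, pred = probs.max(-1)
        masked = dna.eq(mask_id)
        n_masked = int(masked.sum())
        if n_masked == 0:
            break
        # Unmask ~1/remaining-steps of the still-masked positions per
        # step, highest model confidence first (the usual MDLM schedule).
        k = max(1, n_masked // (steps - step))
        conf = conf.masked_fill(~masked, -1.0)
        unmask = torch.zeros_like(masked)
        unmask.scatter_(1, conf.topk(k, dim=-1).indices, True)
        dna = torch.where(unmask & masked, pred, dna)
    return dna
```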

## What I'm doing in the meantime (next 5 hours)

1. Reaper auto-scoring T1 zs (in flight, ~25 min remaining).
2. Watching for your full-enriched HF push (~5 h ETA per your message);
   when it lands I'll pull it and re-fire the T2 zs bench on the new
   enriched data.
3. The T1 reasoning loop will keep accumulating (333/day, i.e. ~10k
   rows in ~30 days on a single key; faster with more keys).

After the T3 bench finishes:

4. Stages 1/2/3/3b/3c/3d/3f/4 fire automatically.
5. I might also pivot to the Phase 3 diffusion implementation if
   you're OK with H100 spending ~2 days on it after Stage 4 lands.

## Ask

* Confirm the split: H100 owns LLaVA-mode + RFT + reasoning + the
  headline joint multitask; lab owns the arch ablation (226075–077) +
  SV-GSPO + multi-encoder + T2 regen + Enformer.
* On the Enformer hang: are you OK with H100 taking it over after the
  T3 bench finishes? Or do we stick with DeepSTARR-7cell only and
  defer Enformer to the extended-paper review pass?
* On Phase 3 diffusion: drop it now, or implement it after H100's
  Stage 4 finishes? (~2 days.)

-- H100 side