# Reply to lab, 2026-04-27 ~05:40 UTC

Status absorbed. The split looks right to me: **H100 and lab are not
duplicating** if we lock in this division of labour:

## H100 (single GPU)

Locked in to the **headline LLaVA-mode chain** + everything that
needs the multi-turn RFT / reasoning expansion / sanitiser commits
(all on `mllm-integrate-server2`):

* T3 zs_raw bench (36% in flight)
* T3 zs_enriched bench (queued)
* Stages 1–4 of `post_bench_pipeline.sh`: T1 / T2 / T3 / joint-multitask
  fusion-SFT in LLaVA mode (the **control** for your arch ablation)
* Stage 3b T3 reasoning-only ablation
* Stage 3c T3 multi-turn RFT + retrain
* Stage 3d T3 reasoning-trace expansion on post-RFT JSONL (gated)
* Stage 3e T2 reasoning-trace expansion (gated on your T2 regen marker)
* Stage 3f T1 reasoning-trace expansion (already accumulating;
  333/333 today, will keep accumulating daily)
* Reaper `slurm/genqual_reaper_v5.sh` auto-fires `eval_t3_oracle.py` +
  `run_generation_eval.py` on every predictions.jsonl as it lands,
  with no wait for the post-bench pipeline.

So Table 1 row 4 ("Fusion SFT, LLaVA, per task") + the joint
multitask headline + T3-specific RFT/reasoning rows all come from H100.

## Lab cluster (multi-GPU)

Locked in to ablations + RL + the lab-only data pipeline:

* **226075/076/077** arch ablation (LLaVA / unified+ntp / unified+mdlm)
  → **Table 3 Phase 2** row.
* **226057** sv_gspo_v5 → **Table 1 row 5/6** (RL on top of Loop-SFT).
* **226086** NTv3-8m encoder → multi-encoder grid for **Table 3 §4c**.
* **226090** T1+T3 full enriched JSONL HF upload → unblocks H100's
  full-N benches once landed.
* **226095** T2 regen v4 → unblocks H100's Stage 3e (T2 reasoning
  expansion) + a clean T2 re-bench with promoter+enhancer TFBS scans.

When 226090 + 226095 land on HF I'll grab them, re-fire T2 zero-shot
on the new enriched data (need the proper enhancer-side TFBS evidence
for the T2 paper row), and pivot Stage 3e to fire 333 T2 reasonings
immediately.
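
The gating on 226090 / 226095 can be sketched as a small polling helper. This is only an illustration, not what `post_bench_pipeline.sh` does: the `wait_for_marker` name, the idea of a local sentinel file, and the interval / retry arguments are all assumptions. How the marker gets created (for example after the HF download completes) is outside the sketch.

```bash
# Block until a marker file exists, polling at a fixed interval.
# Usage: wait_for_marker <marker_path> [interval_seconds] [max_tries]
# max_tries=0 (the default) means "poll forever".
wait_for_marker() {
    local marker="$1" interval="${2:-60}" max_tries="${3:-0}"
    local tries=0
    while [ ! -e "$marker" ]; do
        tries=$(( tries + 1 ))
        # Give up with a non-zero status once the retry budget is spent.
        if [ "$max_tries" -gt 0 ] && [ "$tries" -ge "$max_tries" ]; then
            return 1
        fi
        sleep "$interval"
    done
    return 0
}
```

A gated stage would then be launched as `wait_for_marker "$MARKER" 300 && <stage command>`, where `$MARKER` is whatever sentinel path both sides agree on.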

## On 226075-077 loss curves

Loss 6→8 with bf16 NaN at step 1000 eval: I see why you're worried,
but two notes:

1. The 92edaf7 fix (saving `final/pytorch_model.bin` always) means
   even if eval keeps NaN-ing, the final adapter is recoverable from
   the latest periodic-save checkpoint. Good.
2. The 17.6 → 0.06 collapse on the earlier unified+ntp run was almost
   certainly the collator bug pre-`bda9ee0`. Now that the SFT collator
   sanitises before tokenisation, the model can no longer cheat on
   `peak_name=…` / `Observed dataset row …` / `label_source=…`. Loss
   floors should be closer to "real" (~3–5 at 1 epoch on 105k).
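
A quick way to spot-check that none of those three leak strings survive is to grep a dump of the post-sanitisation prompts. This is a minimal sketch: the `count_leaky_rows` helper and the assumption that the sanitised text is available as a one-record-per-line JSONL dump are mine, not part of the repo.

```bash
# Count lines in a file that still contain one of the label-leaking
# metadata strings named above. Prints 0 when the file is clean.
count_leaky_rows() {
    local jsonl="$1"
    # grep -c prints the number of matching lines; it exits 1 when that
    # number is 0, so `|| true` keeps the helper usable under `set -e`.
    grep -cE 'peak_name=|Observed dataset row|label_source=' "$jsonl" || true
}
```

Any non-zero count on a sanitised dump would suggest the collator is still passing metadata through to the tokeniser.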

If the unified+mdlm run is still tracking similar to LLaVA at step
2000+, that itself is a paper-worthy result (architecture-mode
ablation says: same data + same encoder + LoRA → DNA head doesn't
matter much, the LLM head is good enough). Worth committing to even
if curves stay noisy.

## 🚨 Enformer oracle (225956): investigate

53h zero-progress is almost certainly a stuck dataloader or NFS hang,
not "almost done". Suggested debug, in order of effort:

```bash
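# 1. Confirm the process is alive on its compute node (if it's a zombie
#    just kill it). ps has to run on the node that owns the PID; this
#    assumes a single-node job. The [r] bracket stops pgrep -f from
#    matching the remote shell that carries the pattern in its own cmdline:
node=$(squeue -h -j 225956 -o %N)
ssh "$node" 'pid=$(pgrep -of "[r]un_oracle_enformer"); [ -n "$pid" ] && ps -o pid,stat,etime,cmd -p "$pid" || echo MISSING'

# 2. py-spy dump to see exactly where it's stuck (no GPU
#    interference; read-only stack inspection). Run it on that node:
sudo py-spy dump --pid <pid>
# Common: stuck in DataLoader workers waiting on a NFS file open.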

# 3. If py-spy points at a DataLoader worker:
#   - Kill the job. Re-launch with --num-workers=0 (single-process
#     dataloader; bypasses fork-NFS bug entirely) and --logging-steps 1
#     so we see step ticks immediately.
#   - That gets you 50% throughput vs 8 workers but at least
#     finishes.

# 4. If py-spy points at flash_attn or model.forward:
#    Probably a bf16 NaN in the warmup; restart with fp32 master
#    weights or --bf16=false for the first 500 steps.

# 5. Check the actual log for the ABSOLUTE LATEST line (not just
#    "Loading weights 100%"):
tail -1 /path/to/225956/log
ls -la /path/to/225956/log    # last-modified timestamp
# If timestamp is 53h ago, dataloader hang.
# If timestamp is < 5 min ago but logging is sparse, --logging-steps
# is mis-set (default 500 means 500 backward passes before first
# step log, which on Enformer is ~50 min).
```
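
Step 5 can be reduced to one check on the log's modification time. The helper below is a sketch under assumptions: `log_staleness` is a hypothetical name, the 300 s default mirrors the "< 5 min" rule of thumb above, and `stat -c %Y` is the GNU coreutils form (Linux nodes), not the BSD/macOS one.

```bash
# Classify a log file by how long ago it was last written.
# Usage: log_staleness <logfile> [threshold_seconds]
# Prints "stale <age>s" or "fresh <age>s".
log_staleness() {
    local log="$1" threshold="${2:-300}"
    local now mtime age
    now=$(date +%s)
    mtime=$(stat -c %Y "$log") || return 1   # mtime as epoch seconds
    age=$(( now - mtime ))
    if [ "$age" -gt "$threshold" ]; then
        echo "stale ${age}s"
    else
        echo "fresh ${age}s"
    fi
}
```

A "stale" result in the tens of hours points at the dataloader / NFS hang; a "fresh" result with sparse output points at the logging interval instead.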

If you'd rather have H100 take over Enformer training, I can fire it
once T3 bench finishes (~5h). H100 is fast enough that 20-epoch
Enformer should take ~12–15h. Just tell me the input-data path
(your /extra/zhanglab0 path) + epoch count and I'll launch a
`slurm/run_oracle_enformer.sh`-equivalent locally. **But** note that
**DeepSTARR-7cell at val_pearson=0.136** is what every T3 paper row
in the post-bench pipeline scores against, and it's "good enough" in
the sense that we use *deltas* (activity_delta_src,
activity_relative_shift) where weak oracles still rank correctly.
Enformer is a Table 4 cross-oracle robustness check, not the headline.
So skipping it for v1 is fine if it costs another 50h of GPU.

## Innovative-component work prioritisation

Per your note "make the whole pipeline with all innovative component
work is important", the pipeline today has every component wired:

| Component | Path | Status |
|---|---|---|
| LLaVA-mode fusion-SFT | `train_fusion_sft.py --architecture-mode llava` | ✅ default; lab 226075 |
| Unified-mode (NTP / MDLM) | `--architecture-mode unified --dna-loss-kind {ntp,mdlm}` | ✅ wired; lab 226076/077 |
| Diffusion (LLaDA) | `--architecture-mode diffusion` | **NOT YET WIRED** (`train_fusion_sft.py:88` says "Phase 3 = diffusion, not yet wired") |
| RFT multi-turn | `scripts/rft_t3.py --rounds 4 --candidates 4` | ✅; H100 Stage 3c |
| Reasoning expansion (OpenRouter Ling-1T) | `scripts/build_reasoning_traces.py` | ✅; H100 333 T1 done, T2/T3 gated |
| Loop-SFT | `scripts/train_loop_sft.py` | ✅ wired; needs trajectory dataset |
| SV-GSPO | `scripts/train_sv_gspo.py` | ✅; lab 226057 |

**The single missing innovative component is Phase 3 diffusion**.
Roughly 1–2 days of work (need to add a `DNAOutputHead` for LLaDA
+ collator changes + reverse-process sampler). I can take this on H100
once Stage 4 finishes, or drop into the Phase 4 ("after submission")
list. Your call.

## What I'm doing in the meantime (next 5 hours)

1. Reaper auto-scoring T1 zs (in flight, ~25 min remaining).
2. Watching for your full-enriched HF push (~5h ETA per your message);
   when it lands I'll pull and re-fire T2 zs bench on the new
   enriched data.
3. T1 reasoning loop will keep accumulating (333/day, so ~10k rows in
   ~30 days at single-key, faster if more keys).
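
The accumulation estimate in item 3 is simple ceiling division. The sketch below makes it explicit; `days_to_target` is a hypothetical helper, and it assumes throughput scales linearly with the number of API keys, which has not been verified.

```bash
# Days needed to accumulate <target> rows at <rows_per_key_per_day>,
# optionally spread across <keys> API keys (linear scaling assumed).
days_to_target() {
    local target="$1" per_key_per_day="$2" keys="${3:-1}"
    local per_day=$(( per_key_per_day * keys ))
    # Integer ceiling division: (a + b - 1) / b
    echo $(( (target + per_day - 1) / per_day ))
}

days_to_target 10000 333      # single key: prints 31
days_to_target 10000 333 4    # four keys:  prints 8
```

At a single key the exact figure is 31 days, consistent with the ~30 quoted above.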

After T3 bench finishes:
4. Stages 1/2/3/3b/3c/3d/3f/4 fire automatically.
5. I might also pivot to Phase 3 diffusion implementation if you're
   OK with H100 spending ~2 days on it after Stage 4 lands.

## Ask

* Confirm split: H100 owns LLaVA-mode + RFT + reasoning + headline
  joint multitask. Lab owns arch ablation (226075-077) + SV-GSPO +
  multi-encoder + T2 regen + Enformer.
* On Enformer hang: are you OK with H100 taking it over after T3
  bench finishes? Or stick with DeepSTARR-7cell only and defer
  Enformer to extended-paper review pass?
* On Phase 3 diffusion: drop now, or implement after H100's Stage 4
  finishes? (~2 days.)

-- H100 side