explcre committed on
Commit
bc05167
·
verified ·
1 Parent(s): aeb566f

Upload docs/experiment_chain_v5_unified.md with huggingface_hub

Files changed (1)
  1. docs/experiment_chain_v5_unified.md +249 -0
docs/experiment_chain_v5_unified.md ADDED
@@ -0,0 +1,249 @@
1
+ # Experiment chain: unified-MM LLM (paper-grade, v5)
2
+
3
+ Single document that ties together every `.py` and `.sh` we run from
4
+ zero-shot bench to final SV-GSPO checkpoint. v4 (in
5
+ `experiment_chain.md`) covered the per-task LLM progression. v5 adds
6
+ the unified-multimodal stack and the post-bench training pipeline that
7
+ auto-fires after the bench grid finishes.
8
+
9
+ Run order is the same as the order of stages in
10
+ `/dev/shm/dnathinker/post_bench_pipeline.sh`: the H100 just reads
11
+ that script top-to-bottom, no SLURM dependencies.
12
+
13
+ ## 0. Bench grid (zero-shot baselines)
14
+
15
+ | Stage | Script | Output | Purpose |
16
+ |-------|--------|--------|---------|
17
+ | ZS-T1 raw | `scripts/run_llm_benchmark_vllm.py --task enhancer_generation --prompt raw` | `runs/exp_t1_grid_*/zs_raw/{predictions,metrics}.json{,l}` | Paper Table 1 row 1 |
18
+ | ZS-T1 enriched | same w/ `--prompt enriched` | `runs/exp_t1_grid_*/zs_enriched/...` | Table 1 row 2 |
19
+ | ZS-T2 raw | `--task pair_prediction --prompt raw` | `runs/exp_t2_grid_*/zs_raw/...` | Table 1 row 1 (T2) |
20
+ | ZS-T2 enriched | same enriched | `runs/exp_t2_grid_*/zs_enriched/...` | Table 1 row 2 (T2) |
21
+ | ZS-T3 raw / enriched | `--task enhancer_editing` × {raw, enriched} | `runs/exp_t3_grid_*/...` | Table 1 rows 1–2 (T3) |
22
+
23
+ Driver: `/dev/shm/dnathinker/launch_bench_vllm.sh` runs the 6 vLLM
24
+ benches sequentially. When the orchestrator PID exits, an attached
25
+ watcher fires `post_bench_pipeline.sh`.
26
+
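The six benches are the cross product of three tasks and two prompt modes. A minimal sketch of how that run list could be enumerated follows; the `--task` and `--prompt` flags and their values come from the table above, while the `bench_commands` helper is hypothetical and is not taken from `launch_bench_vllm.sh`.

```python
from itertools import product

# Task and prompt names as they appear in the bench-grid table.
TASKS = ["enhancer_generation", "pair_prediction", "enhancer_editing"]  # T1, T2, T3
PROMPTS = ["raw", "enriched"]


def bench_commands(script="scripts/run_llm_benchmark_vllm.py"):
    """Return one argv list per (task, prompt) cell, in sequential run order."""
    return [
        ["python", script, "--task", task, "--prompt", prompt]
        for task, prompt in product(TASKS, PROMPTS)
    ]


for argv in bench_commands():
    print(" ".join(argv))
```

Because `product` varies the prompt fastest, the order is T1 raw, T1 enriched, T2 raw, and so on, which matches the row order of the table.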
27
+ ## 1. Post-bench pipeline (auto-triggered)
28
+
29
+ `/dev/shm/dnathinker/post_bench_pipeline.sh`. Each stage skip-checks
30
+ on its own output file, so re-runs are idempotent.
31
+
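The skip-check idea can be sketched in a few lines. This is an illustration in Python under assumed names (`run_stage`, `fake_scoring_stage`); the real pipeline is a shell script and its exact checks are not shown in this document.

```python
import tempfile
from pathlib import Path


def run_stage(name, output_path, stage_fn):
    """Run stage_fn only if its output file is missing; return True if it ran."""
    output_path = Path(output_path)
    if output_path.exists():
        print(f"[skip] {name}: {output_path.name} already exists")
        return False
    output_path.parent.mkdir(parents=True, exist_ok=True)
    stage_fn(output_path)
    return True


def fake_scoring_stage(out):
    # Stand-in for a real stage: writes the file the skip-check looks for.
    out.write_text('{"status": "done"}\n')


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "stage0" / "genqual.json"
    first = run_stage("stage0", target, fake_scoring_stage)
    second = run_stage("stage0", target, fake_scoring_stage)

print(first, second)  # True False
```

One caveat of checking only for file existence: a stage that crashes after creating a partial output would be skipped on the next run, so writing to a temporary name and renaming at the end is a common hardening step.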
32
+ ### Stage 0: ZS scoring (early HF push)
33
+
34
+ * `scripts/run_generation_eval.py` → `genqual.json` (FBD / spec /
35
+ argmax-acc / per-cell-type) for T1+T3 zs_raw / zs_enriched.
36
+ * `scripts/eval_t3_oracle.py` → `genqual_t3_oracle.json`
37
+ (within-budget, length-preserved, objective-success per
38
+ edit_type, per-cell-type) on T3 zs predictions.
39
+ * HF push of the partial bench results so the lab can see numbers
40
+ before training stages finish.
41
+
42
+ ### Stages 1–4: Fusion-SFT family (the headline)
43
+
44
+ Each `run_fusion` call invokes `scripts/train_fusion_sft.py` with
45
+ `--architecture-mode llava`, then **Stage Nb** invokes
46
+ `scripts/predict_fusion.py` on the trained adapter to get
47
+ predictions on the full test set, followed by
48
+ `run_generation_eval.py` (T1/T3) and `eval_t3_oracle.py` (T3 only).
49
+ These produce the `lora_raw` / `lora_enriched` rows in Table 1.
50
+
51
+ | Stage | Train script call | Inference + scoring | Paper row |
52
+ |-------|---|---|---|
53
+ | 1 | T1 fusion-SFT (n35k T1) | `score_adapter T1 ... raw / enriched` | T1 row 4 |
54
+ | 2 | T2 fusion-SFT (n35k T2 balanced) | `score_adapter T2 ... raw / enriched` | T2 row 4 |
55
+ | 3 | T3 fusion-SFT (n35k T3, heuristic gold) | `score_adapter T3 ... raw / enriched` | T3 row 4a |
56
+ | 3b | T3 reasoning-only SFT (`--mask-assistant-dna-span`) | same | T3 row 4b (paper ablation) |
57
+ | 3c | T3 RFT (Stage A → K candidates → oracle-filter → re-SFT) | same | T3 row 4c (paper ablation) |
58
+ | 4 | **Joint multitask** fusion-SFT (105k = 35k×3 balanced) | `score_adapter` × {T1,T2,T3} × {raw,enriched} | **headline row**: one model, three tasks |
59
+
60
+ `score_adapter` is defined inside `post_bench_pipeline.sh`. It exists
61
+ because `run_llm_benchmark.py --adapter-dir` expects PEFT format
62
+ (`adapter_model.bin` + `adapter_config.json`), and our
63
+ `FusionSFTTrainer` saves a **full** OneShotFusionLM state_dict (LLM +
64
+ LoRA + NTv3 projector + cell context encoder) via `torch.save`.
65
+ `predict_fusion.py` rebuilds the model and `load_state_dict`s it,
66
+ then runs `model.llm.generate` with the same prompt builder + parser
67
+ that `ZeroShotLLM.predict` uses, so `predictions.jsonl` is
68
+ shape-compatible with the genqual + T3-oracle scorers. This is the single
69
+ bridge between training output and the eval pipeline.
70
+
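The format mismatch described above can be made concrete by classifying a checkpoint directory from its file names. The sketch below is an assumption-laden illustration: the PEFT file names are the ones quoted in this section, `pytorch_model.bin` is the file name used for the fusion checkpoint elsewhere in this document, and `checkpoint_format` is a hypothetical helper that is not part of `post_bench_pipeline.sh`.

```python
import tempfile
from pathlib import Path

# File names quoted in this document; the helper itself is hypothetical.
PEFT_FILES = {"adapter_model.bin", "adapter_config.json"}


def checkpoint_format(ckpt_dir):
    """Classify a checkpoint directory by the files it contains."""
    names = {p.name for p in Path(ckpt_dir).iterdir() if p.is_file()}
    if PEFT_FILES <= names:
        return "peft"
    if "pytorch_model.bin" in names:
        return "full_state_dict"
    return "unknown"


with tempfile.TemporaryDirectory() as tmp:
    peft_dir = Path(tmp) / "peft"
    full_dir = Path(tmp) / "final"
    for directory, files in ((peft_dir, PEFT_FILES), (full_dir, {"pytorch_model.bin"})):
        directory.mkdir()
        for name in files:
            (directory / name).touch()
    formats = (checkpoint_format(peft_dir), checkpoint_format(full_dir))

print(formats)  # ('peft', 'full_state_dict')
```

A check like this makes the routing decision explicit: PEFT-style directories can go to the `--adapter-dir` path, and full state dicts need the `predict_fusion.py` route.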
71
+ ### Stages 5–6: NTv3-only baselines
72
+
73
+ * Stage 5: `scripts/train_generation.py --head mdlm` (NTv3-MDLM on T1).
74
+ * Stage 6: `scripts/train_ntv3_direct.py` (NTv3-direct on T2).
75
+ * These supply the "no LLM" rows in Table 1, which show that the LLM contributes signal.
76
+
77
+ ### Stage 7: Aggregator + final HF push
78
+
79
+ * `aggregate_results.py` walks `runs/`, collapses
80
+ `(task, mode, prompt)` and writes
81
+ `/dev/shm/dnathinker/results/h100_snapshot.md`.
82
+ * HF push of metrics + genqual + h100_snapshot.md.
83
+
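A rough sketch of the collapse-then-render step is shown below. It assumes that "collapse" means keeping one entry per `(task, mode, prompt)` key with later records overriding earlier ones; the record layout, the `collapse` and `to_markdown` helpers, and the dummy `n_rows` values are all invented for illustration and do not describe `aggregate_results.py` itself.

```python
def collapse(records):
    """Keep one metrics dict per (task, mode, prompt); later records win."""
    latest = {}
    for rec in records:
        latest[(rec["task"], rec["mode"], rec["prompt"])] = rec["metrics"]
    return latest


def to_markdown(collapsed):
    """Render the collapsed results as a Markdown table, sorted by key."""
    lines = ["| Task | Mode | Prompt | Metrics |", "|---|---|---|---|"]
    for (task, mode, prompt), metrics in sorted(collapsed.items()):
        cells = ", ".join(f"{k}={v}" for k, v in sorted(metrics.items()))
        lines.append(f"| {task} | {mode} | {prompt} | {cells} |")
    return "\n".join(lines)


# Dummy records (not real results); the second one simulates a rerun.
records = [
    {"task": "T1", "mode": "zs", "prompt": "raw", "metrics": {"n_rows": 10}},
    {"task": "T1", "mode": "zs", "prompt": "raw", "metrics": {"n_rows": 12}},
    {"task": "T2", "mode": "zs", "prompt": "enriched", "metrics": {"n_rows": 8}},
]
collapsed = collapse(records)
table = to_markdown(collapsed)
print(table)
```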
84
+ ## 2. Where Loop-SFT fits
85
+
86
+ Loop-SFT (`scripts/train_loop_sft.py`) is **not redundant** with RFT.
87
+ The two filter on different signals:
88
+
89
+ * **RFT** (Stage 3c): filter by *output objective*, i.e. generate K
90
+ candidates, keep ones whose **DNA sequence** satisfies budget + motif +
91
+ activity-shift via the oracle. Improves the **final answer**.
92
+ * **Loop-SFT**: filter by *trajectory*, i.e. keep traces whose
93
+ intermediate tool calls and reasoning chain are correct. Improves
94
+ the **reasoning chain that leads to the answer**.
95
+
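The difference between the two filters can be illustrated with a toy candidate record. Everything below is hypothetical: the `Candidate` fields and both predicates are stand-ins chosen to mirror the description above, not the actual oracle or trace checker.

```python
from dataclasses import dataclass, field


@dataclass
class Candidate:
    sequence: str
    within_budget: bool
    has_motif: bool
    activity_shift_ok: bool
    # One flag per intermediate step (tool call or reasoning hop).
    steps_ok: list = field(default_factory=list)


def rft_keep(c):
    """Output-objective filter: judges only the final DNA sequence."""
    return c.within_budget and c.has_motif and c.activity_shift_ok


def loop_sft_keep(c):
    """Trajectory filter: judges only the intermediate steps."""
    return bool(c.steps_ok) and all(c.steps_ok)


# A good answer reached through a flawed trace, and the reverse case.
good_answer_bad_path = Candidate("ACGT", True, True, True, [True, False])
bad_answer_good_path = Candidate("ACGT", True, False, True, [True, True])

for cand in (good_answer_bad_path, bad_answer_good_path):
    print(rft_keep(cand), loop_sft_keep(cand))
```

The two example candidates are kept by exactly one filter each, which is the sense in which the signals are complementary rather than redundant.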
96
+ The full T3 stack the paper aims for:
97
+
98
+ ```
99
+ Fusion-SFT (heuristic) → Loop-SFT (trajectory-filtered) → RFT (oracle-filtered) → SV-GSPO (RL)
100
+ Stage A                  Stage A'                         Stage B                 Stage C
101
+ ```
102
+
103
+ Stage A' (Loop-SFT) is **deferred** to a follow-up run because the
104
+ trajectory-trace dataset (`16K v9` in `t3_evaluation_design.md` §10)
105
+ is the lab's, not the H100's. The H100 ships:
106
+ - Stage A (the three `run_fusion` calls)
107
+ - Stage A's reasoning-only ablation (3b), equivalent to a
108
+ cold-start Loop-SFT with no traces; an ablation that shows
109
+ losing the heuristic DNA target doesn't tank the model
110
+ - Stage B (RFT, 3c)
111
+
112
+ When the lab finishes Loop-SFT on its side, the chain re-merges:
113
+ both teams point at the same `exp_t3_fusion_sft_*/final/pytorch_model.bin`,
114
+ the lab adds Loop-SFT on top, the H100 adds RFT on top, and we pick
115
+ whichever path scores higher on `eval_t3_oracle.py` for the paper.
116
+
117
+ ## 3. Job map (current state, 2026-04-27 UTC)
118
+
119
+ ```
120
+ H100 NVL
121
+ ├── PID 100474 launch_bench_vllm.sh (orchestrator)
122
+ │   └── PID 121129 vLLM bench T2 zs_enriched (in flight)
123
+ │       queued: T3 zs_raw, T3 zs_enriched
124
+ └── PID 100544 watcher → post_bench_pipeline.sh (idle until 100474 exits)
125
+ ```
126
+
127
+ ETAs (rough, post-T2 enriched completion):
128
+ * T3 raw + T3 enriched bench: ~5h each (10h total)
129
+ * Stage 0 + 0c (genqual + T3 oracle on zs preds): ~30 min
130
+ * Stages 1–3 fusion-SFT (3 × 35k × 1 epoch on H100 NVL): ~6–8h total
131
+ * Stage 3b reasoning-only: ~3h
132
+ * Stage 3c RFT generate + filter + re-SFT: ~5h
133
+ * Stage 4 joint multitask 105k: ~10h
134
+ * Stages 5–6 NTv3-only: ~2h each
135
+ * Stage 7 aggregator + HF push: minutes
136
+
137
+ Total from T2-enriched completion (T3 benches included) ≈ 40 H100-hours. Tracked in
138
+ `runs/post_bench_pipeline.log`; `tail -f` it for liveness.
139
+
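As a sanity check on the total, the per-stage estimates listed above can be summed directly. The figures below are copied from the ETA list (Stage 7 is left out because it is only "minutes"); the dictionary layout is just for illustration.

```python
# (low, high) estimates in hours, copied from the ETA list above.
etas_hours = {
    "T3 raw + enriched bench": (10.0, 10.0),
    "Stage 0 + 0c": (0.5, 0.5),
    "Stages 1-3 fusion-SFT": (6.0, 8.0),
    "Stage 3b reasoning-only": (3.0, 3.0),
    "Stage 3c RFT": (5.0, 5.0),
    "Stage 4 joint multitask": (10.0, 10.0),
    "Stages 5-6 NTv3-only": (4.0, 4.0),  # ~2h each
}

low = sum(lo for lo, _ in etas_hours.values())
high = sum(hi for _, hi in etas_hours.values())
print(low, high)  # 38.5 40.5
```

The range brackets the roughly 40 H100-hour figure only when the two remaining T3 benches are counted; without them the training and scoring stages alone come to about 28.5 to 30.5 hours.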
140
+ ## 4. Paper-table → script map (cheat sheet)
141
+
142
+ | Table 1 row | Numbers come from | Per-cell breakdown? |
143
+ |---|---|---|
144
+ | Row 1 (zs_raw) | `runs/exp_t{1,2,3}_grid_*/zs_raw/genqual/genqual.json` | yes |
145
+ | Row 2 (zs_enriched) | `.../zs_enriched/genqual/genqual.json` | yes |
146
+ | Row 3 (LoRA, no NTv3) | DEFERRED (not in current pipeline) | |
147
+ | Row 4 (Fusion-SFT, per-task) | `runs/exp_t{1,2,3}_fusion_sft_*/predict_t{1,2,3}_{raw,enriched}/genqual/*.json` | yes (T1/T3); T2 has no per-cell breakdown because pair_prediction is binary |
148
+ | Row 4b (T3 reasoning-only) | `runs/exp_t3_fusion_sft_reasonly_*/predict_t3_*/genqual/...` | yes |
149
+ | Row 4c (T3 RFT) | `runs/exp_t3_fusion_sft_rft_*/predict_t3_*/genqual/...` | yes |
150
+ | **Headline (joint multitask)** | `runs/exp_joint_multitask_*/predict_t{1,2,3}_*/genqual/...` | yes |
151
+ | Row 5 (Loop-SFT) | lab side, slurm | |
152
+ | Row 6 (SV-GSPO) | lab side, slurm | |
153
+
154
+ T3-specific paper section uses the **objective-satisfaction** metrics
155
+ from `eval_t3_oracle.py` (`within_budget`, `length_preserved`,
156
+ `objective_success_*`, `transfer_specificity`,
157
+ `in_budget_at_{5,10,20}pct`), not the heuristic-overlap genqual ones;
158
+ see `t3_evaluation_design.md` §2 for why.
159
+
160
+ ## 5. Reasoning-trace augmentation (OpenRouter / Nemotron, free)
161
+
162
+ `scripts/build_reasoning_traces.py` rewrites the assistant turn in any
163
+ T1/T2/T3 SFT JSONL to include a single-shot rationale that wires the
164
+ enriched evidence (TFBS scan, expression context, motif hits) to the
165
+ gold answer. Output schema matches the parent project's existing
166
+ `pe_dataset_reasoning_expansion_*/jsonl/` files exactly:
167
+
168
+ ```
169
+ <reasoning_start>RATIONALE</reasoning_end>
170
+ <enhancer_dna_start>SEQ</enhancer_dna_end> # T1/T3
171
+ <pair_label>paired|not_paired</pair_label> # T2
172
+ ```
173
+
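A downstream consumer of this schema needs to split the rationale from the answer. The parser below is a minimal sketch that takes the tag names literally from the block above (including the `_start` / `_end` naming); `parse_assistant_turn` is a hypothetical helper, not the project's parser, and the sample strings are made up.

```python
import re

# Tag names copied from the schema above; DOTALL lets content span lines.
REASONING = re.compile(r"<reasoning_start>(.*?)</reasoning_end>", re.S)
DNA = re.compile(r"<enhancer_dna_start>(.*?)</enhancer_dna_end>", re.S)
PAIR = re.compile(r"<pair_label>(paired|not_paired)</pair_label>")


def parse_assistant_turn(text):
    """Split an assistant turn into rationale and answer; missing parts are None."""
    def first(pattern):
        match = pattern.search(text)
        return match.group(1).strip() if match else None

    return {
        "rationale": first(REASONING),
        "dna": first(DNA),
        "pair_label": first(PAIR),
    }


t1_turn = (
    "<reasoning_start>Motif hits support activity.</reasoning_end>\n"
    "<enhancer_dna_start>ACGTACGT</enhancer_dna_end>"
)
t2_turn = (
    "<reasoning_start>Shared TFBS evidence.</reasoning_end>\n"
    "<pair_label>not_paired</pair_label>"
)
t1_parsed = parse_assistant_turn(t1_turn)
t2_parsed = parse_assistant_turn(t2_turn)
print(t1_parsed)
print(t2_parsed)
```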
174
+ Reuses `regureasoner.loop.openrouter.OpenRouterClient` (same retry +
175
+ backoff client `expand_loop_trajectories.py` uses). Single API call
176
+ per row: the teacher only writes the *justification*, not the
177
+ answer, so small free-tier models (default
178
+ `nvidia/nemotron-nano-9b-v2:free`; switch to
179
+ `nvidia/llama-3.1-nemotron-70b-instruct:free` for richer rationales)
180
+ stay reliable.
181
+
182
+ **Resumable**: appends to the output JSONL; on startup it scans every
183
+ `id` already present and skips those rows in the source. Daily reruns
184
+ accumulate without overlap.
185
+
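The resume behaviour can be sketched as "read the ids already written, then append only the rest". The helper names (`done_ids`, `augment`) and the row layout are assumptions for illustration; only the `id` field and the per-invocation cap (`--max-requests`) come from this document, and the `rewrite` callback stands in for the API call.

```python
import json
import tempfile
from pathlib import Path


def done_ids(output_path):
    """Collect the ids of rows already present in the output JSONL."""
    path = Path(output_path)
    if not path.exists():
        return set()
    with path.open() as fh:
        return {json.loads(line)["id"] for line in fh if line.strip()}


def augment(source_rows, output_path, rewrite, max_requests=1000):
    """Append rewritten rows not yet in the output, up to max_requests."""
    seen = done_ids(output_path)
    used = 0
    with Path(output_path).open("a") as out:
        for row in source_rows:
            if row["id"] in seen:
                continue  # already augmented in an earlier run
            if used >= max_requests:
                break  # per-invocation budget exhausted
            out.write(json.dumps(rewrite(row)) + "\n")
            used += 1
    return used


# Made-up rows; a cap of 2 stands in for the daily budget.
rows = [{"id": f"row{i}", "text": "..."} for i in range(5)]
with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "train.reasoning.jsonl"
    counts = [augment(rows, out_path, dict, max_requests=2) for _ in range(4)]
    written = sorted(done_ids(out_path))

print(counts)  # [2, 2, 1, 0]
```

Each simulated "day" picks up where the previous one stopped, and the fourth run writes nothing because every id is already present.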
186
+ **Budget**: `--max-requests` (default 1000) is the per-invocation
187
+ cap. OpenRouter free tier = 1000 req/day per key. Multiple keys can
188
+ shard line-level via `--shard-index/--num-shards`.
189
+
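Line-level sharding across keys could look like the sketch below. The round-robin rule (line index modulo shard count) is an assumption about how `--shard-index/--num-shards` partition the input, and `shard_rows` is a hypothetical helper; the document only states that sharding is line-level.

```python
def shard_rows(rows, shard_index, num_shards):
    """Yield the rows whose line index belongs to this shard (round-robin)."""
    if not 0 <= shard_index < num_shards:
        raise ValueError("shard_index must be in [0, num_shards)")
    for line_no, row in enumerate(rows):
        if line_no % num_shards == shard_index:
            yield row


# Ten made-up row ids split across three keys.
row_ids = [f"row{i}" for i in range(10)]
shards = [list(shard_rows(row_ids, k, 3)) for k in range(3)]
for k, shard in enumerate(shards):
    print(k, shard)
```

With this rule the shards are disjoint and together cover every line, so each key can spend its own daily budget without duplicating another key's work.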
190
+ **Daily-loop launcher**: `slurm/build_reasoning_traces_loop.sh`
191
+ sources `OPENROUTER_API_KEY` from `/dev/shm/dnathinker/.env`, walks
192
+ T1/T2/T3 with `PER_TASK=333` each (≈1000/day total), and
193
+ optionally `--daemon`s into a 24h sleep loop. Zero GPU; runs
194
+ alongside any training stage.
195
+
196
+ **SFT integration**: when ≥N augmented rows accumulate per task, point
197
+ `scripts/train_fusion_sft.py --train-jsonl` at
198
+ `/dev/shm/dnathinker/data/reasoning_traces/train.<task>.reasoning.jsonl`.
199
+ Same collator, same trainer; the only difference is the assistant
200
+ turn now starts with `<reasoning_start>...</reasoning_end>`, so the
201
+ trained model emits explicit rationale + answer at inference time.
202
+ This is the **paper's "reasoning model" row** in T3's table; the
203
+ non-reasoning fusion-SFT runs (Stages 1–3) stay as the
204
+ no-rationale comparison.
205
+
206
+ **Per-task source JSONL (what the teacher justifies)**:
207
+
208
+ | Task | Source JSONL | Why |
209
+ |---|---|---|
210
+ | T1 | `train.enhancer_generation.strat7c.n35k.jsonl` (heuristic gold) | The heuristic gold is the empirical paired enhancer; teacher justifies why it pairs in this cell type. |
211
+ | T2 | `train.pair_prediction.strat7c.n35k.jsonl` (observed positive + pseudo-negative) | Teacher justifies the binary label using shared-TFBS / GC / expression evidence. |
212
+ | T3 | **post-RFT** `train.t3_rft.jsonl` | The heuristic gold for T3 is a synthetic motif-implant (not unique GT; see `t3_evaluation_design.md` §1). RFT (Stage 3c) replaces it with an oracle-validated candidate. Reasoning expansion **must run on the post-RFT JSONL** so the rationale justifies a sequence the oracle has actually scored, not the heuristic. Order: Fusion-SFT → RFT → reasoning expansion → reasoning-augmented Fusion-SFT. |
213
+
214
+ The launcher `slurm/build_reasoning_traces_loop.sh` defaults to the
215
+ heuristic-gold JSONLs for all three tasks (T1, T2, T3); for T3,
216
+ override `T3_SRC=/dev/shm/dnathinker/runs/exp_t3_fusion_sft_rft_${STAMP}/.../train.t3_rft.jsonl`
217
+ once Stage 3c finishes. The loop's resume logic handles a mid-run
218
+ source swap because the augmented output JSONL keeps row ids.
219
+
220
+ ## 6. Input sanitisation: applied globally before any model sees text
221
+
222
+ `regureasoner/utils/input_sanitize.py` (used by `PromptBuilder.user()`
223
+ and `build_reasoning_traces._format_user`) strips three classes of
224
+ issue at read-time, so we don't need to regenerate the prod JSONLs:
225
+
226
+ 1. **Label leaks**: `peak_name=chr…`, `enhancer_peak_name=chr…`,
227
+ the "Peak coordinates parsed to chr…:…" sentence, the "Observed
228
+ dataset row is a released paired/not_paired link …" sentence
229
+ (T2's biggest leak), and `label_source=…` lines.
230
+ 2. **Unexplained proxy scores**: `Evolution proxy score …
231
+ (expression_stability_proxy_v1)`, `promoter_likeness_score=…`,
232
+ `quality_score / repeat_fraction / kmer_entropy_norm` (these are
233
+ ad-hoc internal scores the model can't ground; we omit rather
234
+ than try to explain in-prompt).
235
+ 3. **Cell-type abbreviations** expanded: `cell_type=Ex` →
236
+ `cell_type=Excitatory neuron (Ex)` so the model knows the biology.
237
+
238
+ Applied before any model call. Idempotent β€” running it twice yields
239
+ the same string. 12 unit tests cover every leak/score family +
240
+ idempotency + cell-type expansion (`tests/test_input_sanitize.py`).
241
+
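A minimal sketch of an idempotent read-time sanitiser follows. The regular expressions are illustrative guesses covering one example from each of the three classes; they are not the rules in `regureasoner/utils/input_sanitize.py`, and the sample prompt (including its `gene=` and `sequence_length=` fields) is made up.

```python
import re

# Illustrative patterns only; the real rules live in input_sanitize.py.
DROP_LINE = re.compile(r"^[ \t]*(?:label_source|promoter_likeness_score)=.*$\n?", re.M)
PEAK_NAME = re.compile(r"\b(?:enhancer_)?peak_name=chr\S*[ \t]?")
CELL_TYPE = re.compile(r"\bcell_type=(\w+)\b")
CELL_TYPE_NAMES = {"Ex": "Excitatory neuron"}  # abbreviation -> full name


def expand_cell_type(match):
    """Expand a known abbreviation; leave anything else untouched."""
    abbr = match.group(1)
    full = CELL_TYPE_NAMES.get(abbr)
    return f"cell_type={full} ({abbr})" if full else match.group(0)


def sanitize(text):
    text = DROP_LINE.sub("", text)   # whole-line leaks and proxy scores
    text = PEAK_NAME.sub("", text)   # inline coordinate leaks
    return CELL_TYPE.sub(expand_cell_type, text)


raw = (
    "cell_type=Ex\n"
    "peak_name=chr1:100-200 gene=GENE1\n"
    "label_source=observed\n"
    "sequence_length=500\n"
)
clean = sanitize(raw)
print(clean)
```

Idempotency here falls out of the lookup: on a second pass the expanded value begins with `Excitatory`, which is not a key in the abbreviation table, so the text is left unchanged.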
242
+ Why we don't run reasoning-trace augmentation (§5) *inside* `post_bench_pipeline.sh`: the script
243
+ is IO-bound (no GPU), capped at 1000 req/day, and meant to run for
244
+ **multiple days** in the background. Putting it in the GPU pipeline
245
+ would either waste a single 1000-call day or block the rest of the
246
+ pipeline waiting for accumulation. The right pattern is to launch
247
+ `build_reasoning_traces_loop.sh --daemon` once at the start of the
248
+ campaign and let it accumulate rows independently. When a critical
249
+ mass exists, fire a single fusion-SFT run on the augmented JSONL.