# Experiment chain: unified-MM LLM (paper-grade, v5)
Single document that ties together every `.py` and `.sh` we run from
zero-shot bench to final SV-GSPO checkpoint. v4 (in
`experiment_chain.md`) covered the per-task LLM progression. v5 adds
the unified-multimodal stack and the post-bench training pipeline that
auto-fires after the bench grid finishes.
Run order matches the order of stages in
`/dev/shm/dnathinker/post_bench_pipeline.sh`: the H100 simply runs
that script top to bottom, with no SLURM dependencies.
## 0. Bench grid (zero-shot baselines)
| Stage | Script | Output | Purpose |
|-------|--------|--------|---------|
| ZS-T1 raw | `scripts/run_llm_benchmark_vllm.py --task enhancer_generation --prompt raw` | `runs/exp_t1_grid_*/zs_raw/{predictions,metrics}.json{,l}` | Paper Table 1 row 1 |
| ZS-T1 enriched | same w/ `--prompt enriched` | `runs/exp_t1_grid_*/zs_enriched/...` | Table 1 row 2 |
| ZS-T2 raw | `--task pair_prediction --prompt raw` | `runs/exp_t2_grid_*/zs_raw/...` | Table 1 row 1 (T2) |
| ZS-T2 enriched | same enriched | `runs/exp_t2_grid_*/zs_enriched/...` | Table 1 row 2 (T2) |
| ZS-T3 raw / enriched | `--task enhancer_editing` × {raw, enriched} | `runs/exp_t3_grid_*/...` | Table 1 rows 1–2 (T3) |
Driver: `/dev/shm/dnathinker/launch_bench_vllm.sh` runs the 6 vLLM
benches sequentially. When the orchestrator PID exits, an attached
watcher fires `post_bench_pipeline.sh`.
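The watcher pattern above can be sketched in a few lines. This is a hypothetical illustration, not the watcher actually attached to `launch_bench_vllm.sh`: the names `pid_alive` and `wait_then_run`, the 30-second poll interval, and the POSIX signal-0 probe are all assumptions.

```python
import os
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def pid_alive(pid: int) -> bool:
    """POSIX liveness probe: signal 0 checks existence without sending a signal."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False  # no such process
    except PermissionError:
        return True   # process exists but belongs to another user
    return True


def wait_then_run(
    pid: int,
    action: Callable[[], T],
    poll_seconds: float = 30.0,
    is_alive: Callable[[int], bool] = pid_alive,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Block until `pid` has exited, then run `action` once and return its result."""
    while is_alive(pid):
        sleep(poll_seconds)
    return action()
```

In this sketch the `action` callable would launch `post_bench_pipeline.sh`; `is_alive` and `sleep` are injectable only so the loop can be exercised without a real process.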
## 1. Post-bench pipeline (auto-triggered)
`/dev/shm/dnathinker/post_bench_pipeline.sh`. Each stage skip-checks
on its own output file, so re-runs are idempotent.
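A minimal sketch of the skip-check idea, assuming each stage has exactly one output file. The helper name `run_stage` is hypothetical, and the write-to-temp-then-rename step is an addition of this sketch rather than something the pipeline script is known to do.

```python
import tempfile
from pathlib import Path
from typing import Callable


def run_stage(name: str, output: Path, work: Callable[[Path], None]) -> bool:
    """Run `work` unless `output` already exists. Returns True if work ran."""
    if output.exists():
        print(f"[skip] {name}: {output.name} already present")
        return False
    # Write to a temporary sibling first, then rename, so an interrupted
    # stage never leaves a partial file that a later run would skip on.
    tmp = output.with_name(output.name + ".tmp")
    work(tmp)
    tmp.replace(output)
    print(f"[done] {name}")
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        out = Path(d) / "genqual.json"
        run_stage("stage0", out, lambda p: p.write_text("{}"))  # runs
        run_stage("stage0", out, lambda p: p.write_text("{}"))  # skipped
```

Because the check keys on the final output path only, re-running the whole pipeline after a crash repeats just the stages whose outputs are missing.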
### Stage 0: ZS scoring (early HF push)
* `scripts/run_generation_eval.py` → `genqual.json` (FBD / spec /
  argmax-acc / per-cell-type) for T1+T3 zs_raw / zs_enriched.
* `scripts/eval_t3_oracle.py` → `genqual_t3_oracle.json`
  (within-budget, length-preserved, objective-success per
  edit_type, per-cell-type) on T3 zs predictions.
* HF push of the partial bench results so the lab can see numbers
  before training stages finish.
### Stages 1–4: Fusion-SFT family (the headline)
Each `run_fusion` call invokes `scripts/train_fusion_sft.py` with
`--architecture-mode llava`, then **Stage Nb** invokes
`scripts/predict_fusion.py` on the trained adapter to get
predictions on the full test set, followed by
`run_generation_eval.py` (T1/T3) and `eval_t3_oracle.py` (T3 only).
These produce the `lora_raw` / `lora_enriched` rows in Table 1.
| Stage | Train script call | Inference + scoring | Paper row |
|-------|---|---|---|
| 1 | T1 fusion-SFT (n35k T1) | `score_adapter T1 ... raw / enriched` | T1 row 4 |
| 2 | T2 fusion-SFT (n35k T2 balanced) | `score_adapter T2 ... raw / enriched` | T2 row 4 |
| 3 | T3 fusion-SFT (n35k T3, heuristic gold) | `score_adapter T3 ... raw / enriched` | T3 row 4a |
| 3b | T3 reasoning-only SFT (`--mask-assistant-dna-span`) | same | T3 row 4b (paper ablation) |
| 3c | T3 RFT (Stage A → K candidates → oracle-filter → re-SFT) | same | T3 row 4c (paper ablation) |
| 4 | **Joint multitask** fusion-SFT (105k = 35k×3 balanced) | `score_adapter` × {T1,T2,T3} × {raw,enriched} | **headline row**: one model, three tasks |
`score_adapter` is defined inside `post_bench_pipeline.sh`. It exists
because `run_llm_benchmark.py --adapter-dir` expects PEFT format
(`adapter_model.bin` + `adapter_config.json`), and our
`FusionSFTTrainer` saves a **full** OneShotFusionLM state_dict (LLM +
LoRA + NTv3 projector + cell context encoder) via `torch.save`.
`predict_fusion.py` rebuilds the model and `load_state_dict`s it,
then runs `model.llm.generate` with the same prompt builder + parser
that `ZeroShotLLM.predict` uses, so `predictions.jsonl` is
shape-compatible with the genqual + T3-oracle scorers. This is the single
bridge between training output and the eval pipeline.
### Stages 5–6: NTv3-only baselines
* Stage 5: `scripts/train_generation.py --head mdlm` (NTv3-MDLM on T1).
* Stage 6: `scripts/train_ntv3_direct.py` (NTv3-direct on T2).
* These are the "no LLM" rows in Table 1, included to show that the LLM contributes signal.
### Stage 7: Aggregator + final HF push
* `aggregate_results.py` walks `runs/`, collapses
`(task, mode, prompt)` and writes
`/dev/shm/dnathinker/results/h100_snapshot.md`.
* HF push of metrics + genqual + h100_snapshot.md.
## 2. Where Loop-SFT fits
Loop-SFT (`scripts/train_loop_sft.py`) is **not redundant** with RFT.
The two filter on different signals:
* **RFT** (Stage 3c): filter by *output objective*. Generate K
  candidates and keep those whose **DNA sequence** satisfies
  budget + motif + activity-shift via the oracle. Improves the
  **final answer**.
* **Loop-SFT**: filter by *trajectory*. Keep traces whose
  intermediate tool calls and reasoning chain are correct. Improves
  the **reasoning chain that leads to the answer**.
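The RFT-style output filter can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the oracle is reduced to a precomputed `activity_shift` number, the edit budget is a Hamming fraction with a 10% default, and the motif check is a plain substring test. The real oracle and thresholds live in the project's scripts and may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    sequence: str
    activity_shift: float  # hypothetical oracle score; > 0 means the edit moved activity the desired way


def edit_fraction(original: str, edited: str) -> float:
    """Fraction of positions that differ (Hamming); requires equal lengths."""
    if len(original) != len(edited):
        raise ValueError("length not preserved")
    return sum(a != b for a, b in zip(original, edited)) / len(original)


def rft_filter(
    original: str,
    candidates: List[Candidate],
    motif: str,
    max_edit_fraction: float = 0.10,
) -> List[Candidate]:
    """Keep candidates that preserve length, stay in budget, contain the motif, and shift activity."""
    kept = []
    for cand in candidates:
        if len(cand.sequence) != len(original):
            continue  # length not preserved
        if edit_fraction(original, cand.sequence) > max_edit_fraction:
            continue  # over the edit budget
        if motif not in cand.sequence:
            continue  # required motif missing
        if cand.activity_shift <= 0:
            continue  # oracle sees no shift in the desired direction
        kept.append(cand)
    return kept
```

A trajectory filter such as Loop-SFT would look at the intermediate steps instead of `cand.sequence`, which is why the two stages select different training rows.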
The full T3 stack the paper aims for:
```
Fusion-SFT (heuristic) → Loop-SFT (trajectory-filtered) → RFT (oracle-filtered) → SV-GSPO (RL)
Stage A                  Stage A'                         Stage B                 Stage C
```
Stage A' (Loop-SFT) is **deferred** to a follow-up run because the
trajectory-trace dataset (`16K v9` in `t3_evaluation_design.md` §10)
is the lab's, not the H100's. The H100 ships:
- Stage A (the three `run_fusion` calls)
- Stage A's reasoning-only ablation (3b): equivalent to a
  cold-start Loop-SFT with no traces; it is meant to show that
  losing the heuristic DNA target doesn't tank the model
- Stage B (RFT, 3c)
When the lab finishes Loop-SFT on its side, the chain re-merges:
both teams point at the same `exp_t3_fusion_sft_*/final/pytorch_model.bin`,
the lab adds Loop-SFT on top, the H100 adds RFT on top, and we pick
whichever path scores higher on `eval_t3_oracle.py` for the paper.
## 3. Job map (current state, 2026-04-27 UTC)
```
H100 NVL
├── PID 100474  launch_bench_vllm.sh (orchestrator)
│   └── PID 121129  vLLM bench T2 zs_enriched (in flight)
│       queued: T3 zs_raw, T3 zs_enriched
└── PID 100544  watcher → post_bench_pipeline.sh (idle until 100474 exits)
```
ETAs (rough, post-T2 enriched completion):
* T3 raw + T3 enriched bench: ~5h each (10h total)
* Stage 0 + 0c (genqual + T3 oracle on zs preds): ~30 min
* Stages 1–3 fusion-SFT (3 × 35k × 1 epoch on H100 NVL): ~6–8h total
* Stage 3b reasoning-only: ~3h
* Stage 3c RFT generate + filter + re-SFT: ~5h
* Stage 4 joint multitask 105k: ~10h
* Stages 5–6 NTv3-only: ~2h each
* Stage 7 aggregator + HF push: minutes
Total ≈ 40 H100-hours, counting the ~10h of remaining T3 benches (the
itemized estimates sum to 38.5–40.5h). Progress is logged to
`runs/post_bench_pipeline.log`; `tail -f` it for liveness.
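As a quick check of the total, the itemized ETAs can be summed directly. The dictionary below just transcribes the list above; treating "~2h each" for Stages 5–6 as 4h and omitting Stage 7 ("minutes") are assumptions of this sketch.

```python
from typing import Dict, Tuple

# (low, high) hour estimates copied from the ETA list; single-point
# estimates use the same value twice. Stage 7 ("minutes") is left out.
ETAS_HOURS: Dict[str, Tuple[float, float]] = {
    "T3 raw + enriched bench": (10.0, 10.0),
    "Stage 0 + 0c scoring": (0.5, 0.5),
    "Stages 1-3 fusion-SFT": (6.0, 8.0),
    "Stage 3b reasoning-only": (3.0, 3.0),
    "Stage 3c RFT": (5.0, 5.0),
    "Stage 4 joint multitask": (10.0, 10.0),
    "Stages 5-6 NTv3-only (2h each)": (4.0, 4.0),
}


def total_hours(etas: Dict[str, Tuple[float, float]]) -> Tuple[float, float]:
    """Return the (low, high) sum over all stage estimates."""
    low = sum(lo for lo, _ in etas.values())
    high = sum(hi for _, hi in etas.values())
    return low, high


print(total_hours(ETAS_HOURS))
```

The range comes out at 38.5 to 40.5 hours, which matches the rounded figure of about 40 only when the two remaining T3 benches are included.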
## 4. Paper-table → script map (cheat sheet)
| Table 1 row | Numbers come from | Per-cell breakdown? |
|---|---|---|
| Row 1 (zs_raw) | `runs/exp_t{1,2,3}_grid_*/zs_raw/genqual/genqual.json` | yes |
| Row 2 (zs_enriched) | `.../zs_enriched/genqual/genqual.json` | yes |
| Row 3 (LoRA, no NTv3) | DEFERRED (not in current pipeline) | |
| Row 4 (Fusion-SFT, per-task) | `runs/exp_t{1,2,3}_fusion_sft_*/predict_t{1,2,3}_{raw,enriched}/genqual/*.json` | yes (T1/T3); T2 has no per-cell breakdown because pair_prediction is binary |
| Row 4b (T3 reasoning-only) | `runs/exp_t3_fusion_sft_reasonly_*/predict_t3_*/genqual/...` | yes |
| Row 4c (T3 RFT) | `runs/exp_t3_fusion_sft_rft_*/predict_t3_*/genqual/...` | yes |
| **Headline (joint multitask)** | `runs/exp_joint_multitask_*/predict_t{1,2,3}_*/genqual/...` | yes |
| Row 5 (Loop-SFT) | lab side, slurm | |
| Row 6 (SV-GSPO) | lab side, slurm | |
The T3-specific paper section uses the **objective-satisfaction** metrics
from `eval_t3_oracle.py` (`within_budget`, `length_preserved`,
`objective_success_*`, `transfer_specificity`,
`in_budget_at_{5,10,20}pct`), not the heuristic-overlap genqual ones;
see `t3_evaluation_design.md` §2 for why.
## 5. Reasoning-trace augmentation (OpenRouter / Nemotron, free)
`scripts/build_reasoning_traces.py` rewrites the assistant turn in any
T1/T2/T3 SFT JSONL to include a single-shot rationale that wires the
enriched evidence (TFBS scan, expression context, motif hits) to the
gold answer. Output schema matches the parent project's existing
`pe_dataset_reasoning_expansion_*/jsonl/` files exactly:
```
<reasoning_start>RATIONALE</reasoning_end>
<enhancer_dna_start>SEQ</enhancer_dna_end> # T1/T3
<pair_label>paired|not_paired</pair_label> # T2
```
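A small parser for this schema might look like the following. It is a sketch, not the project's parser: the function name `parse_assistant_turn` and the returned field names are hypothetical, and the tag spellings (including the `_start` / `_end` pairing) are taken verbatim from the schema above.

```python
import re
from typing import Dict, Optional

# Tag names copied from the schema; note the opening tag ends in
# "_start" and the closing tag in "_end".
_PATTERNS = {
    "rationale": re.compile(r"<reasoning_start>(.*?)</reasoning_end>", re.DOTALL),
    "sequence": re.compile(r"<enhancer_dna_start>(.*?)</enhancer_dna_end>", re.DOTALL),
    "pair_label": re.compile(r"<pair_label>(paired|not_paired)</pair_label>"),
}


def parse_assistant_turn(text: str) -> Dict[str, Optional[str]]:
    """Extract rationale, DNA sequence (T1/T3) and pair label (T2); missing fields are None."""
    out: Dict[str, Optional[str]] = {}
    for field, pattern in _PATTERNS.items():
        match = pattern.search(text)
        out[field] = match.group(1).strip() if match else None
    return out
```

A T1/T3 row yields a `sequence` and no `pair_label`; a T2 row yields the reverse, so one parser can serve all three tasks.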
`build_reasoning_traces.py` reuses `regureasoner.loop.openrouter.OpenRouterClient`
(the same retry + backoff client that `expand_loop_trajectories.py` uses).
It makes a single API call per row: the teacher only writes the
*justification*, not the answer, so small free-tier models (default
`nvidia/nemotron-nano-9b-v2:free`; switch to
`nvidia/llama-3.1-nemotron-70b-instruct:free` for richer rationales)
stay reliable.
**Resumable**: appends to the output JSONL; on startup it scans every
`id` already present and skips those rows in the source. Daily reruns
accumulate without overlap.
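The resume behaviour can be sketched as two small functions over JSONL lines. This is an illustration under assumptions: the function names are hypothetical, and silently skipping a malformed (for example, half-written) output line is a choice of this sketch, not documented behaviour of `build_reasoning_traces.py`.

```python
import json
from typing import Iterable, Iterator, Set


def done_ids(output_lines: Iterable[str]) -> Set[str]:
    """Collect every `id` already present in the output JSONL."""
    ids: Set[str] = set()
    for line in output_lines:
        line = line.strip()
        if not line:
            continue
        try:
            ids.add(json.loads(line)["id"])
        except (json.JSONDecodeError, KeyError, TypeError):
            continue  # tolerate a torn or malformed line from an interrupted run
    return ids


def pending_rows(source_lines: Iterable[str], output_lines: Iterable[str]) -> Iterator[dict]:
    """Yield source rows whose `id` has not been written to the output yet."""
    seen = done_ids(output_lines)
    for line in source_lines:
        if line.strip():
            row = json.loads(line)
            if row["id"] not in seen:
                yield row
```

Open file handles can be passed directly as the line iterables; the caller then appends each newly augmented row to the output file.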
**Budget**: `--max-requests` (default 1000) is the per-invocation
cap. OpenRouter free tier = 1000 req/day per key. Multiple keys can
shard line-level via `--shard-index/--num-shards`.
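Line-level sharding combined with a per-invocation cap can be sketched as below. The modulo assignment is an assumption about how `--shard-index/--num-shards` split the rows, and both function names are hypothetical.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def shard_rows(rows: Iterable[T], shard_index: int, num_shards: int) -> Iterator[T]:
    """Yield every row whose line index falls in this shard (index modulo num_shards)."""
    if not 0 <= shard_index < num_shards:
        raise ValueError("shard_index must satisfy 0 <= shard_index < num_shards")
    for i, row in enumerate(rows):
        if i % num_shards == shard_index:
            yield row


def cap_requests(rows: Iterable[T], max_requests: int = 1000) -> Iterator[T]:
    """Stop after `max_requests` rows, mirroring a per-invocation budget."""
    return islice(rows, max_requests)
```

With one API key per shard, each key processes a disjoint slice of the source file and stays under its own daily cap.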
**Daily-loop launcher**: `slurm/build_reasoning_traces_loop.sh`
sources `OPENROUTER_API_KEY` from `/dev/shm/dnathinker/.env`, walks
T1/T2/T3 with `PER_TASK=333` each (≈1000/day total), and
optionally `--daemon`s into a 24h sleep loop. It uses zero GPU, so it
can run alongside any training stage.
**SFT integration**: when ≥N augmented rows accumulate per task, point
`scripts/train_fusion_sft.py --train-jsonl` at
`/dev/shm/dnathinker/data/reasoning_traces/train.<task>.reasoning.jsonl`.
Same collator, same trainer; the only difference is that the assistant
turn now starts with `<reasoning_start>...</reasoning_end>`, so the
trained model emits explicit rationale + answer at inference time.
This is the **paper's "reasoning model" row** in T3's table; the
non-reasoning fusion-SFT runs (Stages 1–3) stay as the
no-rationale comparison.
**Per-task source JSONL and what the teacher justifies**:
| Task | Source JSONL | Why |
|---|---|---|
| T1 | `train.enhancer_generation.strat7c.n35k.jsonl` (heuristic gold) | The heuristic gold is the empirical paired enhancer; teacher justifies why it pairs in this cell type. |
| T2 | `train.pair_prediction.strat7c.n35k.jsonl` (observed positive + pseudo-negative) | Teacher justifies the binary label using shared-TFBS / GC / expression evidence. |
| T3 | **post-RFT** `train.t3_rft.jsonl` | The heuristic gold for T3 is a synthetic motif-implant (not unique GT; see `t3_evaluation_design.md` §1). RFT (Stage 3c) replaces it with an oracle-validated candidate. Reasoning expansion **must run on the post-RFT JSONL** so the rationale justifies a sequence the oracle has actually scored, not the heuristic. Order: Fusion-SFT → RFT → reasoning expansion → reasoning-augmented Fusion-SFT. |
The launcher `slurm/build_reasoning_traces_loop.sh` defaults to the
heuristic-gold JSONLs for all three tasks. For T3, override
`T3_SRC=/dev/shm/dnathinker/runs/exp_t3_fusion_sft_rft_${STAMP}/.../train.t3_rft.jsonl`
once Stage 3c finishes; the loop's resume logic handles a mid-run
source swap because the augmented output JSONL keeps row ids.
## 6. Input sanitisation: applied globally before any model sees text
`regureasoner/utils/input_sanitize.py` (used by `PromptBuilder.user()`
and `build_reasoning_traces._format_user`) strips three classes of
issue at read-time, so we don't need to regenerate the prod JSONLs:
1. **Label leaks**: `peak_name=chr…`, `enhancer_peak_name=chr…`,
   the "Peak coordinates parsed to chr…:…" sentence, the "Observed
   dataset row is a released paired/not_paired link …" sentence
   (T2's biggest leak), and `label_source=…` lines.
2. **Unexplained proxy scores**: `Evolution proxy score …
   (expression_stability_proxy_v1)`, `promoter_likeness_score=…`,
   `quality_score / repeat_fraction / kmer_entropy_norm` (these are
   ad-hoc internal scores the model can't ground; we omit them rather
   than try to explain them in-prompt).
3. **Cell-type abbreviations** are expanded: `cell_type=Ex` →
   `cell_type=Excitatory neuron (Ex)` so the model knows the biology.
Applied before any model call. Idempotent: running it twice yields
the same string. 12 unit tests cover every leak/score family +
idempotency + cell-type expansion (`tests/test_input_sanitize.py`).
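A reduced sketch of the three rule classes, with idempotency checked at the end. The regular expressions, the single-entry cell-type map, and the function name `sanitize` are illustrative assumptions; the real rules in `regureasoner/utils/input_sanitize.py` cover more patterns (including the sentence-level leaks) and may be written differently.

```python
import re

# Illustrative patterns only: key=value leaks and proxy scores are dropped.
_DROP_PATTERNS = [
    re.compile(r"\b(?:enhancer_)?peak_name=\S+\s*"),
    re.compile(r"\blabel_source=\S+\s*"),
    re.compile(r"\bpromoter_likeness_score=\S+\s*"),
]

# Hypothetical abbreviation map; only the example from the text is included.
_CELL_TYPES = {"Ex": "Excitatory neuron (Ex)"}
_CELL_TYPE = re.compile(r"\bcell_type=(\w+)")


def _expand_cell_type(match: "re.Match[str]") -> str:
    abbrev = match.group(1)
    # Unknown or already-expanded values pass through unchanged,
    # which is what makes a second pass a no-op.
    return "cell_type=" + _CELL_TYPES.get(abbrev, abbrev)


def sanitize(text: str) -> str:
    """Drop leak/proxy fields and expand cell-type abbreviations."""
    for pattern in _DROP_PATTERNS:
        text = pattern.sub("", text)
    text = _CELL_TYPE.sub(_expand_cell_type, text)
    return text.strip()


raw = "cell_type=Ex peak_name=chr1:100-200 label_source=released GC=0.52"
once = sanitize(raw)
assert sanitize(once) == once  # idempotent on this example
```

Applying the rules at read time, as the section describes, means the stored JSONLs stay untouched and any rule change takes effect on the next run.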
Why the reasoning-trace builder (§5) does not run *inside*
`post_bench_pipeline.sh`: the script is IO-bound (no GPU), capped at
1000 req/day, and meant to run for **multiple days** in the background.
Putting it in the GPU pipeline would either limit it to a single
1000-call day or block the rest of the pipeline waiting for
accumulation. The right pattern is to launch
`build_reasoning_traces_loop.sh --daemon` once at the start of the
campaign and let it accumulate rows independently. When a critical
mass exists, fire a single fusion-SFT run on the augmented JSONL.