Upload docs/experiment_chain_v5_unified.md with huggingface_hub

# Experiment chain: unified-MM LLM (paper-grade, v5)

Single document that ties together every `.py` and `.sh` we run from
zero-shot bench to final SV-GSPO checkpoint. v4 (in
`experiment_chain.md`) covered the per-task LLM progression. v5 adds
the unified-multimodal stack and the post-bench training pipeline that
auto-fires after the bench grid finishes.

Run order is the same as the order of stages in
`/dev/shm/dnathinker/post_bench_pipeline.sh`: the H100 just reads
that script top-to-bottom, no SLURM dependencies.

## 0. Bench grid (zero-shot baselines)

| Stage | Script | Output | Purpose |
|-------|--------|--------|---------|
| ZS-T1 raw | `scripts/run_llm_benchmark_vllm.py --task enhancer_generation --prompt raw` | `runs/exp_t1_grid_*/zs_raw/{predictions,metrics}.json{,l}` | Paper Table 1 row 1 |
| ZS-T1 enriched | same, with `--prompt enriched` | `runs/exp_t1_grid_*/zs_enriched/...` | Table 1 row 2 |
| ZS-T2 raw | `--task pair_prediction --prompt raw` | `runs/exp_t2_grid_*/zs_raw/...` | Table 1 row 1 (T2) |
| ZS-T2 enriched | same, with `--prompt enriched` | `runs/exp_t2_grid_*/zs_enriched/...` | Table 1 row 2 (T2) |
| ZS-T3 raw / enriched | `--task enhancer_editing` × {raw, enriched} | `runs/exp_t3_grid_*/...` | Table 1 rows 1–2 (T3) |

Driver: `/dev/shm/dnathinker/launch_bench_vllm.sh` runs the 6 vLLM
benches sequentially. When the orchestrator PID exits, an attached
watcher fires `post_bench_pipeline.sh`.
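
The watcher described above can be sketched in a few lines. This is a minimal illustration, not the actual watcher: the names `pid_alive` and `wait_then_run` are hypothetical, and it assumes a POSIX host where signal 0 can be used to probe a PID.

```python
import os
import subprocess
import time


def pid_alive(pid: int) -> bool:
    """Return True while a process with this PID exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 probes for existence, nothing is delivered
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def wait_then_run(pid: int, command: list[str], poll_seconds: float = 30.0) -> int:
    """Block until `pid` exits, then run `command` and return its exit code."""
    while pid_alive(pid):
        time.sleep(poll_seconds)
    return subprocess.run(command, check=False).returncode
```

Under these assumptions, `wait_then_run(100474, ["bash", "/dev/shm/dnathinker/post_bench_pipeline.sh"])` would idle until the orchestrator exits and then start the pipeline.
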

## 1. Post-bench pipeline (auto-triggered)

Script: `/dev/shm/dnathinker/post_bench_pipeline.sh`. Each stage skip-checks
on its own output file, so re-runs are idempotent.
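
The skip-check pattern is easy to show in isolation. The sketch below is an assumption about the shape of that logic, written in Python for illustration (the real pipeline is a shell script); `run_stage` and its arguments are hypothetical names.

```python
from pathlib import Path
from typing import Callable


def run_stage(name: str, output: Path, action: Callable[[], None]) -> bool:
    """Run `action` unless `output` already exists. Returns True if it ran."""
    if output.exists():
        print(f"[skip] {name}: {output} already present")
        return False
    action()
    if not output.exists():
        # Without the marker file the next re-run would repeat the stage.
        raise RuntimeError(f"{name} finished without writing {output}")
    return True
```

Because the decision depends only on the stage's own output file, re-running the whole pipeline after a crash repeats just the stages that never finished.
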

### Stage 0: ZS scoring (early HF push)

* `scripts/run_generation_eval.py` → `genqual.json` (FBD / spec /
  argmax-acc / per-cell-type) for T1+T3 zs_raw / zs_enriched.
* `scripts/eval_t3_oracle.py` → `genqual_t3_oracle.json`
  (within-budget, length-preserved, objective-success per
  edit_type, per-cell-type) on T3 zs predictions.
* HF push of the partial bench results so lab can see numbers
  before training stages finish.

### Stages 1–4: Fusion-SFT family (the headline)

Each `run_fusion` call invokes `scripts/train_fusion_sft.py` with
`--architecture-mode llava`, then **Stage Nb** invokes
`scripts/predict_fusion.py` on the trained adapter to get
predictions on the full test set, followed by
`run_generation_eval.py` (T1/T3) and `eval_t3_oracle.py` (T3 only).
These produce the `lora_raw` / `lora_enriched` rows in Table 1.

| Stage | Train script call | Inference + scoring | Paper row |
|-------|---|---|---|
| 1 | T1 fusion-SFT (n35k T1) | `score_adapter T1 ... raw / enriched` | T1 row 4 |
| 2 | T2 fusion-SFT (n35k T2 balanced) | `score_adapter T2 ... raw / enriched` | T2 row 4 |
| 3 | T3 fusion-SFT (n35k T3, heuristic gold) | `score_adapter T3 ... raw / enriched` | T3 row 4a |
| 3b | T3 reasoning-only SFT (`--mask-assistant-dna-span`) | same | T3 row 4b (paper ablation) |
| 3c | T3 RFT (Stage A → K candidates → oracle-filter → re-SFT) | same | T3 row 4c (paper ablation) |
| 4 | **Joint multitask** fusion-SFT (105k = 35k×3 balanced) | `score_adapter` × {T1,T2,T3} × {raw,enriched} | **headline row**: one model, three tasks |

`score_adapter` is defined inside `post_bench_pipeline.sh`. It exists
because `run_llm_benchmark.py --adapter-dir` expects PEFT format
(`adapter_model.bin` + `adapter_config.json`), and our
`FusionSFTTrainer` saves a **full** OneShotFusionLM state_dict (LLM +
LoRA + NTv3 projector + cell context encoder) via `torch.save`.
`predict_fusion.py` rebuilds the model and `load_state_dict`s it,
then runs `model.llm.generate` with the same prompt builder + parser
that `ZeroShotLLM.predict` uses, so `predictions.jsonl` is
shape-compatible with the genqual + T3-oracle scorers. This is the single
bridge between training output and the eval pipeline.

### Stages 5–6: NTv3-only baselines

* Stage 5: `scripts/train_generation.py --head mdlm` (NTv3-MDLM on T1).
* Stage 6: `scripts/train_ntv3_direct.py` (NTv3-direct on T2).
* Both feed the "no LLM" rows in Table 1, the baseline against which
  the LLM's contribution is measured.

### Stage 7: Aggregator + final HF push

* `aggregate_results.py` walks `runs/`, collapses results by
  `(task, mode, prompt)`, and writes
  `/dev/shm/dnathinker/results/h100_snapshot.md`.
* HF push of metrics + genqual + h100_snapshot.md.
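
As a rough picture of the aggregation step, the sketch below collapses a list of result records by `(task, mode, prompt)` and renders a Markdown table. It is not the logic of `aggregate_results.py`: the record fields, the `score` metric name, and the "last record wins" rule are all assumptions made for illustration.

```python
def collapse(records):
    """Keep the last record seen for each (task, mode, prompt) key."""
    table = {}
    for rec in records:
        table[(rec["task"], rec["mode"], rec["prompt"])] = rec
    return table


def to_markdown(table, metric="score"):
    """Render the collapsed records as a Markdown table, sorted by key."""
    lines = [f"| task | mode | prompt | {metric} |", "|---|---|---|---|"]
    for (task, mode, prompt), rec in sorted(table.items()):
        lines.append(f"| {task} | {mode} | {prompt} | {rec[metric]} |")
    return "\n".join(lines)
```
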

## 2. Where Loop-SFT fits

Loop-SFT (`scripts/train_loop_sft.py`) is **not redundant** with RFT.
The two filter on different signals:

* **RFT** (Stage 3c) filters by *output objective*: generate K
  candidates and keep those whose **DNA sequence** satisfies the
  budget, motif, and activity-shift checks under the oracle. Improves
  the **final answer**.
* **Loop-SFT** filters by *trajectory*: keep traces whose
  intermediate tool calls and reasoning chain are correct. Improves
  the **reasoning chain that leads to the answer**.
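
The RFT filter described above reduces to a generate-then-filter loop. The sketch below is illustrative only: `generate` and `oracle_ok` are hypothetical callables standing in for the Stage A model and the oracle checks, and the output record shape is an assumption.

```python
from typing import Callable, Iterable, List


def rft_filter(
    prompts: Iterable[str],
    generate: Callable[[str, int], List[str]],
    oracle_ok: Callable[[str, str], bool],
    k: int = 4,
) -> List[dict]:
    """Sample k candidates per prompt and keep every one the oracle accepts."""
    kept = []
    for prompt in prompts:
        for candidate in generate(prompt, k):
            if oracle_ok(prompt, candidate):
                kept.append({"prompt": prompt, "answer": candidate})
    return kept
```

The kept rows then become the training set for the re-SFT pass. A Loop-SFT filter would have the same shape, but its predicate would inspect the intermediate trace rather than the final sequence.
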

The full T3 stack the paper aims for:

```
Fusion-SFT (heuristic) → Loop-SFT (trajectory-filtered) → RFT (oracle-filtered) → SV-GSPO (RL)
Stage A                  Stage A'                         Stage B                 Stage C
```

Stage A' (Loop-SFT) is **deferred** to a follow-up run because the
trajectory-trace dataset (`16K v9` in `t3_evaluation_design.md` §10)
is the lab's, not the H100's. The H100 ships:
- Stage A (the three `run_fusion` calls)
- Stage A's reasoning-only ablation (3b), equivalent to a
  cold-start Loop-SFT with no traces; an ablation meant to show that
  losing the heuristic DNA target doesn't tank the model
- Stage B (RFT, 3c)

When the lab finishes Loop-SFT on its side, the chain re-merges:
both teams point at the same `exp_t3_fusion_sft_*/final/pytorch_model.bin`,
the lab adds Loop-SFT on top, the H100 adds RFT on top, and we pick
whichever path scores higher on `eval_t3_oracle.py` for the paper.

## 3. Job map (current state, 2026-04-27 UTC)

```
H100 NVL
├── PID 100474 launch_bench_vllm.sh (orchestrator)
│   └── PID 121129 vLLM bench T2 zs_enriched (in flight)
│       queued: T3 zs_raw, T3 zs_enriched
└── PID 100544 watcher → post_bench_pipeline.sh (idle until 100474 exits)
```

ETAs (rough, post-T2 enriched completion):
* T3 raw + T3 enriched bench: ~5h each (10h total)
* Stage 0 + 0c (genqual + T3 oracle on zs preds): ~30 min
* Stages 1–3 fusion-SFT (3 × 35k × 1 epoch on H100 NVL): ~6–8h total
* Stage 3b reasoning-only: ~3h
* Stage 3c RFT generate + filter + re-SFT: ~5h
* Stage 4 joint multitask 105k: ~10h
* Stages 5–6 NTv3-only: ~2h each
* Stage 7 aggregator + HF push: minutes

Total ≈ 40 H100-hours when the ~10h of remaining T3 benches are
counted (the post-bench stages alone sum to roughly 30h). Tracked in
`runs/post_bench_pipeline.log`; `tail -f` for liveness.
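
The per-stage estimates above can be summed directly as a sanity check on the total. The snippet only restates the listed numbers as (low, high) hour ranges; Stage 7 is listed as "minutes" and is left out.

```python
# (low, high) estimates in hours, copied from the ETA list.
etas_hours = {
    "T3 raw + enriched bench": (10.0, 10.0),
    "Stage 0 + 0c": (0.5, 0.5),
    "Stages 1-3 fusion-SFT": (6.0, 8.0),
    "Stage 3b reasoning-only": (3.0, 3.0),
    "Stage 3c RFT": (5.0, 5.0),
    "Stage 4 joint multitask": (10.0, 10.0),
    "Stages 5-6 NTv3-only": (4.0, 4.0),  # ~2h each
}

low = sum(lo for lo, _ in etas_hours.values())
high = sum(hi for _, hi in etas_hours.values())
bench = etas_hours["T3 raw + enriched bench"][0]

print(f"with T3 benches:    {low}-{high} h")
print(f"without T3 benches: {low - bench}-{high - bench} h")
```

With the T3 benches included the range is 38.5 to 40.5 hours, which matches the ≈ 40 figure; without them it is 28.5 to 30.5 hours.
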

## 4. Paper-table to script map (cheat sheet)

| Table 1 row | Numbers come from | Per-cell breakdown? |
|---|---|---|
| Row 1 (zs_raw) | `runs/exp_t{1,2,3}_grid_*/zs_raw/genqual/genqual.json` | yes |
| Row 2 (zs_enriched) | `.../zs_enriched/genqual/genqual.json` | yes |
| Row 3 (LoRA, no NTv3) | DEFERRED (not in current pipeline) | |
| Row 4 (Fusion-SFT, per-task) | `runs/exp_t{1,2,3}_fusion_sft_*/predict_t{1,2,3}_{raw,enriched}/genqual/*.json` | yes (T1/T3); T2 has no per-cell breakdown because `pair_prediction` is binary |
| Row 4b (T3 reasoning-only) | `runs/exp_t3_fusion_sft_reasonly_*/predict_t3_*/genqual/...` | yes |
| Row 4c (T3 RFT) | `runs/exp_t3_fusion_sft_rft_*/predict_t3_*/genqual/...` | yes |
| **Headline (joint multitask)** | `runs/exp_joint_multitask_*/predict_t{1,2,3}_*/genqual/...` | yes |
| Row 5 (Loop-SFT) | lab side, slurm | |
| Row 6 (SV-GSPO) | lab side, slurm | |

The T3-specific paper section uses the **objective-satisfaction** metrics
from `eval_t3_oracle.py` (`within_budget`, `length_preserved`,
`objective_success_*`, `transfer_specificity`,
`in_budget_at_{5,10,20}pct`), not the heuristic-overlap genqual ones;
see `t3_evaluation_design.md` §2 for why.

## 5. Reasoning-trace augmentation (OpenRouter / Nemotron, free)

`scripts/build_reasoning_traces.py` rewrites the assistant turn in any
T1/T2/T3 SFT JSONL to include a single-shot rationale that wires the
enriched evidence (TFBS scan, expression context, motif hits) to the
gold answer. Output schema matches the parent project's existing
`pe_dataset_reasoning_expansion_*/jsonl/` files exactly:

```
<reasoning_start>RATIONALE</reasoning_end>
<enhancer_dna_start>SEQ</enhancer_dna_end>   # T1/T3
<pair_label>paired|not_paired</pair_label>   # T2
```
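
A small parser makes the schema above concrete. The tag names are taken verbatim from the schema (including the `_start` / `_end` pairing); the function name and the returned dictionary keys are hypothetical, and this is not the project's own parser.

```python
import re

REASONING = re.compile(r"<reasoning_start>(.*?)</reasoning_end>", re.DOTALL)
DNA = re.compile(r"<enhancer_dna_start>(.*?)</enhancer_dna_end>", re.DOTALL)
LABEL = re.compile(r"<pair_label>(paired|not_paired)</pair_label>")


def parse_assistant_turn(text: str) -> dict:
    """Extract the rationale plus whichever answer field is present."""

    def first(pattern):
        match = pattern.search(text)
        return match.group(1).strip() if match else None

    return {
        "rationale": first(REASONING),
        "dna": first(DNA),            # T1/T3 answers
        "pair_label": first(LABEL),   # T2 answers
    }
```

A T2 turn yields `dna=None` and a T1/T3 turn yields `pair_label=None`, so a downstream scorer can tell the task from which field is filled.
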

Reuses `regureasoner.loop.openrouter.OpenRouterClient` (same retry +
backoff client `expand_loop_trajectories.py` uses). Single API call
per row: the teacher only writes the *justification*, not the
answer, so small free-tier models (default
`nvidia/nemotron-nano-9b-v2:free`; switch to
`nvidia/llama-3.1-nemotron-70b-instruct:free` for richer rationales)
stay reliable.

**Resumable**: appends to the output JSONL; on startup it scans every
`id` already present and skips those rows in the source. Daily reruns
accumulate without overlap.
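
The resume behaviour can be sketched as follows. This is a minimal illustration under the assumption that every row carries a top-level `id` field, as the description implies; `done_ids` and `pending_rows` are hypothetical names, not functions from `build_reasoning_traces.py`.

```python
import json
from pathlib import Path


def done_ids(output_path: Path) -> set:
    """Collect every `id` already written to the output JSONL."""
    if not output_path.exists():
        return set()
    ids = set()
    with output_path.open() as fh:
        for line in fh:
            line = line.strip()
            if line:
                ids.add(json.loads(line)["id"])
    return ids


def pending_rows(source_rows, output_path: Path) -> list:
    """Return the source rows whose `id` is not yet in the output."""
    seen = done_ids(output_path)
    return [row for row in source_rows if row["id"] not in seen]
```

Because the check keys on `id` rather than on line position, appending new rows to the output never causes an earlier row to be requested twice.
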

**Budget**: `--max-requests` (default 1000) is the per-invocation
cap. OpenRouter free tier = 1000 req/day per key. With multiple keys,
the work can be sharded at line level via `--shard-index/--num-shards`.
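
One way line-level sharding with a request cap could work is shown below. The modulo assignment is an assumption (the text only says the sharding is line-level), and `shard_rows` is a hypothetical helper whose parameters simply mirror the CLI flags.

```python
from itertools import islice


def shard_rows(rows, shard_index: int, num_shards: int, max_requests: int = 1000) -> list:
    """Take every num_shards-th row starting at shard_index, capped at max_requests."""
    if not 0 <= shard_index < num_shards:
        raise ValueError("shard_index must be in [0, num_shards)")
    mine = (row for i, row in enumerate(rows) if i % num_shards == shard_index)
    return list(islice(mine, max_requests))
```

Under this scheme the shards are disjoint and together cover the source, so each key spends its daily quota on rows no other key touches.
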

**Daily-loop launcher**: `slurm/build_reasoning_traces_loop.sh`
sources `OPENROUTER_API_KEY` from `/dev/shm/dnathinker/.env`, walks
T1/T2/T3 with `PER_TASK=333` each (≈1000/day total), and
optionally `--daemon`s into a 24h sleep loop. Zero GPU; runs
alongside any training stage.

**SFT integration**: when ≥N augmented rows accumulate per task, point
`scripts/train_fusion_sft.py --train-jsonl` at
`/dev/shm/dnathinker/data/reasoning_traces/train.<task>.reasoning.jsonl`.
Same collator, same trainer; the only difference is that the assistant
turn now starts with `<reasoning_start>...</reasoning_end>`, so the
trained model emits explicit rationale + answer at inference time.
This is the **paper's "reasoning model" row** in T3's table; the
non-reasoning fusion-SFT runs (Stages 1–3) stay as the
no-rationale comparison.

**Per-task source JSONL (what the teacher justifies)**:

| Task | Source JSONL | Why |
|---|---|---|
| T1 | `train.enhancer_generation.strat7c.n35k.jsonl` (heuristic gold) | The heuristic gold is the empirical paired enhancer; teacher justifies why it pairs in this cell type. |
| T2 | `train.pair_prediction.strat7c.n35k.jsonl` (observed positive + pseudo-negative) | Teacher justifies the binary label using shared-TFBS / GC / expression evidence. |
| T3 | **post-RFT** `train.t3_rft.jsonl` | The heuristic gold for T3 is a synthetic motif-implant (not unique GT; see `t3_evaluation_design.md` §1). RFT (Stage 3c) replaces it with an oracle-validated candidate. Reasoning expansion **must run on the post-RFT JSONL** so the rationale justifies a sequence the oracle has actually scored, not the heuristic. Order: Fusion-SFT → RFT → reasoning expansion → reasoning-augmented Fusion-SFT. |

The launcher `slurm/build_reasoning_traces_loop.sh` defaults to the
heuristic-gold JSONLs for all three tasks. For T3, override
`T3_SRC=/dev/shm/dnathinker/runs/exp_t3_fusion_sft_rft_${STAMP}/.../train.t3_rft.jsonl`
once Stage 3c finishes; the loop's resume logic handles a mid-run
source swap because the augmented output JSONL keeps row ids.

## 6. Input sanitisation: applied globally before any model sees text

`regureasoner/utils/input_sanitize.py` (used by `PromptBuilder.user()`
and `build_reasoning_traces._format_user`) handles three classes of
issue at read-time, so we don't need to regenerate the prod JSONLs:

1. **Label leaks**: `peak_name=chr…`, `enhancer_peak_name=chr…`,
   the "Peak coordinates parsed to chr…:…" sentence, the "Observed
   dataset row is a released paired/not_paired link …" sentence
   (T2's biggest leak), and `label_source=…` lines.
2. **Unexplained proxy scores**: `Evolution proxy score …
   (expression_stability_proxy_v1)`, `promoter_likeness_score=…`,
   `quality_score / repeat_fraction / kmer_entropy_norm` (these are
   ad-hoc internal scores the model can't ground; we omit rather
   than try to explain in-prompt).
3. **Cell-type abbreviations** expanded: `cell_type=Ex` →
   `cell_type=Excitatory neuron (Ex)` so the model knows the biology.

Applied before any model call. Idempotent: running it twice yields
the same string. 12 unit tests cover every leak/score family +
idempotency + cell-type expansion (`tests/test_input_sanitize.py`).
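
To make the three classes concrete, here is a reduced sketch covering one or two patterns from each. It is not the contents of `input_sanitize.py`: the regexes, the `sanitize` name, and the single-entry cell-type map are assumptions chosen only to show the strip/expand steps and why the result is idempotent.

```python
import re

# Classes 1 and 2: key=value fields that are dropped outright (illustrative subset).
_STRIP_PATTERNS = [
    re.compile(r"\b(?:enhancer_)?peak_name=\S+\s*"),
    re.compile(r"\blabel_source=\S+\s*"),
    re.compile(r"\bpromoter_likeness_score=\S+\s*"),
]

# Class 3: abbreviation expansion; only the example from the text is listed.
_CELL_TYPES = {"Ex": "Excitatory neuron (Ex)"}
_CELL_TYPE = re.compile(r"cell_type=(\w+)")


def sanitize(text: str) -> str:
    """Strip leak/proxy fields and expand known cell-type abbreviations."""
    for pattern in _STRIP_PATTERNS:
        text = pattern.sub("", text)
    # Unknown values (including already-expanded ones) are left untouched,
    # which is what makes a second pass a no-op.
    text = _CELL_TYPE.sub(
        lambda m: "cell_type=" + _CELL_TYPES.get(m.group(1), m.group(1)), text
    )
    return text.strip()
```

On a second pass the strip patterns find nothing and the cell-type lookup sees `Excitatory`, which is not a key in the map, so the string is returned unchanged.
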

Why the reasoning-trace builder (§5) doesn't run *inside*
`post_bench_pipeline.sh`: the script is IO-bound (no GPU), capped at
1000 req/day, and meant to run for **multiple days** in the background.
Putting it in the GPU pipeline would either waste a single 1000-call
day or block the rest of the pipeline waiting for accumulation. The
right pattern is to launch `build_reasoning_traces_loop.sh --daemon`
once at the start of the campaign and let it accumulate rows
independently. When a critical mass exists, fire a single fusion-SFT
run on the augmented JSONL.