# Note to lab — kick off T1 Fusion-SFT NOW (lab side)

## Why now

The H100 GPU is saturated by the vLLM bench grid for the next ~8 hours
(T3 zs_raw running, T3 zs_enriched queued). The lab cluster has spare
GPUs — the best path to parallelising the headline result.

## What's missing

We have **no usable T1 fusion-SFT or LoRA result** anywhere:

* Lab smoke `exp_t1_grid_separatedQA_20260424_154915/lora_{raw,enriched}/`
  is **collapsed** — model output is `CTGCTGCTG...` repeated for 1790
  characters, with only 3 unique characters in the first 200. Length
  ratio 3.64–3.90; unusable.
* H100 hasn't run T1 fusion-SFT yet (queued as Stage 1 of the
  post-bench pipeline; it fires only after the bench grid exits).

So **T1 fusion-SFT is the single biggest missing piece for the
paper's Table 1 row 4** ("Fusion SFT, per task").
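
Collapse like this is cheap to detect before burning any eval time. A
minimal sketch (the helper name and filenames are ours, not repo tooling):

```shell
# uniq200: number of distinct characters in the first 200 bytes of $1.
# Healthy enhancer output should use all four of A/C/G/T; collapsed
# output like CTGCTGCTG... uses only 2-3.
uniq200() { head -c 200 "$1" | fold -w1 | sort -u | tr -d '\n' | wc -c; }
```

Run it on the extracted sequence text; anything below 4 over a
200-character window is a strong collapse signal.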

## Critical pre-flight

You **must** be on a checkout that includes
[`bda9ee0`](https://github.com/explcre/biomodel_reasoning_calling_study2/commit/bda9ee0)
("CRITICAL FIX: SFT collators were bypassing the sanitiser") before
launching. Without that commit the trainer reads raw user content —
including `peak_name=`, the T2 "Observed dataset row is a released
paired link …" leak, `label_source=`, `Evolution proxy score …`, and
so on — and training on leaky data invalidates the run.

```bash
git fetch origin mllm-integrate-server2
git checkout mllm-integrate-server2
git log -1 --oneline  # should be >= bda9ee0 (currently 25b6a4c is head)
pytest regureasoner_loop/tests/test_sft_collator.py::test_collator_strips_leak_terms_before_tokenisation
# expect: 1 passed
```
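
`git log -1` only shows the branch head; to assert the fix commit is
actually reachable from your checkout, `git merge-base --is-ancestor`
is stricter (it exits 0 exactly when the first commit is an ancestor of
the second). A small helper sketch — the function name is ours:

```shell
# has_commit: succeed iff commit $1 is an ancestor of HEAD in repo $2.
has_commit() { git -C "$2" merge-base --is-ancestor "$1" HEAD; }
```

From the repo root: `has_commit bda9ee0 . || echo "MISSING sanitiser fix, do not launch"`.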

## Launch recipe — T1 Fusion-SFT (LLaVA mode, current default)

```bash
STAMP=${STAMP:-$(date +%Y%m%d_%H%M%S)}  # define once so all paths below agree
sbatch \
  --nodelist=laniakea --partition=zhanglab.p \
  --gres=gpu:1 --cpus-per-task=8 --mem=120000M --time=12:00:00 \
  --job-name=t1_fusion_sft \
  --export=ALL \
  --wrap="cd $PWD/regureasoner_loop && \
    source slurm/_pixi_env.sh && \
    \$PYTHON_BIN scripts/train_fusion_sft.py \
      --train-jsonl /extra/zhanglab0/INDV/pengchx3/regureasoner_loop/data/prod_samples/train.enhancer_generation.strat7c.n35k.jsonl \
      --eval-jsonl /extra/zhanglab0/INDV/pengchx3/regureasoner_loop/data/prod_samples/test.enhancer_generation.strat7c.n7k.jsonl \
      --output-dir /extra/zhanglab0/INDV/pengchx3/regureasoner_loop/runs/exp_t1_fusion_sft_lab_${STAMP} \
      --llm-name Qwen/Qwen3.5-2B \
      --dna-model-key ntv3-650m \
      --ntv3-snapshot-path /extra/zhanglab0/INDV/pengchx3/ntv3_local/generative \
      --batch-size 4 --grad-accum 2 --lr 2e-5 --epochs 1 --max-length 2048 \
      --save-strategy steps --save-steps 1000 \
      --architecture-mode llava \
      --wandb-project dnathinker --wandb-run-name t1_fusion_sft_lab_${STAMP} \
      --keep-top-k 3 --best-metric-key eval_loss --best-metric-mode min"
```

Same recipe as H100's `post_bench_pipeline.sh` Stage 1 — just with
`--output-dir` on `/extra/zhanglab0` so we don't trip H100's tmpfs.
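
For sizing `--save-steps`, a quick sanity check on optimizer steps per
epoch (the 35k example count is read off the train filename; effective
batch = batch-size × grad-accum):

```shell
# With 35k examples, batch 4 and grad-accum 2, one epoch is 4375
# optimizer steps, so --save-steps 1000 yields ~4 checkpoints.
examples=35000; bs=4; accum=2
steps=$(( examples / (bs * accum) ))
echo "optimizer steps per epoch: $steps"
```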

## After it lands

```bash
# Push to HF so H100 can pull it for inference (reuse the training job's ${STAMP})
python regureasoner_loop/scripts/sync_checkpoints.py \
  --src /extra/zhanglab0/INDV/pengchx3/regureasoner_loop/runs/exp_t1_fusion_sft_lab_${STAMP}/final \
  --dest runs/exp_t1_fusion_sft_lab_${STAMP}/final \
  --repo-id explcre/dnathinker-checkpoints
```

H100 will then pull the weights, run `predict_fusion.py` on the full
T1 test set, and score with the new `run_generation_eval.py` (Stage 1b
equivalent) → the headline T1 fusion-SFT row.
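
If you want the push to fire unattended as soon as training finishes, a
tiny polling helper works (the function is ours, not repo tooling; the
`final/` layout is from the recipe above):

```shell
# wait_for_dir: block until directory $1 exists, polling every $2 seconds.
wait_for_dir() { until [ -d "$1" ]; do sleep "$2"; done; }
```

Then `wait_for_dir "$RUN_DIR/final" 60` followed by the sync command
above chains the push to training completion (`RUN_DIR` stands for the
`--output-dir` of the launch recipe).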

## Wall-clock estimate

* Training: ~6 h on a single A100/H800 with `--batch-size 4 --grad-accum 2`.
* H100 sync + bench eval: ~2 h after the weights land.

So **~8 h end-to-end vs ~14 h if we wait for H100's bench grid**.

## Architecture-mode ablation (fire alongside if you have GPUs)

`slurm/run_unified_arch_ablation.sh` fires three jobs (llava control,
unified+ntp, unified+mdlm) on the same data: ~10 h each on one GPU, or
in parallel across three GPUs. This adds the Phase-2 row to Table 3.

## Bonus: T2 + T3 fusion-SFT too

Same pattern with `--train-jsonl …pair_prediction.strat7c.n35k.jsonl`
and `…enhancer_editing.strat7c.n35k.jsonl`. Consider firing all three
tasks in parallel if you have the GPUs.
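
Firing all three is one loop over the task names (a sketch; the task
names are taken from the dataset filenames quoted above, and the sbatch
line is elided):

```shell
# One launch per task; replace echo with the sbatch command from the
# recipe above, swapping in $train (and the matching test split).
for task in enhancer_generation pair_prediction enhancer_editing; do
    train="data/prod_samples/train.${task}.strat7c.n35k.jsonl"
    echo "launch fusion-SFT on $train"
done
```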

---

The reaper on H100 (`/dev/shm/dnathinker/genqual_reaper_v5.sh`) will
auto-score any `predictions.jsonl` that lands in
`runs/exp_t*_grid_separatedQA_20260426_h100_vllm_full/*/`. So as soon
as the bench finishes a task, the new-framework metrics
(`eval_t3_oracle.py` for T3, `run_generation_eval.py` for T1/T3)
fire within ~30 seconds — no waiting for the post-bench pipeline.
The same works for any adapter predictions the lab produces, as long
as you drop them into a path the reaper watches.
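
The reaper pattern itself is simple enough to sketch: scan a watched
root, score anything new, and leave a marker so nothing is scored twice
(paths and the scoring step are placeholders, not the real reaper
script):

```shell
# scan_once: score each unscored predictions.jsonl under root $1.
# A .scored marker next to each file makes the scan idempotent.
scan_once() {
    for p in "$1"/*/predictions.jsonl; do
        [ -e "$p" ] || continue           # glob matched nothing
        [ -e "$p.scored" ] && continue    # already handled
        echo "scoring $p"                 # stand-in for the eval command
        : > "$p.scored"
    done
}
```

The real reaper just loops this on a sleep:
`while true; do scan_once "$ROOT"; sleep 30; done`.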