| # Note to lab — kick off T1 Fusion-SFT NOW (lab side) |
|
|
| ## Why now |
|
|
| The H100 is saturated by the vLLM bench grid for the next ~8 hours |
| (T3 zs_raw running, T3 zs_enriched queued). The lab cluster has spare |
| GPUs, so running there is the best way to parallelise the headline result. |
|
|
| ## What's missing |
|
|
| We have **no usable T1 fusion-SFT or LoRA result** anywhere: |
|
|
| * Lab smoke `exp_t1_grid_separatedQA_20260424_154915/lora_{raw,enriched}/` |
| is **collapsed**: model output is `CTGCTGCTG...` repeated for 1790 |
| characters, with only 3 unique chars in the first 200. Length ratio is |
| 3.64–3.90, so the result is unusable. |
| * H100 hasn't run T1 fusion-SFT yet (queued as Stage 1 of the |
| post-bench pipeline; fires after the bench grid exits). |
|
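The collapse symptoms listed above (a short motif repeated, few distinct characters in the first 200) can be screened for before running a full eval. This is a minimal sketch, assuming the decoded model output is available as a plain string. `unique_chars_in_prefix` and `repeat_period` are hypothetical helper names, not functions from the repo, and the period cap of 10 is an arbitrary choice.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for spotting collapsed generations; not part of the repo.

# Count distinct characters in the first N characters of a string (default 200).
unique_chars_in_prefix() {
  local text="$1" n="${2:-200}"
  printf '%s' "${text:0:$n}" | fold -w1 | sort -u | wc -l | tr -d '[:space:]'
}

# Print the shortest repeat period (up to MAX, default 10) that explains the
# whole prefix, or 0 if none does. At least two full repeats are required.
repeat_period() {
  local text="$1" n="${2:-200}" max_period="${3:-10}"
  local prefix="${text:0:$n}"
  local len=${#prefix} p i
  for ((p = 1; p <= max_period && 2 * p <= len; p++)); do
    for ((i = p; i < len; i++)); do
      # Every character must equal the one p positions earlier.
      [[ "${prefix:$i:1}" == "${prefix:$((i - p)):1}" ]] || continue 2
    done
    echo "$p"
    return 0
  done
  echo 0
}

# Stand-in for a collapsed output: "CTG" repeated 600 times (1800 chars).
sample="$(printf 'CTG%.0s' {1..600})"
echo "unique=$(unique_chars_in_prefix "$sample") period=$(repeat_period "$sample")"
```

A repeat period of 3 with 3 unique characters matches the `CTGCTGCTG...` pattern from the lab smoke run. A period of 0 means no short repeat explains the prefix.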
|
| So **T1 fusion-SFT is the single biggest missing piece for the |
| paper's Table 1 row 4** ("Fusion SFT, per task"). |
|
|
| ## Critical pre-flight |
|
|
| You **must** be on a checkout that includes |
| [`bda9ee0`](https://github.com/explcre/biomodel_reasoning_calling_study2/commit/bda9ee0) |
| ("CRITICAL FIX: SFT collators were bypassing the sanitiser") before |
| launching. Without that commit the trainer reads raw user content |
| including `peak_name=`, the T2 "Observed dataset row is a released |
| paired link …" leak, `label_source=`, `Evolution proxy score …`, etc. |
| — training on leaky data invalidates the run. |
|
|
| ```bash |
| git fetch origin mllm-integrate-server2 |
| git checkout mllm-integrate-server2 |
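| git log -1 --oneline # head was 25b6a4c at time of writing |
| git merge-base --is-ancestor bda9ee0 HEAD && echo "bda9ee0 present" # hashes are not ordered, so test ancestry |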
| pytest regureasoner_loop/tests/test_sft_collator.py::test_collator_strips_leak_terms_before_tokenisation |
| # expect: 1 passed |
| ``` |
|
|
| ## Launch recipe — T1 Fusion-SFT (LLaVA mode, current default) |
|
|
| ```bash |
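| # Set STAMP before submitting: ${STAMP} inside --wrap expands at submit time. |
| # The timestamp format is an assumption, modelled on existing run dirs (e.g. 20260424_154915). |
| STAMP=$(date +%Y%m%d_%H%M%S) |
| sbatch \ |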
| --nodelist=laniakea --partition=zhanglab.p \ |
| --gres=gpu:1 --cpus-per-task=8 --mem=120000M --time=12:00:00 \ |
| --job-name=t1_fusion_sft \ |
| --export=ALL \ |
| --wrap="cd $PWD/regureasoner_loop && \ |
| source slurm/_pixi_env.sh && \ |
| \$PYTHON_BIN scripts/train_fusion_sft.py \ |
| --train-jsonl /extra/zhanglab0/INDV/pengchx3/regureasoner_loop/data/prod_samples/train.enhancer_generation.strat7c.n35k.jsonl \ |
| --eval-jsonl /extra/zhanglab0/INDV/pengchx3/regureasoner_loop/data/prod_samples/test.enhancer_generation.strat7c.n7k.jsonl \ |
| --output-dir /extra/zhanglab0/INDV/pengchx3/regureasoner_loop/runs/exp_t1_fusion_sft_lab_${STAMP} \ |
| --llm-name Qwen/Qwen3.5-2B \ |
| --dna-model-key ntv3-650m \ |
| --ntv3-snapshot-path /extra/zhanglab0/INDV/pengchx3/ntv3_local/generative \ |
| --batch-size 4 --grad-accum 2 --lr 2e-5 --epochs 1 --max-length 2048 \ |
| --save-strategy steps --save-steps 1000 \ |
| --architecture-mode llava \ |
| --wandb-project dnathinker --wandb-run-name t1_fusion_sft_lab_${STAMP} \ |
| --keep-top-k 3 --best-metric-key eval_loss --best-metric-mode min" |
| ``` |
|
|
| Same recipe as H100's `post_bench_pipeline.sh` Stage 1 — just with |
| `--output-dir` on `/extra/zhanglab0` so we don't trip H100's tmpfs. |
|
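To sanity-check `--save-steps` and `--keep-top-k` against the recipe, the step count can be estimated with shell arithmetic. This is a rough sketch with three assumptions that the script itself does not confirm: the training file holds exactly 35,000 rows (inferred only from the `n35k` in its name), the job uses a single GPU, and `--save-steps` counts optimizer steps rather than micro-batches.

```bash
#!/usr/bin/env bash
# Assumed inputs: row count inferred from the "n35k" file name; the rest
# copied from the launch recipe flags.
train_rows=35000
batch_size=4
grad_accum=2
epochs=1
save_steps=1000

# Samples consumed per optimizer step on one GPU.
effective_batch=$((batch_size * grad_accum))
# Ceiling division: a partial last batch still costs one step.
steps_per_epoch=$(( (train_rows + effective_batch - 1) / effective_batch ))
total_steps=$((steps_per_epoch * epochs))
# Checkpoints written by the periodic save, excluding any final save.
periodic_checkpoints=$((total_steps / save_steps))

echo "effective_batch=$effective_batch total_steps=$total_steps periodic_checkpoints=$periodic_checkpoints"
```

Under those assumptions the run takes about 4375 optimizer steps and writes four periodic checkpoints, so `--keep-top-k 3` would presumably discard at least one of them.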
|
| ## After it lands |
|
|
| ```bash |
| # Push to HF so H100 can pull it for inference |
| python regureasoner_loop/scripts/sync_checkpoints.py \ |
| --src /extra/zhanglab0/INDV/pengchx3/regureasoner_loop/runs/exp_t1_fusion_sft_lab_${STAMP}/final \ |
| --dest runs/exp_t1_fusion_sft_lab_${STAMP}/final \ |
| --repo-id explcre/dnathinker-checkpoints |
| ``` |
|
|
| H100 will then pull the checkpoint, run `predict_fusion.py` on the full T1 test |
| set, and score the predictions with the new `run_generation_eval.py` (Stage 1b |
| equivalent). That gives the headline T1 fusion-SFT row. |
|
|
| ## Wall-clock estimate |
|
|
| * Training: ~6 h on a single A100/H800 with batch_size=4 grad_accum=2. |
| * H100 sync + bench eval: ~2 h after weights land. |
|
|
| So **~8 h end-to-end (6 h training + 2 h sync and eval) vs ~14 h if we wait for H100's bench grid**. |
|
|
| ## Architecture-mode ablation (fire alongside if you have GPUs) |
|
|
| `slurm/run_unified_arch_ablation.sh` fires three jobs (llava control, |
| unified+ntp, unified+mdlm) on the same data. Each takes ~10 h on one GPU, |
| so run them in parallel across three GPUs if you can. Adds the Phase-2 row to Table 3. |
|
|
| ## Bonus: T2 + T3 fusion-SFT too |
|
|
| Same pattern with `--train-jsonl …pair_prediction.strat7c.n35k.jsonl` |
| and `…enhancer_editing.strat7c.n35k.jsonl`. Consider firing all three |
| in parallel if you have the GPUs. |
|
|
| --- |
|
|
| The reaper on H100 (`/dev/shm/dnathinker/genqual_reaper_v5.sh`) will |
| auto-score any predictions.jsonl that lands in |
| `runs/exp_t*_grid_separatedQA_20260426_h100_vllm_full/*/`. So as soon |
| as the bench finishes a task, the new-framework metrics |
| (`eval_t3_oracle.py` for T3, `run_generation_eval.py` for T1/T3) |
| fire within 30 seconds — no waiting for the post-bench pipeline. |
| The same works for any adapter predictions the lab produces, provided you |
| drop them into a path the reaper watches. |
|
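The watch-and-score behaviour described above can be illustrated with a small polling helper. This is only a sketch of the general pattern, not the contents of `genqual_reaper_v5.sh`: the function name `list_unscored` and the `.scored` marker file are hypothetical, chosen here for illustration.

```bash
#!/usr/bin/env bash
# Print every predictions.jsonl under ROOT whose directory has no marker file.
# ".scored" is a hypothetical marker name, not taken from the real reaper.
list_unscored() {
  local root="$1" marker="${2:-.scored}" pred
  find "$root" -type f -name predictions.jsonl | sort | while IFS= read -r pred; do
    [ -e "$(dirname "$pred")/$marker" ] || printf '%s\n' "$pred"
  done
}

# Demo on a throwaway directory: run "a" is pending, run "b" is already scored.
demo_root="$(mktemp -d)"
mkdir -p "$demo_root/a" "$demo_root/b"
touch "$demo_root/a/predictions.jsonl" "$demo_root/b/predictions.jsonl" "$demo_root/b/.scored"
list_unscored "$demo_root"
```

Calling such a helper in a loop with a 30-second sleep would match the latency stated above. Whether the real reaper uses marker files or some other bookkeeping is not specified here.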
|