Note to lab — kick off T1 Fusion-SFT NOW (lab side)
Why now
The H100 is saturated by the vLLM bench grid for the next ~8 hours (T3 zs_raw running, T3 zs_enriched queued). The lab cluster has spare GPUs, so running T1 fusion-SFT there in parallel is the fastest path to the headline result.
What's missing
We have no usable T1 fusion-SFT or LoRA result anywhere:
- Lab smoke exp_t1_grid_separatedQA_20260424_154915/lora_{raw,enriched}/ is collapsed: model output is CTGCTGCTG... repeated for 1790 characters, 3 unique chars in the first 200. Length ratio 3.64–3.90, unusable.
- H100 hasn't run T1 fusion-SFT yet (queued as Stage 1 of the post-bench pipeline; fires after the bench grid exits).
So T1 fusion-SFT is the single biggest missing piece for the paper's Table 1 row 4 ("Fusion SFT, per task").
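For triaging future smoke runs, the collapse signature above (few unique characters at the start, inflated length ratio) can be checked mechanically. A minimal sketch: the function names, the 200-character window and the "fewer than 4 distinct symbols" threshold are assumptions chosen to match the numbers quoted above, and length ratio is assumed to mean predicted length over reference length.

```python
def is_collapsed(text: str, window: int = 200, min_unique: int = 4) -> bool:
    """Flag output whose first `window` characters use fewer than
    `min_unique` distinct symbols (a healthy DNA string uses A, C, G, T)."""
    head = text[:window]
    return bool(head) and len(set(head)) < min_unique


def length_ratio(prediction: str, reference: str) -> float:
    """Predicted length divided by reference length (assumed definition)."""
    return len(prediction) / max(len(reference), 1)


# Hypothetical example mirroring the collapsed lab smoke output.
collapsed = ("CTG" * 600)[:1790]
print(is_collapsed(collapsed))        # True: only C, T, G appear
print(is_collapsed("ACGT" * 50))      # False: all four bases present
```

A run whose predictions mostly trip `is_collapsed` or show a length ratio far from 1 is probably not worth scoring.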
Critical pre-flight
You must be on a checkout that includes bda9ee0 ("CRITICAL FIX: SFT collators were bypassing the sanitiser") before launching. Without that commit the trainer reads raw user content, including peak_name=, the T2 "Observed dataset row is a released paired link …" leak, label_source=, Evolution proxy score …, etc. Training on leaky data invalidates the run.
git fetch origin mllm-integrate-server2
git checkout mllm-integrate-server2
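git log -1 --oneline # currently 25b6a4c is head
git merge-base --is-ancestor bda9ee0 HEAD && echo "bda9ee0 included" # hashes are not ordered, so check ancestry instead of comparing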
pytest regureasoner_loop/tests/test_sft_collator.py::test_collator_strips_leak_terms_before_tokenisation
# expect: 1 passed
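If you want a second, independent check beyond the pytest, you can scan sanitised text (e.g. a decoded collator batch) for the leak markers named above. This is only a sketch: `LEAK_MARKERS` and `find_leaks` are hypothetical names, the marker list is copied from this note, and the real sanitiser's term list may be longer.

```python
# Leak markers quoted in this note; the real sanitiser may cover more terms.
LEAK_MARKERS = (
    "peak_name=",
    "label_source=",
    "Observed dataset row is a released paired link",
    "Evolution proxy score",
)


def find_leaks(text: str, markers=LEAK_MARKERS) -> list:
    """Return every marker that still appears in `text`."""
    return [m for m in markers if m in text]


# Hypothetical strings standing in for pre- and post-sanitisation content.
raw = "peak_name=chr1_100_600 label_source=manual Design an enhancer."
clean = "Design an enhancer."
print(find_leaks(raw))    # ['peak_name=', 'label_source=']
print(find_leaks(clean))  # []
```

Run it on text after sanitisation; the raw JSONL is expected to contain these markers, so scanning that directly will always flag.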
Launch recipe — T1 Fusion-SFT (LLaVA mode, current default)
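STAMP=$(date +%Y%m%d_%H%M%S) # assumed format, matching earlier run names; set once, reused in the sync step
sbatch \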
--nodelist=laniakea --partition=zhanglab.p \
--gres=gpu:1 --cpus-per-task=8 --mem=120000M --time=12:00:00 \
--job-name=t1_fusion_sft \
--export=ALL \
--wrap="cd $PWD/regureasoner_loop && \
source slurm/_pixi_env.sh && \
\$PYTHON_BIN scripts/train_fusion_sft.py \
--train-jsonl /extra/zhanglab0/INDV/pengchx3/regureasoner_loop/data/prod_samples/train.enhancer_generation.strat7c.n35k.jsonl \
--eval-jsonl /extra/zhanglab0/INDV/pengchx3/regureasoner_loop/data/prod_samples/test.enhancer_generation.strat7c.n7k.jsonl \
--output-dir /extra/zhanglab0/INDV/pengchx3/regureasoner_loop/runs/exp_t1_fusion_sft_lab_${STAMP} \
--llm-name Qwen/Qwen3.5-2B \
--dna-model-key ntv3-650m \
--ntv3-snapshot-path /extra/zhanglab0/INDV/pengchx3/ntv3_local/generative \
--batch-size 4 --grad-accum 2 --lr 2e-5 --epochs 1 --max-length 2048 \
--save-strategy steps --save-steps 1000 \
--architecture-mode llava \
--wandb-project dnathinker --wandb-run-name t1_fusion_sft_lab_${STAMP} \
--keep-top-k 3 --best-metric-key eval_loss --best-metric-mode min"
Same recipe as H100's post_bench_pipeline.sh Stage 1 — just with
--output-dir on /extra/zhanglab0 so we don't trip H100's tmpfs.
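To sanity-check the recipe's flags, the step count and checkpoint cadence can be worked out ahead of time. A rough sketch with assumptions flagged: the train set is taken to hold about 35,000 examples (inferred from "n35k" in the filename), --save-steps is assumed to count optimizer steps on a single GPU, and `training_schedule` is a hypothetical helper, not part of train_fusion_sft.py.

```python
import math


def training_schedule(n_examples, batch_size, grad_accum, epochs=1, save_steps=1000):
    """Return (effective batch, optimizer steps, step numbers that get saved)."""
    effective_batch = batch_size * grad_accum
    steps = math.ceil(n_examples / effective_batch) * epochs
    checkpoints = list(range(save_steps, steps + 1, save_steps))
    return effective_batch, steps, checkpoints


# Assumed dataset size of 35,000; flags taken from the sbatch recipe.
eff, steps, ckpts = training_schedule(35_000, batch_size=4, grad_accum=2)
print(eff, steps, ckpts)  # 8 4375 [1000, 2000, 3000, 4000]
```

Under these assumptions the run writes four step checkpoints, so --keep-top-k 3 would prune one of them by eval_loss, and the ~6 h training estimate works out to roughly 5 s per optimizer step.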
After it lands
# Push to HF so H100 can pull it for inference
python regureasoner_loop/scripts/sync_checkpoints.py \
--src /extra/zhanglab0/INDV/pengchx3/regureasoner_loop/runs/exp_t1_fusion_sft_lab_${STAMP}/final \
--dest runs/exp_t1_fusion_sft_lab_${STAMP}/final \
--repo-id explcre/dnathinker-checkpoints
H100 will then pull and run predict_fusion.py on the full T1 test
set + score with the new run_generation_eval.py (Stage 1b
equivalent) → headline T1 fusion-SFT row.
Wall-clock estimate
- Training: ~6 h on a single A100/H800 with batch_size=4 grad_accum=2.
- H100 sync + bench eval: ~2 h after weights land.
So ~8 h end-to-end (6 h training + 2 h sync and eval) vs ~14 h if we wait for H100's bench grid.
Architecture-mode ablation (fire alongside if you have GPUs)
slurm/run_unified_arch_ablation.sh fires three jobs (llava control +
unified+ntp + unified+mdlm) on the same data. ~10h each on one GPU,
or parallel across three GPUs. Adds the Phase-2 row to Table 3.
Bonus: T2 + T3 fusion-SFT also
Same pattern with --train-jsonl …pair_prediction.strat7c.n35k.jsonl
and …enhancer_editing.strat7c.n35k.jsonl. Consider firing all three
in parallel if you have the GPUs.
The reaper on H100 (/dev/shm/dnathinker/genqual_reaper_v5.sh) will
auto-score any predictions.jsonl that lands in
runs/exp_t*_grid_separatedQA_20260426_h100_vllm_full/*/. So as soon
as the bench finishes a task, the new-framework metrics
(eval_t3_oracle.py for T3, run_generation_eval.py for T1/T3)
fire within 30 seconds — no waiting for the post-bench pipeline.
The same will work for any adapter predictions the lab produces, provided you
drop them into a path the reaper watches.
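For anyone wiring up a similar watcher on the lab side, the polling pattern looks roughly like this. It is an illustrative sketch only, not the contents of genqual_reaper_v5.sh: `find_unscored`, `watch`, and the `metrics.json` "already scored" marker are hypothetical, and only the 30-second interval comes from this note.

```python
import time
from pathlib import Path


def find_unscored(root, pattern="*/predictions.jsonl", marker="metrics.json"):
    """List prediction files under `root` whose directory has no marker yet."""
    return sorted(
        p for p in Path(root).glob(pattern) if not (p.parent / marker).exists()
    )


def watch(root, score, interval_s=30, max_polls=None):
    """Poll `root` and call `score(path)` on each unscored predictions file.

    `score` is expected to write the marker file, otherwise the same
    predictions are picked up again on the next poll.
    """
    polls = 0
    while max_polls is None or polls < max_polls:
        for pred in find_unscored(root):
            score(pred)
        polls += 1
        if max_polls is None or polls < max_polls:
            time.sleep(interval_s)
```

Calling `watch(run_dir, score_fn)` loops forever; pass `max_polls` for a bounded dry run.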