# Note to lab — kick off T1 Fusion-SFT NOW (lab side)

## Why now

The H100 GPU is saturated by the vLLM bench grid for the next ~8 hours
(T3 zs_raw running, T3 zs_enriched queued). The lab cluster has spare
GPUs — it is the best path to parallelising the headline result.

## What's missing

We have **no usable T1 fusion-SFT or LoRA result** anywhere:

* Lab smoke `exp_t1_grid_separatedQA_20260424_154915/lora_{raw,enriched}/`
  is **collapsed** — model output is `CTGCTGCTG...` repeated for 1790
  characters, with 3 unique characters in the first 200. Length ratio
  3.64–3.90; unusable.
* The H100 hasn't run T1 fusion-SFT yet (queued as Stage 1 of the
  post-bench pipeline; it fires after the bench grid exits).

So **T1 fusion-SFT is the single biggest missing piece for the
paper's Table 1 row 4** ("Fusion SFT, per task").

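The collapse check quoted above (unique characters in the first 200 of the output) is quick to reproduce in the shell. A minimal sketch; the helper name is ours and the file path is illustrative:

```bash
# Count distinct characters in the first 200 chars of a prediction file.
# A healthy enhancer sequence should use all of A/C/G/T; collapsed output
# like CTGCTGCTG... shows only 2-3 distinct characters.
unique_chars() {
  head -c 200 "$1" | grep -o . | sort -u | wc -l | tr -d ' '
}
```

For example, `unique_chars lora_raw/pred.txt` returning 3 or fewer is a strong sign the run degenerated and should be triaged before any scoring.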
## Critical pre-flight

You **must** be on a checkout that includes
[`bda9ee0`](https://github.com/explcre/biomodel_reasoning_calling_study2/commit/bda9ee0)
("CRITICAL FIX: SFT collators were bypassing the sanitiser") before
launching. Without that commit the trainer reads raw user content,
including `peak_name=`, the T2 "Observed dataset row is a released
paired link …" leak, `label_source=`, `Evolution proxy score …`, etc.
— training on leaky data invalidates the run.

```bash
git fetch origin mllm-integrate-server2
git checkout mllm-integrate-server2
git log -1 --oneline  # should be >= bda9ee0 (currently 25b6a4c is head)
pytest regureasoner_loop/tests/test_sft_collator.py::test_collator_strips_leak_terms_before_tokenisation
# expect: 1 passed
```

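Since `git log -1` only shows the head commit, a more direct check that `bda9ee0` is actually an ancestor of your checkout may help; this is a convenience sketch (the helper name is ours, not part of the recipe):

```bash
# Fail loudly if a required fix commit is not in this checkout's history.
# Uses git merge-base --is-ancestor, which exits 0 iff $1 is an ancestor
# of HEAD.
require_commit() {
  if git merge-base --is-ancestor "$1" HEAD 2>/dev/null; then
    echo "ok: $1 is in history"
  else
    echo "MISSING: $1 not in history; do not launch" >&2
    return 1
  fi
}
```

Usage: `require_commit bda9ee0` before submitting the job.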
## Launch recipe — T1 Fusion-SFT (LLaVA mode, current default)

```bash
sbatch \
  --nodelist=laniakea --partition=zhanglab.p \
  --gres=gpu:1 --cpus-per-task=8 --mem=120000M --time=12:00:00 \
  --job-name=t1_fusion_sft \
  --export=ALL \
  --wrap="cd $PWD/regureasoner_loop && \
    source slurm/_pixi_env.sh && \
    \$PYTHON_BIN scripts/train_fusion_sft.py \
      --train-jsonl /extra/zhanglab0/INDV/pengchx3/regureasoner_loop/data/prod_samples/train.enhancer_generation.strat7c.n35k.jsonl \
      --eval-jsonl /extra/zhanglab0/INDV/pengchx3/regureasoner_loop/data/prod_samples/test.enhancer_generation.strat7c.n7k.jsonl \
      --output-dir /extra/zhanglab0/INDV/pengchx3/regureasoner_loop/runs/exp_t1_fusion_sft_lab_${STAMP} \
      --llm-name Qwen/Qwen3.5-2B \
      --dna-model-key ntv3-650m \
      --ntv3-snapshot-path /extra/zhanglab0/INDV/pengchx3/ntv3_local/generative \
      --batch-size 4 --grad-accum 2 --lr 2e-5 --epochs 1 --max-length 2048 \
      --save-strategy steps --save-steps 1000 \
      --architecture-mode llava \
      --wandb-project dnathinker --wandb-run-name t1_fusion_sft_lab_${STAMP} \
      --keep-top-k 3 --best-metric-key eval_loss --best-metric-mode min"
```

Same recipe as H100's `post_bench_pipeline.sh` Stage 1 — just with
`--output-dir` on `/extra/zhanglab0` so we don't trip H100's tmpfs.

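Note the recipe interpolates `${STAMP}` but never defines it; it is presumably a run timestamp exported in the launching shell before `sbatch`. A minimal sketch (the exact format is our assumption):

```bash
# Export a run timestamp before calling sbatch, e.g. 20260426_153000,
# so --output-dir and the wandb run name share the same suffix.
export STAMP=$(date +%Y%m%d_%H%M%S)
```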
## After it lands

```bash
# Push to HF so the H100 can pull it for inference
python regureasoner_loop/scripts/sync_checkpoints.py \
  --src /extra/zhanglab0/INDV/pengchx3/regureasoner_loop/runs/exp_t1_fusion_sft_lab_${STAMP}/final \
  --dest runs/exp_t1_fusion_sft_lab_${STAMP}/final \
  --repo-id explcre/dnathinker-checkpoints
```

The H100 will then pull the weights, run `predict_fusion.py` on the full
T1 test set, and score it with the new `run_generation_eval.py` (Stage 1b
equivalent) → the headline T1 fusion-SFT row.

## Wall-clock estimate

* Training: ~6 h on a single A100/H800 with `batch_size=4`, `grad_accum=2`.
* H100 sync + bench eval: ~2 h after the weights land.

So **~8 h end-to-end vs ~14 h if we wait for H100's bench grid**.

## Architecture-mode ablation (fire alongside if you have GPUs)

`slurm/run_unified_arch_ablation.sh` fires three jobs (llava control +
unified+ntp + unified+mdlm) on the same data: ~10 h each on one GPU,
or in parallel across three GPUs. This adds the Phase-2 row to Table 3.

## Bonus: T2 + T3 fusion-SFT as well

Same pattern with `--train-jsonl …pair_prediction.strat7c.n35k.jsonl`
and `…enhancer_editing.strat7c.n35k.jsonl`. Consider firing all three
tasks in parallel if you have the GPUs.

---

The reaper on the H100 (`/dev/shm/dnathinker/genqual_reaper_v5.sh`) will
auto-score any `predictions.jsonl` that lands in
`runs/exp_t*_grid_separatedQA_20260426_h100_vllm_full/*/`. So as soon
as the bench finishes a task, the new-framework metrics
(`eval_t3_oracle.py` for T3, `run_generation_eval.py` for T1/T3)
fire within ~30 seconds — no waiting for the post-bench pipeline.
The same will work for any adapter prediction the lab produces, as long
as you drop the predictions into a path the reaper watches.
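For anyone curious how such a watcher behaves: the gist is a poll loop over the watched glob that scores each new `predictions.jsonl` exactly once. This is an illustrative sketch only, not the real `genqual_reaper_v5.sh` (the function name and ledger file are ours, and `echo` stands in for the actual eval call):

```bash
# Score each new predictions.jsonl under $1 once, recording what has
# already been scored in the ledger file $2. A real reaper would run
# this in a loop (e.g. every 30 s) and invoke run_generation_eval.py
# instead of echo.
score_new_predictions() {
  local root="$1" ledger="$2"
  for f in "$root"/*/predictions.jsonl; do
    [ -e "$f" ] || continue                           # glob matched nothing
    grep -qxF "$f" "$ledger" 2>/dev/null && continue  # already scored
    echo "scoring $f"
    echo "$f" >> "$ledger"
  done
}
```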