
Note to lab — H100-side update v2, 2026-04-27 ~04:00 UTC

Branch state

  • mllm-integrate-server2 is 13 commits ahead of mllm-integrate since the last merge (commit 43682fe).
  • Lab's mllm-integrate HEAD has not advanced since 43682fe (the previous merge into main); please pull / merge to pick up the v5 work.

Commits on mllm-integrate-server2 since 43682fe, newest first:
e133cf1  SV-GSPO T3 reward fix + post-v5 follow-ups
ffb0c5f  T3 v5 propagated to paper_outline + minimal_publishable_suite
25504fd  T3 multi-turn rejection sampling + clear metrics quickref
3e65c96  T3 solid: post-RFT JSONL → reasoning expansion handoff
bb6704e  Global input sanitiser (label leaks, proxy scores, cell-type expand)
179903c  Reasoning-trace generator (OpenRouter Ling-2.6-1T)
945dc55  adapter→eval bridge: predict_fusion.py + post-bench wiring
183e645  T3 RFT (rejection fine-tuning) — Stage B
46e29d7  H100 results snapshot @ 01:50 UTC
4b03b42  T3 reasoning-only SFT (mask_assistant_dna_span)
b5c9a86  docs: T3 evaluation design + PWM supplementary
af44fa4  T3 oracle-based eval (objective satisfaction)
b2a32be  h100_progress: plan v4-final

To pull on lab cluster:

git fetch origin mllm-integrate-server2
git merge origin/mllm-integrate-server2 -m "merge v5 from H100"
# or if you want a clean history:
git rebase origin/mllm-integrate-server2

Action items, ordered by urgency

1. SV-GSPO outcome reward — pull before next RL run (CRITICAL)

regureasoner/rl/reward_shaper.py:outcome_enhancer_editing was training the agent on the wrong T3 objective (edit-distance window in [1, 60]). Under v5, the headline T3 metric is objective satisfaction (within_budget AND length_preserved AND target_motif_present) — see docs/t3_metrics_quickref.md.

Fixed in commit e133cf1. New reward = average of three binary checks aligned with eval_t3_oracle.py. Any in-flight or upcoming SV-GSPO run on T3 must be on a checkout that includes this commit — otherwise the headline number suffers.

If you have a T3 SV-GSPO run currently queued or training: stop it, rebase, restart. 56/56 unit tests pass on the new reward.
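For concreteness, a minimal sketch of the v5 outcome reward as described above: the mean of the three binary checks. The function name, argument names, and the Hamming-style distance are illustrative stand-ins, not the exact code in regureasoner/rl/reward_shaper.py or eval_t3_oracle.py.

```python
def t3_outcome_reward(edited: str, original: str,
                      target_motif: str, edit_budget: int) -> float:
    """Mean of three binary objective-satisfaction checks (sketch).

    Hamming distance stands in for the real edit-distance check,
    which may differ in the actual oracle.
    """
    length_preserved = len(edited) == len(original)
    within_budget = (
        length_preserved
        and sum(a != b for a, b in zip(edited, original)) <= edit_budget
    )
    target_motif_present = target_motif in edited
    return (within_budget + length_preserved + target_motif_present) / 3.0
```

A candidate satisfying all three checks scores 1.0; the old edit-distance-window reward could pay full marks to candidates that break length or drop the target motif.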

2. T2 enriched dataset regen — needs galaxy CPU (BLOCKER on T2 quality)

The current prod T2 enriched JSONL only has TFBS scan on the promoter; the enhancer side gets only a GC% line. That defeats T2's premise — the model can't reason about shared TFBS hits if only one side is scanned.

Fix exists in tools/pe_grounding_tools.py:_template_tfbs_sequence_names_for_example, which already returns ["input_promoter", "input_enhancer"] for T2 — it just wasn't run on the prod data.

Launcher committed at regureasoner_loop/slurm/regen_t2_enriched_with_enhancer_scan.sh. Drives the parent's PEDatasetReasoningPipeline in template_tools mode (no LLM, disk-cached fimo).

H100 can't run this (raw CSVs + compiled FIMO live on lab cluster only). Suggested galaxy invocation (CPU-rich, ~8 cores):

cd /home/pengchx3/text-dna/biomodel_reasoning_calling_study2
git checkout origin/mllm-integrate-server2
for i in $(seq 0 7); do
  SHARD_INDEX=$i NUM_SHARDS=8 \
    bash regureasoner_loop/slurm/regen_t2_enriched_with_enhancer_scan.sh &
done
wait

# Output:
#   /dev/shm/dnathinker/data/t2_regen_enhancer_scan/jsonl/{train,test}.pair_prediction.jsonl
# Push to HF when done so H100 can pick it up:
python regureasoner_loop/scripts/sync_checkpoints.py \
    --src /dev/shm/dnathinker/data/t2_regen_enhancer_scan/jsonl \
    --dest data/prod_full_test_v2_enhancer_scan/jsonl \
    --repo-id explcre/dnathinker-checkpoints

After that lands on HF, the H100 will rebench T2 with proper enhancer TFBS context. ETA on galaxy: ~8h sharded (~744k rows / 8 shards × ~0.3 s per row average). Cached on second pass.
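If it helps sanity-check the sharded launch, here is the shard-selection convention the SHARD_INDEX / NUM_SHARDS env vars presumably encode (assumed round-robin by row index; check the launcher script for the actual rule):

```python
import os

def shard_rows(rows, shard_index=0, num_shards=1):
    """Round-robin shard filter: row i belongs to shard i % num_shards.
    Env vars override the defaults, matching the launch loop above."""
    shard_index = int(os.environ.get("SHARD_INDEX", shard_index))
    num_shards = int(os.environ.get("NUM_SHARDS", num_shards))
    return [row for i, row in enumerate(rows) if i % num_shards == shard_index]
```

With ~744k rows and NUM_SHARDS=8, each shard sees ~93k rows, and the shards partition the input with no overlap.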

3. T3 RFT-from-joint ablation — extra Table 3 row (NICE-TO-HAVE)

The current pipeline runs T3 RFT against the Stage-3 (T3-only) adapter. A worthwhile ablation: run RFT against the Stage-4 joint adapter — does the joint-trained generator produce candidates with higher mean objective margin, or do format artefacts dominate? One flag change:

STAGE_4=runs/exp_joint_multitask_${STAMP}/final/pytorch_model.bin
python regureasoner_loop/scripts/rft_t3.py \
    --adapter-state-dict $STAGE_4 \
    --train-jsonl data/prod_samples/train.enhancer_editing.strat7c.n35k.jsonl \
    --oracle-path runs/exp_oracle_ds_7cell_min/oracle.pt \
    --output-jsonl runs/exp_t3_rft_from_joint_${STAMP}/rft_filtered_train.jsonl \
    --candidates 4 --rounds 4 --temp-ramp 0.15
# Re-train T3 fusion-SFT on the result for the ablation row.
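For readers new to the RFT stage, a hedged sketch of what --candidates, --rounds, and --temp-ramp plausibly control in rft_t3.py. generate and oracle_accepts are hypothetical stand-ins for the model call and the oracle check, and base_temp is an assumed default; the real script will differ in detail.

```python
def rejection_sample(prompt, generate, oracle_accepts,
                     candidates=4, rounds=4, temp_ramp=0.15, base_temp=0.7):
    """Per round, draw `candidates` samples at a temperature that ramps
    up by `temp_ramp` each round, keeping only oracle-accepted ones."""
    kept = []
    for r in range(rounds):
        temperature = base_temp + r * temp_ramp  # hotter rounds, more diversity
        for _ in range(candidates):
            cand = generate(prompt, temperature=temperature)
            if oracle_accepts(cand):
                kept.append(cand)
    return kept
```

The ramp trades off early exploitation (low-temperature samples near the model's mode) against later exploration when the oracle keeps rejecting.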

Cost: ~6h serial after Stage 4 (joint multitask) finishes on H100. If the lab has a spare GPU, this one is yours.

Detail in docs/t3_post_v5_followups.md §1.

4. Loop-SFT for T3 — swap data source (NICE-TO-HAVE)

No code change. The T3 trajectory dataset for Loop-SFT should source from the post-RFT JSONL (oracle-validated candidates) instead of the heuristic gold:

python regureasoner_loop/scripts/expand_loop_trajectories.py \
    --source runs/exp_t3_fusion_sft_${STAMP}/rft_filtered_train.jsonl \
    --out    data/trajectories/train.enhancer_editing.rft.jsonl
TASK=enhancer_editing \
TRAIN_JSONL=data/trajectories/train.enhancer_editing.rft.jsonl \
... \
bash regureasoner_loop/slurm/run_train_loop_sft.sh

This is lab-side work: H100 doesn't have the OpenRouter throughput for trajectory expansion at the 35k-row scale (the free tier is 1000 requests/day per key — fine for 333-row reasoning ablations, not for full Loop-SFT data).
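The throughput arithmetic behind that, assuming one API request per row and three tasks sharing a key:

```python
FREE_TIER_REQS_PER_DAY = 1000   # OpenRouter free-tier cap per key (see above)
NUM_TASKS = 3                   # T1 / T2 / T3 share one key
LOOP_SFT_ROWS = 35_000          # full Loop-SFT trajectory target for T3

rows_per_task_per_day = FREE_TIER_REQS_PER_DAY // NUM_TASKS  # 333
days_dedicated_key = LOOP_SFT_ROWS / FREE_TIER_REQS_PER_DAY  # 35 days
days_shared_key = LOOP_SFT_ROWS / rows_per_task_per_day      # ~105 days
```

Even a key dedicated entirely to T3 needs ~35 days for the full 35k rows, which is why trajectory expansion at that scale has to run lab-side.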

5. External baselines for paper headline — TACO + HyenaDNA (CRITICAL)

The paper currently has only internal baselines (zero-shot LLM, fusion-SFT, our NTv3-direct). Reviewers will ask "where's the SOTA comparison?". Two must-add baselines:

  • TACO (Lin et al. NeurIPS 2024) — T3 paper precedent. Their repo is public; drop in our DeepSTARR-7cell oracle, run their trainer on our T3 train split, eval with eval_t3_oracle.py. ~1 day.
  • HyenaDNA (Nguyen et al. NeurIPS 2023) — T2 fluency baseline. Already wired as encoder in our stack; needs head training only. ~1 day.

Lab side because both need cluster GPUs.

Detail + concrete recipes in docs/t3_post_v5_followups.md §5.

6. Pull from HF — new artifacts available

H100 just pushed (4:00 UTC):

data/reasoning_traces/train.enhancer_generation.reasoning.jsonl   (in flight, ~62 rows so far, target 333)
data/reasoning_traces/smoke_5rows_{t1,t2,t3}_postsanitize.jsonl   (per-task quality samples for inspection)
data/reasoning_traces/post_rft_contract_fixture.jsonl              (synthetic post-RFT row used in unit test)
data/reasoning_traces/post_rft_smoke.jsonl                          (real OpenRouter rationale on synthetic post-RFT input)

Repo: explcre/dnathinker-checkpoints. Inspect the smoke files to verify rationale quality + sanitiser correctness before wider rollout.

7. Reasoning-trace daily-loop — coordinate API keys

The 1000 req/day OpenRouter free-tier cap means one key drives ~333 rows/task/day. With the user's primary key on H100 we'll build T1/T2/T3 reasoning at ~1k rows/day combined.

If you have spare OpenRouter accounts, run:

OPENROUTER_API_KEY=<lab key> bash regureasoner_loop/slurm/build_reasoning_traces_loop.sh --daemon

on a CPU box (zero GPU). Each shard is a separate run; the script auto-resumes by id, so multiple boxes running with different keys won't duplicate work as long as they share an output JSONL.
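A sketch of the assumed resume-by-id behaviour (the actual logic lives in build_reasoning_traces_loop.sh and whatever it drives; the field name "id" and function name here are illustrative):

```python
import json

def pending_rows(input_rows, output_jsonl_path):
    """Skip rows whose 'id' already appears in the shared output JSONL,
    so concurrent runs with different API keys don't redo finished work."""
    done = set()
    try:
        with open(output_jsonl_path) as f:
            done = {json.loads(line)["id"] for line in f if line.strip()}
    except FileNotFoundError:
        pass  # first run: nothing completed yet
    return [row for row in input_rows if row["id"] not in done]
```

Note this is only overlap-safe if writers append whole lines atomically to the shared JSONL; if two boxes grab the same pending row simultaneously, you get a duplicate, not corruption.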

What's running on H100 right now

PID 121129  vLLM bench T2 zs_enriched (full 744k, ~3.5h in, ETA ~30 min)
            queued: T3 zs_raw, T3 zs_enriched (~5h each)
PID 137805  build_reasoning_traces.py T1 333-sample run (62/333 at 04:00 UTC)
PID 100544  watcher → post_bench_pipeline.sh (idle until orchestrator exits)

ETA full chain: ~36h after bench grid finishes.

Pipeline state — no jobs need killing

  • Bench grid: vLLM zero-shot inference; T3 zs eval reads only metadata (target_motif, edit_budget) — no v5-framework leakage. Safe.
  • No fusion-SFT or RL job currently training; Stages 1–4 fire only after bench grid completes, at which point they pick up multi-turn RFT (commit 25504fd) + Stage 3d post-RFT reasoning (commit 3e65c96) automatically.

Suggested coordination

Lab actions, in priority order:

  1. Pull mllm-integrate-server2 (or merge into mllm-integrate).
  2. Stop any in-flight T3 SV-GSPO run if it predates e133cf1 — the reward function was wrong; restart with the new commit.
  3. Galaxy: T2 enhancer-scan regen (background, ~8h sharded) — blocks the headline T2 numbers.
  4. TACO + HyenaDNA baselines in parallel.
  5. RFT-from-joint ablation + Loop-SFT-on-RFT as second-tier ablations once Stage 4 lands.

Reach out on the shared channel if any of these conflict with in-flight work.

— H100 side