# Note to lab: H100-side update v2, 2026-04-27 ~04:00 UTC

## Branch state

* `mllm-integrate-server2` is **13 commits ahead** of
  `mllm-integrate` since the last merge (commit `43682fe`).
* Lab's `mllm-integrate` HEAD has not advanced since `43682fe` (the
  previous merge into main); please pull / merge to pick up the v5 work.

```
e133cf1  SV-GSPO T3 reward fix + post-v5 follow-ups
ffb0c5f  T3 v5 propagated to paper_outline + minimal_publishable_suite
25504fd  T3 multi-turn rejection sampling + clear metrics quickref
3e65c96  T3 solid: post-RFT JSONL → reasoning expansion handoff
bb6704e  Global input sanitiser (label leaks, proxy scores, cell-type expand)
179903c  Reasoning-trace generator (OpenRouter Ling-2.6-1T)
945dc55  adapter→eval bridge: predict_fusion.py + post-bench wiring
183e645  T3 RFT (rejection fine-tuning) - Stage B
46e29d7  H100 results snapshot @ 01:50 UTC
4b03b42  T3 reasoning-only SFT (mask_assistant_dna_span)
b5c9a86  docs: T3 evaluation design + PWM supplementary
af44fa4  T3 oracle-based eval (objective satisfaction)
b2a32be  h100_progress: plan v4-final
```

To pull on lab cluster:

```bash
git fetch origin mllm-integrate-server2
git merge origin/mllm-integrate-server2 -m "merge v5 from H100"
# or if you want a clean history:
git rebase origin/mllm-integrate-server2
```

## Action items, ordered by urgency

### 1. SV-GSPO outcome reward: pull before next RL run (CRITICAL)

`regureasoner/rl/reward_shaper.py:outcome_enhancer_editing` was
**training the agent on the wrong T3 objective** (edit-distance
window in `[1, 60]`). Under v5, the headline T3 metric is *objective
satisfaction* (`within_budget` AND `length_preserved` AND
`target_motif_present`); see `docs/t3_metrics_quickref.md`.

Fixed in commit `e133cf1`. The new reward is the average of three
binary checks aligned with `eval_t3_oracle.py`. **Any in-flight or
upcoming SV-GSPO run on T3 must be on a checkout that includes this
commit**; otherwise the policy is still optimised for the old
edit-distance window rather than the objective-satisfaction metric we
report.

If you have a T3 SV-GSPO run currently queued or training: stop it,
rebase, restart. 56/56 unit tests pass on the new reward.
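
For orientation, the shape of such a reward can be sketched as below. This is not the code in `reward_shaper.py` or `eval_t3_oracle.py`: the check definitions (substitution-only edit counting, plain substring motif matching) and the names `T3Checks`, `t3_checks`, `outcome_reward` and `objective_satisfied` are assumptions made for illustration. Only the structure comes from this note: three binary checks, averaged for the reward and AND-ed for the headline metric.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class T3Checks:
    within_budget: bool
    length_preserved: bool
    target_motif_present: bool


def t3_checks(original: str, edited: str, motif: str, edit_budget: int) -> T3Checks:
    """Evaluate the three binary checks for one edited enhancer (illustrative definitions)."""
    length_preserved = len(edited) == len(original)
    # Assumption: edits are counted as substitutions (Hamming distance).
    # A length change is treated as over budget, since Hamming distance is undefined there.
    if length_preserved:
        n_edits = sum(a != b for a, b in zip(original, edited))
        within_budget = n_edits <= edit_budget
    else:
        within_budget = False
    # Assumption: motif presence is an exact substring match.
    return T3Checks(within_budget, length_preserved, motif in edited)


def outcome_reward(checks: T3Checks) -> float:
    """Reward: mean of the three binary checks, so one of 0, 1/3, 2/3, 1."""
    passed = checks.within_budget + checks.length_preserved + checks.target_motif_present
    return passed / 3.0


def objective_satisfied(checks: T3Checks) -> bool:
    """Headline metric: all three checks must hold at once."""
    return checks.within_budget and checks.length_preserved and checks.target_motif_present
```

The averaged form gives partial credit (for example 2/3 when only the budget is exceeded), while the headline metric stays strict, so reward and metric move in the same direction without being identical.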

### 2. T2 enriched dataset regen: needs galaxy CPU (BLOCKER on T2 quality)

The current prod T2 enriched JSONL has a TFBS scan only on the
**promoter**; the enhancer side gets only a GC% line. That defeats
T2's premise: the model can't reason about shared TFBS hits if only
one side is scanned.

The fix already exists in `tools/pe_grounding_tools.py:_template_tfbs_sequence_names_for_example`,
which returns `["input_promoter", "input_enhancer"]` for
T2; it just wasn't run on the prod data.
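
To make the failure mode concrete, here is a minimal sketch of why a one-sided scan leaves nothing to reason about. The function `shared_tfbs` and the TF names in the usage note are hypothetical; only the sequence names `input_promoter` and `input_enhancer` come from the repo.

```python
from __future__ import annotations


def shared_tfbs(hits_by_sequence: dict[str, list[str]]) -> set[str]:
    """Return TF names with at least one hit on both the promoter and the enhancer.

    `hits_by_sequence` maps a sequence name to the TF names found by the scan.
    A sequence that was never scanned is simply missing from the mapping.
    """
    promoter = set(hits_by_sequence.get("input_promoter", []))
    enhancer = set(hits_by_sequence.get("input_enhancer", []))
    return promoter & enhancer
```

With only the promoter scanned, for example `{"input_promoter": ["GATA1", "SP1"]}`, the intersection is always empty regardless of what the enhancer contains. Once both sides are present, shared factors become visible to the model.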

Launcher committed at
`regureasoner_loop/slurm/regen_t2_enriched_with_enhancer_scan.sh`.
It drives the parent's `PEDatasetReasoningPipeline` in
`template_tools` mode (no LLM, disk-cached FIMO).

H100 can't run this (raw CSVs + compiled FIMO live on lab cluster
only). Suggested galaxy invocation (CPU-rich, ~8 cores):

```bash
cd /home/pengchx3/text-dna/biomodel_reasoning_calling_study2
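git fetch origin mllm-integrate-server2      # refresh the remote branch first
git checkout origin/mllm-integrate-server2   # detached HEAD on the v5 commits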
for i in $(seq 0 7); do
  SHARD_INDEX=$i NUM_SHARDS=8 \
    bash regureasoner_loop/slurm/regen_t2_enriched_with_enhancer_scan.sh &
done
wait

# Output:
#   /dev/shm/dnathinker/data/t2_regen_enhancer_scan/jsonl/{train,test}.pair_prediction.jsonl
# Push to HF when done so H100 can pick it up:
python regureasoner_loop/scripts/sync_checkpoints.py \
    --src /dev/shm/dnathinker/data/t2_regen_enhancer_scan/jsonl \
    --dest data/prod_full_test_v2_enhancer_scan/jsonl \
    --repo-id explcre/dnathinker-checkpoints
```

After that lands on HF, the H100 will rebench T2 with proper enhancer
TFBS context. ETA on galaxy: ~8h sharded (~744k rows / 8 shards ≈ 93k
rows per shard, which implies an average of roughly 0.3 s per row).
A second pass hits the disk cache.

### 3. T3 RFT-from-joint ablation: extra Table 3 row (NICE-TO-HAVE)

The current pipeline runs T3 RFT against the Stage-3 (T3-only)
adapter. A worthwhile ablation is to run RFT against the Stage-4 joint
adapter instead: does the joint-trained generator produce candidates
with higher mean objective margin, or do format artefacts dominate? It
is a one-flag change:

```bash
STAGE_4=runs/exp_joint_multitask_${STAMP}/final/pytorch_model.bin
python regureasoner_loop/scripts/rft_t3.py \
    --adapter-state-dict $STAGE_4 \
    --train-jsonl data/prod_samples/train.enhancer_editing.strat7c.n35k.jsonl \
    --oracle-path runs/exp_oracle_ds_7cell_min/oracle.pt \
    --output-jsonl runs/exp_t3_rft_from_joint_${STAMP}/rft_filtered_train.jsonl \
    --candidates 4 --rounds 4 --temp-ramp 0.15
# Re-train T3 fusion-SFT on the result for the ablation row.
```

Cost: ~6h serial after Stage 4 (joint multitask) finishes on H100.
If the lab has a spare GPU, this one is yours.

Detail in `docs/t3_post_v5_followups.md` §1.
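
For readers new to rejection fine-tuning, the general loop looks roughly like the sketch below. It is a generic illustration, not a description of `rft_t3.py`: reading `--temp-ramp` as a per-round temperature increment, the `base_temp` default, the early stop, and the `generate` / `accept` callables are all assumptions.

```python
from typing import Callable, List


def rejection_sample(
    prompt: str,
    generate: Callable[[str, float], str],
    accept: Callable[[str], bool],
    candidates: int = 4,
    rounds: int = 4,
    base_temp: float = 0.7,   # assumed starting temperature
    temp_ramp: float = 0.15,  # assumed per-round temperature increment
) -> List[str]:
    """Sample candidates round by round and keep only those the filter accepts."""
    kept: List[str] = []
    for round_idx in range(rounds):
        temperature = base_temp + round_idx * temp_ramp
        for _ in range(candidates):
            candidate = generate(prompt, temperature)
            if accept(candidate):
                kept.append(candidate)
        if kept:
            # Assumption: stop at the first round that yields a survivor.
            break
    return kept
```

In the ablation, `generate` would be backed by the Stage-4 joint adapter rather than the Stage-3 one, and `accept` by the oracle-based objective check. The surviving candidates form the filtered training JSONL.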

### 4. Loop-SFT for T3: swap data source (NICE-TO-HAVE)

No code change. The T3 trajectory dataset for Loop-SFT should source
from the post-RFT JSONL (oracle-validated candidates) instead of the
heuristic gold:

```bash
python regureasoner_loop/scripts/expand_loop_trajectories.py \
    --source runs/exp_t3_fusion_sft_${STAMP}/rft_filtered_train.jsonl \
    --out    data/trajectories/train.enhancer_editing.rft.jsonl
TASK=enhancer_editing \
TRAIN_JSONL=data/trajectories/train.enhancer_editing.rft.jsonl \
... \
bash regureasoner_loop/slurm/run_train_loop_sft.sh
```

This one is lab side, since H100 doesn't have the OpenRouter
throughput for trajectory expansion at the 35k-row scale (the free
tier allows 1000 requests/day per key: fine for 333-row reasoning
ablations, not for full Loop-SFT data).

### 5. External baselines for paper headline: TACO + HyenaDNA (CRITICAL)

The paper currently has only internal baselines (zero-shot LLM,
fusion-SFT, our NTv3-direct). Reviewers will ask "where's the SOTA
comparison?". Two must-add baselines:

* **TACO** (Lin et al. NeurIPS 2024): T3 paper precedent. Their repo
  is public; drop in our DeepSTARR-7cell oracle, run their trainer on
  our T3 train split, eval with `eval_t3_oracle.py`. ~1 day.
* **HyenaDNA** (Nguyen et al. NeurIPS 2023): T2 fluency baseline.
  Already wired as encoder in our stack; needs head training only.
  ~1 day.

Lab side because both need cluster GPUs.

Detail + concrete recipes in `docs/t3_post_v5_followups.md` §5.

### 6. Pull from HF: new artifacts available

H100 just pushed (04:00 UTC):

```
data/reasoning_traces/train.enhancer_generation.reasoning.jsonl   (in flight, ~62 rows so far, target 333)
data/reasoning_traces/smoke_5rows_{t1,t2,t3}_postsanitize.jsonl   (per-task quality samples for inspection)
data/reasoning_traces/post_rft_contract_fixture.jsonl              (synthetic post-RFT row used in unit test)
data/reasoning_traces/post_rft_smoke.jsonl                          (real OpenRouter rationale on synthetic post-RFT input)
```

Repo: `explcre/dnathinker-checkpoints`. Inspect the smoke files to
verify rationale quality + sanitiser correctness before wider rollout.

### 7. Reasoning-trace daily-loop: coordinate API keys

The 1000 req/day OpenRouter free-tier cap means **one key drives
~333 rows/task/day**. With the user's primary key on H100 we'll
build T1/T2/T3 reasoning at ~1k rows/day combined.
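
The quota arithmetic behind these numbers is small enough to keep next to the plan. This sketch assumes one request per row, which is what the ~1k rows/day figure implies; the helper names are illustrative.

```python
import math


def rows_per_task_per_day(daily_cap: int, num_tasks: int, requests_per_row: int = 1) -> int:
    """Rows each task gets per day when one key is split evenly across tasks."""
    return daily_cap // (num_tasks * requests_per_row)


def days_to_build(rows: int, daily_cap: int, num_keys: int = 1, requests_per_row: int = 1) -> int:
    """Whole days needed to produce `rows` rows with `num_keys` independent keys."""
    return math.ceil(rows * requests_per_row / (daily_cap * num_keys))


if __name__ == "__main__":
    print(rows_per_task_per_day(1000, 3))        # one key split across T1/T2/T3
    print(days_to_build(35_000, 1000))           # 35k-row expansion on a single key
    print(days_to_build(35_000, 1000, num_keys=4))
```

A single key covers the 333-row ablations in a day, whereas a 35k-row job on one key takes over a month, which is why extra keys matter.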

If you have spare OpenRouter accounts, run:

```bash
OPENROUTER_API_KEY=<lab key> bash regureasoner_loop/slurm/build_reasoning_traces_loop.sh --daemon
```

on a CPU box (zero GPU). Each shard is a separate run; the script
auto-resumes by id, so multiple boxes running with different keys
won't overlap if they share an output JSONL.
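
Resume-by-id usually amounts to reading the ids already present in the output file and skipping them. The sketch below shows that pattern under stated assumptions; it is not the logic in `build_reasoning_traces_loop.sh`, and the `id` field name and function names are hypothetical.

```python
import json
from pathlib import Path
from typing import Dict, List, Set


def completed_ids(output_jsonl: Path, id_field: str = "id") -> Set[str]:
    """Collect ids of rows already written to the output JSONL."""
    done: Set[str] = set()
    if not output_jsonl.exists():
        return done
    with output_jsonl.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                done.add(str(json.loads(line)[id_field]))
            except (json.JSONDecodeError, KeyError):
                # Skip a partial line left behind by an interrupted write.
                continue
    return done


def pending_rows(rows: List[Dict], output_jsonl: Path, id_field: str = "id") -> List[Dict]:
    """Return the input rows whose id is not yet in the output file."""
    done = completed_ids(output_jsonl, id_field)
    return [row for row in rows if str(row[id_field]) not in done]
```

In this sketch the check happens once at start-up, so two boxes launched at the same moment could still pick the same rows. Staggering the starts, or giving each box its own slice of ids, avoids spending quota twice.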

## What's running on H100 right now

```
PID 121129  vLLM bench T2 zs_enriched (full 744k, ~3.5h in, ETA ~30 min)
            queued: T3 zs_raw, T3 zs_enriched (~5h each)
PID 137805  build_reasoning_traces.py T1 333-sample run (62/333 at 04:00 UTC)
PID 100544  watcher → post_bench_pipeline.sh (idle until orchestrator exits)
```

ETA full chain: ~36h after bench grid finishes.

## Pipeline state: no jobs need killing

* Bench grid: vLLM zero-shot inference; T3 zs eval reads only metadata
  (target_motif, edit_budget), so there is no v5-framework leakage. Safe.
* No fusion-SFT or RL job currently training; Stages 1–4 fire only
  after bench grid completes, at which point they pick up multi-turn
  RFT (commit `25504fd`) + Stage 3d post-RFT reasoning (commit
  `3e65c96`) automatically.

## Suggested coordination

Lab actions, in priority order:

1. **Pull `mllm-integrate-server2`** (or merge into `mllm-integrate`).
2. **Stop any in-flight T3 SV-GSPO run** if it predates `e133cf1`:
   the reward function was wrong; restart with the new commit.
3. **Galaxy: T2 enhancer-scan regen** (background, ~8h sharded);
   it blocks the headline T2 numbers.
4. **TACO + HyenaDNA baselines** in parallel.
5. **RFT-from-joint ablation** + **Loop-SFT-on-RFT** as second-tier
   ablations once Stage 4 lands.

Reach out on the shared channel if any of these conflict with
in-flight work.

-- H100 side