Upload results/status_20260427_0700utc.md with huggingface_hub
results/status_20260427_0700utc.md
ADDED
@@ -0,0 +1,152 @@
# Pipeline status: 2026-04-27 ~07:00 UTC

Lab pushed 6 commits between 04:40 and 07:17 UTC; H100 has merged them
into `mllm-integrate-server2` (HEAD `f97f054`). Below: the auto-mode
summary, plus the user decisions blocked on confirmation.

## Auto-progress on H100 (no user input needed)

| Item | State |
|---|---|
| **T3 zs_raw bench** (PID 139902) | running, ~02:55 elapsed; tqdm shows N/N (display tick only; the actual `predictions.jsonl` is not yet flushed) |
| **T3 zs_enriched bench** | queued; auto-fires when zs_raw exits |
| **Reaper PID 146066** | alive; auto-scores every fresh `predictions.jsonl` with `eval_t3_oracle.py` + `run_generation_eval.py` (`--device auto` → CPU now, GPU after vLLM frees) |
| **T1 zs_raw oracle metrics** | LANDED (genqual.json on HF; per-cell numbers in `results/zeroshot_results_table_20260427.md`) |
| **T1 zs_enriched oracle metrics** | scoring (CPU; ETA ~25 min more) |
| **T1 reasoning expansion 333/333** | DONE; HF mirrored |
| **Branch sync** | merged lab's 6 commits; pushed `f97f054`; **0 behind, 2 ahead** |

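The reaper's "score every fresh `predictions.jsonl`" behaviour can be sketched roughly as below. This is a minimal illustration, not the actual reaper: the function name `find_unscored`, the directory layout, and the use of a sibling `genqual.json` as the "already scored" marker are all assumptions made for the example.

```python
from pathlib import Path


def find_unscored(root: Path, marker: str = "genqual.json") -> list[Path]:
    """Return predictions.jsonl files that are flushed but not yet scored.

    Assumption (hypothetical layout): each bench run writes its
    predictions.jsonl into its own directory, and scoring drops a
    marker file (here genqual.json) next to it.
    """
    pending = []
    for pred in sorted(root.rglob("predictions.jsonl")):
        if pred.stat().st_size == 0:
            continue  # bench has created the file but not flushed rows yet
        if not (pred.parent / marker).exists():
            pending.append(pred)
    return pending
```

A polling loop would call `find_unscored(Path("results"))` on an interval and hand each returned path to the scoring scripts; the marker check keeps the scan idempotent, so a restart does not re-score finished runs.
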
## Lab updates absorbed

| Commit | What |
|---|---|
| `bb9a5f1` | Lab merged H100's `4149b16 + 694181f + cb91c26 + bda9ee0` into `mllm-integrate` |
| `0b30598` | Per-job TRITON_CACHE_DIR fix; unblocks 226086 (NTv3-8m encoder) re-run |
| `3b40df9` | Lab snapshot 07:17 UTC + the explicit "scancel 226075/076/077?" decision |
| `ff3ab4d` | full_enriched data is the headline; prod_samples is iteration |
| `c4d8981` | second sync of `mllm-integrate-server2` into `mllm-integrate` |
| `db4ac99` | T2 regen v5 runbook: 192 shards (3× speedup) instead of 64 |

## Lab full enriched JSONL: landed on HF

```
explcre/celltype_conditioned_enhancer_generation/data/full_enriched/jsonl/
├── train.enhancer_generation.jsonl   14.2 GB (1,509,379 rows)  ← T1 train
├── test.enhancer_generation.jsonl    3.50 GB (372,210 rows)    ← T1 test (== H100's prod_full_test)
├── train.enhancer_editing.jsonl      ~14 GB  (1,509,379 rows)  ← T3 train
└── test.enhancer_editing.jsonl       3.69 GB (372,210 rows)    ← T3 test
```

T2 still pending (galaxy regen v5 must succeed first; see "Decision 2"
below).

**Key match**: H100's `data/prod_full_test/jsonl/test.enhancer_generation.jsonl`
is **bit-identical** to lab's `data/full_enriched/test.enhancer_generation.jsonl`
(both 372,210 rows, 3.50 GB). H100's existing zs benches are already
on the headline test set.

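A bit-identity claim like this one can be re-checked by comparing file sizes and then SHA-256 digests. The sketch below is one way to do that, not necessarily how the match was originally established; `sha256_of` and `bit_identical` are hypothetical helper names.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_bytes: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so multi-GB JSONL never sits in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_bytes), b""):
            digest.update(chunk)
    return digest.hexdigest()


def bit_identical(path_a, path_b) -> bool:
    """Cheap size check first, then a full digest comparison."""
    if Path(path_a).stat().st_size != Path(path_b).stat().st_size:
        return False
    return sha256_of(path_a) == sha256_of(path_b)
```

Calling `bit_identical(...)` on the two local copies of `test.enhancer_generation.jsonl` should return `True` if the claim holds. Matching row counts and sizes alone would not prove it, which is why the digest step is included.
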
What changes when we use `full_enriched/train.*.jsonl` (1.5 M rows)
vs the legacy `prod_samples/strat7c.n35k.jsonl` (35 k rows): training
sees ~43× more data and takes ~43× longer (a Stage-1 fusion-SFT goes
from ~3 h on H100 to ~5 days). Tradeoff:

* **Lab side**: spending the GPU-days on full_enriched IS the headline.
* **H100 side**: stays on `prod_samples/n35k` for fast iteration cycles
  (one-day end-to-end pipeline, multiple ablations / tier
  comparisons). The two together fill the paper table: H100's
  smaller-N runs serve as "controlled" ablation rows; lab's
  larger-N run is the headline.

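The ~43× and ~5 day figures follow from the row counts, assuming wall-clock scales linearly with training rows at a fixed number of epochs (a simplification that ignores throughput differences between machines). A quick check:

```python
# Row counts and the ~3 h baseline are taken from the report above.
full_rows = 1_509_379      # full_enriched train rows
sample_rows = 35_000       # prod_samples n35k rows
stage1_hours_small = 3.0   # Stage-1 fusion-SFT on n35k (approximate)

# Assumption: training time is proportional to the number of rows.
ratio = full_rows / sample_rows
projected_hours = stage1_hours_small * ratio
projected_days = projected_hours / 24

print(f"data ratio      ~{ratio:.1f}x")
print(f"projected time  ~{projected_hours:.1f} h (~{projected_days:.1f} days)")
```

Under that assumption the ratio comes out at about 43.1× and the projection at about 129.4 h, or roughly 5.4 days, in line with the "~5 days" estimate.
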
## ⚠️ Two decisions blocked on user (per auto-mode rule on destructive actions)

### Decision 1: scancel 226075/076/077 arch ablation?

**Recommend: YES, scancel all three.** Lab is "awaiting user OK" per
their snapshot. Reasoning:

* The three jobs were submitted from `ff3ab4d` (predates `cb91c26`,
  which fixes the unified collator's training-time leak).
* Lab confirmed the leak signature in 226076 (eval=0.036 at step 1000
  matches H100's pre-fix collapse pattern).
* 226075 hit `eval_loss=NaN @ 1500`, almost certainly the same
  leak triggering numeric instability once the assistant span aligns
  with the leaky prompt fragment.
* They're at 28–34 % of one epoch. Sunk cost ≈ 8 h × 3 GPUs.
* Continuing them produces a paper row that **will be invalid**
  (training on cheat-able data), which a reviewer will flag if we
  publish. Better to take the 8 h hit now than the rejection.
* Resubmit on `bb9a5f1` (which has `cb91c26` + `bda9ee0` + Triton
  cache fix). Same recipe, ~3 days for the new epoch.

If the user confirms, lab can `scancel 226075 226076 226077` and
resubmit `slurm/run_unified_arch_ablation.sh` from the new HEAD.

### Decision 2: T2 regen v5 (192 shards on galaxy)

**Recommend: lab launches now.** Lab's runbook
[`docs/t2_regen_runbook.md`](docs/t2_regen_runbook.md) covers it
explicitly: `sbatch regureasoner_loop/slurm/run_t2_regen_enhscan_galaxy.sh`
on lab cluster. PYTHON_BIN is hardcoded in the wrapper, so the v4
silent-exit bug is fixed.

Expected wall-clock: **~2 days for full T2 train+test (~32 GB)**
sharded 192-way; cache-warmed on T1 promoter scans.

Once this lands, it unblocks H100 Stage 3e (T2 reasoning expansion)
and the Stage 1b T2 fusion-SFT bench re-run.

## What's happening on lab cluster regardless of user decisions

| Job | What | Status |
|---|---|---|
| 226049 | T2 pair_aux=none (asym pair) | training 20+ h, **NOT affected** by `cb91c26` (asym pair trainer, not unified collator). Keep running. |
| 226050 | T2 pair_aux=supcon_pair | same |
| 226051 | T2 pair_aux=tier_aware_supcon | same |
| 226057 | SV-GSPO v5 | running 8 h+; the SV-GSPO outcome reward uses the simple `outcome_enhancer_*` scorer, which was BROKEN until `e133cf1`, and lab job 226057 was submitted before `e133cf1` landed. **Note for lab**: the next SV-GSPO run will be on the post-bench-fix code; the existing one gives a "before" baseline. |
| 226086 | NTv3-8m encoder grid | crashed; fix is in `0b30598`; lab to resubmit |
| 226090 | HF upload T1+T3 | DONE |
| 225956 | Enformer oracle (53 h hung) | open question; still in lab's "needs investigation" |

## Total ETA: H100 side (auto-firing chain)

```
Now      T3 zs_raw bench finalising flush                        ~30 min
+30 min  T3 zs_raw genqual + oracle (reaper)                     ~30 min CPU / 5 min GPU
+1 h     T3 zs_enriched bench                                    ~5 h
+6 h     T3 zs_enriched genqual + oracle                         ~30 min
+6.5 h   post_bench_pipeline.sh fires:
           Stage 0/0c  score zs predictions (already done by reaper)
           Stages 1-4  fusion-SFT × {T1,T2,T3,joint} + score_adapter   ~22 h
           Stage 3b    T3 reasoning-only                         ~3 h
           Stage 3c    T3 RFT (multi-turn)                       ~5 h
           Stage 3d    T3 reasoning-expansion (333/day)
           Stage 3e    T2 reasoning-expansion (gated on lab T2 regen)
           Stage 3f    T1 reasoning-expansion (333/day, idempotent)
           Stages 5-6  NTv3-only T1+T2 baselines                 ~4 h
           Stage 7     aggregator + final HF push                minutes
+45 h    headline T1 / T2 / T3 / joint multitask H100 numbers ready
```

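As a sanity check on the chain above, the stated durations can be summed. This assumes the stages run strictly one after another (an assumption; some may overlap) and leaves out stages 3d/3e/3f and 7, which have no hour estimate in the timeline.

```python
# Durations in hours, copied from the timeline above.
pre_pipeline_h = [0.5, 0.5, 5.0, 0.5]  # flush, zs_raw scoring, zs_enriched bench, its scoring
pipeline_h = {
    "Stages 1-4 fusion-SFT": 22.0,
    "Stage 3b T3 reasoning-only": 3.0,
    "Stage 3c T3 RFT": 5.0,
    "Stages 5-6 NTv3-only baselines": 4.0,
}

pipeline_start_h = sum(pre_pipeline_h)                     # when post_bench_pipeline.sh fires
known_end_h = pipeline_start_h + sum(pipeline_h.values())  # end of the stages with estimates
unaccounted_h = 45.0 - known_end_h                         # left for stages without an estimate

print(pipeline_start_h, known_end_h, unaccounted_h)
```

The pre-pipeline steps add up to the stated +6.5 h, and the stages with estimates end at +40.5 h. That leaves about 4.5 h of the +45 h budget for the reasoning-expansion stages and the final aggregation.
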
Lab side, in parallel: Stage 4-arch (after scancel + restart, ~3
days), T2 regen v5 (~2 days), T2 pair_aux ablation (~30 h to
completion), SV-GSPO (open-ended training).

## What I'm NOT doing without explicit OK (auto-mode rule)

* Do not unilaterally `scancel` lab's running jobs.
* Do not spend H100 GPU on duplicating lab's arch ablation (lab is
  doing 3-way arch in parallel; H100 stays on the LLaVA headline
  chain).
* Do not push to `mllm-integrate` (H100 only pushes to `server2`;
  lab merges).

## Next H100-side action (auto)

When the T3 zs_raw bench produces `predictions.jsonl`:
1. Reaper auto-fires `eval_t3_oracle.py` → headline T3 zs_raw oracle
   metrics on the full 372 k test set (the row in Table 1's T3
   column for "Zero-shot LLM").
2. Will push to HF + GitHub commit when the genqual JSON lands.

Standing by. Monitor `biuq50hlx` was timing out; re-arm with
extended timeout below.