explcre/dnathinker-checkpoints


explcre committed on Apr 27

Commit a233b8a (verified) · 1 Parent(s): 67bf754

Upload results/status_20260427_0700utc.md with huggingface_hub


Files changed (1)

results/status_20260427_0700utc.md +152 -0

results/status_20260427_0700utc.md ADDED

	@@ -0,0 +1,152 @@

+# Pipeline status — 2026-04-27 ~07:00 UTC
+Lab pushed 6 commits between 04:40 and 07:17 UTC; H100 has merged them
+into `mllm-integrate-server2` (HEAD `f97f054`). This report summarizes
+auto-mode progress and lists the user decisions still blocked on confirmation.
+## Auto-progress on H100 (no user input needed)
+| Item | State |
+|---|---|
+| **T3 zs_raw bench** (PID 139902) | running, ~02:55 elapsed; tqdm shows N/N (display tick — actual `predictions.jsonl` not yet flushed) |
+| **T3 zs_enriched bench** | queued; auto-fires when zs_raw exits |
+| **Reaper PID 146066** | alive; auto-scores every fresh `predictions.jsonl` with `eval_t3_oracle.py` + `run_generation_eval.py` (`--device auto` → CPU now, GPU after vLLM frees) |
+| **T1 zs_raw oracle metrics** | LANDED (genqual.json on HF; per-cell numbers in `results/zeroshot_results_table_20260427.md`) |
+| **T1 zs_enriched oracle metrics** | scoring (CPU; ETA ~25 min more) |
+| **T1 reasoning expansion 333/333** | DONE; HF mirrored |
+| **Branch sync** | merged lab's 6 commits; pushed `f97f054`; **0 behind, 2 ahead** |
+## Lab updates absorbed
+| Commit | What |
+|---|---|
+| `bb9a5f1` | Lab merged H100's `4149b16 + 694181f + cb91c26 + bda9ee0` into `mllm-integrate` |
+| `0b30598` | Per-job TRITON_CACHE_DIR fix — unblocks 226086 (NTv3-8m encoder) re-run |
+| `3b40df9` | Lab snapshot 07:17 UTC + the explicit "scancel 226075/076/077?" decision |
+| `ff3ab4d` | full_enriched data is the headline; prod_samples is iteration |
+| `c4d8981` | second sync of `mllm-integrate-server2` into `mllm-integrate` |
+| `db4ac99` | T2 regen v5 runbook — 192 shards (3× speedup) instead of 64 |
+## Lab full enriched JSONL — landed on HF
+```
+explcre/celltype_conditioned_enhancer_generation/data/full_enriched/jsonl/
+├── train.enhancer_generation.jsonl   14.2 GB   (1,509,379 rows)   ← T1 train
+├── test.enhancer_generation.jsonl     3.50 GB    (372,210 rows)   ← T1 test (== H100's prod_full_test)
+├── train.enhancer_editing.jsonl      ~14 GB    (1,509,379 rows)   ← T3 train
+└── test.enhancer_editing.jsonl        3.69 GB    (372,210 rows)   ← T3 test
+```
+T2 still pending (galaxy regen v5 must succeed first; see "Decision 2"
+below).
+**Key match**: H100's `data/prod_full_test/jsonl/test.enhancer_generation.jsonl`
+is **bit-identical** to lab's `data/full_enriched/jsonl/test.enhancer_generation.jsonl`
+(both 372,210 rows, 3.50 GB). H100's existing zs benches are already
+on the headline test set.
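Matching row counts and sizes are suggestive but not conclusive, so a streamed checksum comparison is one way to confirm that two JSONL files really are bit-identical. The following is a minimal sketch, not the check that was actually run: the helper names `sha256_of` and `bit_identical` are hypothetical, and the demo uses small throwaway files rather than the 3.50 GB test set.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_bytes: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so a multi-GB JSONL never sits in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_bytes), b""):
            digest.update(chunk)
    return digest.hexdigest()


def bit_identical(a: Path, b: Path) -> bool:
    """Cheap size comparison first, then compare full-content digests."""
    if a.stat().st_size != b.stat().st_size:
        return False
    return sha256_of(a) == sha256_of(b)


# Demo on throwaway files (hypothetical names, not the real datasets).
with tempfile.TemporaryDirectory() as tmp:
    left = Path(tmp) / "left.jsonl"
    right = Path(tmp) / "right.jsonl"
    other = Path(tmp) / "other.jsonl"
    left.write_bytes(b'{"id": 1}\n{"id": 2}\n')
    right.write_bytes(b'{"id": 1}\n{"id": 2}\n')
    other.write_bytes(b'{"id": 1}\n{"id": 3}\n')
    same = bit_identical(left, right)
    different = bit_identical(left, other)
```

In practice the two paths would point at the H100 and lab copies of `test.enhancer_generation.jsonl`; the size check short-circuits the expensive hashing when the files obviously differ.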
+What changes when we use `full_enriched/train.*.jsonl` (1.5 M rows)
+vs the legacy `prod_samples/strat7c.n35k.jsonl` (35 k rows): training
+sees ~43× more data, takes ~43× longer (a Stage-1 fusion-SFT goes
+from ~3 h on H100 to ~5 days). Tradeoff:
+* **Lab side**: spending the GPU-days on full_enriched IS the headline.
+* **H100 side**: stays on `prod_samples/n35k` for fast iteration cycles
+  (one-day end-to-end pipeline, multiple ablations / tier
+  comparisons). The two together fill the paper table — H100's
+  smaller-N runs serve as "controlled" ablation rows; lab's
+  larger-N run is the headline.
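The ~43× and ~5 day figures follow from a single ratio. The sketch below reproduces the arithmetic under two assumptions that are not stated in the report: that `strat7c.n35k` contains exactly 35,000 rows, and that wall-clock time scales linearly with row count (same epochs, same throughput).

```python
FULL_TRAIN_ROWS = 1_509_379  # full_enriched train rows, from the HF listing
SAMPLE_ROWS = 35_000         # assumption: "35 k rows" taken as exactly 35,000
SAMPLE_HOURS = 3.0           # ~3 h Stage-1 fusion-SFT on H100

# Assumption: training time is proportional to the number of rows.
scale = FULL_TRAIN_ROWS / SAMPLE_ROWS
full_hours = SAMPLE_HOURS * scale
full_days = full_hours / 24

print(f"{scale:.1f}x data -> ~{full_hours:.0f} h (~{full_days:.1f} days)")
# prints: 43.1x data -> ~129 h (~5.4 days)
```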
+## ⚠️ Two decisions blocked on user (per auto-mode rule on destructive actions)
+### Decision 1 — scancel 226075/076/077 arch ablation?
+**Recommend: YES, scancel all three.** Lab is "awaiting user OK" per
+their snapshot. Reasoning:
+* The three jobs were submitted from `ff3ab4d`, which predates `cb91c26`
+  (the fix for the unified collator's training-time leak).
+* Lab confirmed the leak signature in 226076 (eval=0.036 at step 1000
+  matches H100's pre-fix collapse pattern).
+* 226075 hit `eval_loss=NaN @ 1500` — almost certainly the same
+  leak triggering numeric instability once the assistant span aligns
+  with the leaky prompt fragment.
+* They're at 28–34 % of one epoch. Sunk cost ≈ 8 h × 3 GPUs.
+* Continuing them produces a paper row that **will be invalid**
+  (training on cheat-able data), which a reviewer will flag if we
+  publish. Better to take the 8h hit now than the rejection.
+* Resubmit on `bb9a5f1` (which has `cb91c26` + `bda9ee0` + Triton
+  cache fix). Same recipe, ~3 days for the new epoch.
+If the user confirms, lab can `scancel 226075 226076 226077` and
+resubmit `slurm/run_unified_arch_ablation.sh` from the new HEAD.
+### Decision 2 — T2 regen v5 (192 shards on galaxy)
+**Recommend: lab launches now.** Lab's runbook
+[`docs/t2_regen_runbook.md`](docs/t2_regen_runbook.md) covers it
+explicitly: `sbatch regureasoner_loop/slurm/run_t2_regen_enhscan_galaxy.sh`
+on lab cluster. PYTHON_BIN is hardcoded in the wrapper, so the v4
+silent-exit bug is fixed.
+Expected wall-clock: **~2 days for full T2 train+test (~32 GB)**
+sharded 192-way; cache-warmed on T1 promoter scans.
+When this lands → unblocks H100 Stage 3e (T2 reasoning expansion)
+and Stage 1b T2 fusion-SFT bench-rerun.
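The 3× figure for 192 shards versus 64 is the ideal case in which every shard runs concurrently and work splits evenly. The sketch below illustrates that reasoning only; `shard_bounds` is a hypothetical helper, not the runbook's partitioning code, and `N_ITEMS` is an illustrative workload size because the real T2 item count is not given here.

```python
from typing import Tuple


def shard_bounds(n_items: int, n_shards: int, shard_id: int) -> Tuple[int, int]:
    """Half-open [start, end) range for one shard; shard sizes differ by at most 1."""
    base, extra = divmod(n_items, n_shards)
    start = shard_id * base + min(shard_id, extra)
    end = start + base + (1 if shard_id < extra else 0)
    return start, end


N_ITEMS = 1_000_000  # illustrative only; the real T2 workload size is not stated

old = [shard_bounds(N_ITEMS, 64, i) for i in range(64)]
new = [shard_bounds(N_ITEMS, 192, i) for i in range(192)]

# With all shards running at once, wall-clock is set by the largest shard.
largest_old = max(end - start for start, end in old)
largest_new = max(end - start for start, end in new)
ideal_speedup = largest_old / largest_new
```

The ideal speedup approaches 192 / 64 = 3 and only holds if the cluster can schedule all 192 shards at the same time; queueing or uneven per-item cost would reduce it.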
+## What's happening on lab cluster regardless of user decisions
+| Job | What | Status |
+|---|---|---|
+| 226049 | T2 pair_aux=none (asym pair) | training 20+h, **NOT affected** by `cb91c26` (asym pair trainer, not unified collator). Keep running. |
+| 226050 | T2 pair_aux=supcon_pair | same |
+| 226051 | T2 pair_aux=tier_aware_supcon | same |
+| 226057 | SV-GSPO v5 | running 8h+; its SV-GSPO outcome reward uses the simple `outcome_enhancer_*` scorer, which was BROKEN until `e133cf1`, and lab job 226057 was submitted before `e133cf1` landed. **Note for lab**: the next SV-GSPO run will be on the post-bench-fix code; the existing one will serve as a "before" baseline. |
+| 226086 | NTv3-8m encoder grid | crashed; fix is in `0b30598`; lab to resubmit |
+| 226090 | HF upload T1+T3 | DONE |
+| 225956 | Enformer oracle (53 h hung) | open question; still on lab's "needs investigation" list |
+## Total ETA — H100 side (auto-firing chain)
+```
+Now            T3 zs_raw bench finalising flush                ~30 min
++30 min        T3 zs_raw genqual + oracle (reaper)            ~30 min CPU / 5 min GPU
++1 h           T3 zs_enriched bench                            ~5 h
++6 h           T3 zs_enriched genqual + oracle                 ~30 min
++6.5 h         post_bench_pipeline.sh fires:
+                 Stage 0/0c    score zs predictions             (already done by reaper)
+                 Stages 1-4    fusion-SFT × {T1,T2,T3,joint} + score_adapter        ~22 h
+                 Stage 3b      T3 reasoning-only                ~3 h
+                 Stage 3c      T3 RFT (multi-turn)              ~5 h
+                 Stage 3d      T3 reasoning-expansion (333/day)
+                 Stage 3e      T2 reasoning-expansion (gated on lab T2 regen)
+                 Stage 3f      T1 reasoning-expansion (333/day, idempotent)
+                 Stages 5-6    NTv3-only T1+T2 baselines        ~4 h
+                 Stage 7       aggregator + final HF push       minutes
++45 h          headline T1 / T2 / T3 / joint multitask H100 numbers ready
+```
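As a sanity check on the chain above, the timed steps can be summed directly. This sketch assumes the stages run strictly one after another and uses only the durations quoted in the chain; stages without a quoted duration (3d, 3e, 3f, 7) are left out, so the difference from +45 h is the budget that remains for them.

```python
# Steps before post_bench_pipeline.sh fires, in hours.
PRE_PIPELINE_H = [0.5, 0.5, 5.0, 0.5]  # flush, zs_raw scoring, zs_enriched bench, scoring

# Pipeline stages with a quoted duration, in hours.
STAGE_HOURS = {
    "Stages 1-4 fusion-SFT + score_adapter": 22.0,
    "Stage 3b T3 reasoning-only": 3.0,
    "Stage 3c T3 RFT (multi-turn)": 5.0,
    "Stages 5-6 NTv3-only baselines": 4.0,
}
QUOTED_TOTAL_H = 45.0

pipeline_start = sum(PRE_PIPELINE_H)
timed_total = pipeline_start + sum(STAGE_HOURS.values())
slack = QUOTED_TOTAL_H - timed_total

print(f"pipeline fires at +{pipeline_start:.1f} h")
print(f"timed stages end at +{timed_total:.1f} h; {slack:.1f} h left for untimed stages")
# prints: pipeline fires at +6.5 h
# prints: timed stages end at +40.5 h; 4.5 h left for untimed stages
```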
+Lab side, in parallel: Stage 4-arch (after scancel + restart, ~3
+days), T2 regen v5 (~2 days), T2 pair_aux ablation (~30 h to
+completion), SV-GSPO (open-ended training).
+## What I'm NOT doing without explicit OK (auto-mode rule)
+* Unilaterally `scancel` lab's running jobs.
+* Spend H100 GPU on duplicating lab's arch ablation (lab is
+  running the 3-way arch ablation in parallel; H100 stays on the
+  LLaVA headline chain).
+* Push to `mllm-integrate` (H100 only pushes to
+  `mllm-integrate-server2`; lab merges).
+## Next H100-side action (auto)
+When the T3 zs_raw bench produces `predictions.jsonl`:
+1. Reaper auto-fires `eval_t3_oracle.py` → headline T3 zs_raw oracle
+   metrics on the full 372 k test set (the row in Table 1's T3
+   column for "Zero-shot LLM").
+2. Push to HF and commit to GitHub once the genqual JSON lands.
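The reaper's trigger condition can be pictured as "a run directory has predictions but no metrics yet". The sketch below is an assumption-laden illustration, not the actual reaper: it assumes `genqual.json` is written next to `predictions.jsonl`, the helper `pending_runs` and the directory names are hypothetical, and it only lists candidates rather than invoking `eval_t3_oracle.py`.

```python
import tempfile
from pathlib import Path
from typing import List


def pending_runs(root: Path) -> List[Path]:
    """Run directories holding a non-empty predictions.jsonl but no genqual.json."""
    return sorted(
        p.parent
        for p in root.rglob("predictions.jsonl")
        if p.stat().st_size > 0 and not (p.parent / "genqual.json").exists()
    )


# Demo with a throwaway layout (directory names are hypothetical).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ("t3_zs_raw", "t1_zs_raw"):
        (root / name).mkdir()
        (root / name / "predictions.jsonl").write_text('{"pred": "ACGT"}\n')
    # Mark one run as already scored.
    (root / "t1_zs_raw" / "genqual.json").write_text("{}")
    todo = [p.name for p in pending_runs(root)]
```

Skipping empty files guards against scoring a `predictions.jsonl` that has been created but not yet flushed, which is the state the T3 zs_raw bench is currently in.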
+Standing by. Monitor `biuq50hlx` was timing out; re-arming it with an
+extended timeout.