explcre committed on
Commit
a233b8a
·
verified ·
1 Parent(s): 67bf754

Upload results/status_20260427_0700utc.md with huggingface_hub

Files changed (1)
  1. results/status_20260427_0700utc.md +152 -0
results/status_20260427_0700utc.md ADDED
@@ -0,0 +1,152 @@
+ # Pipeline status: 2026-04-27 ~07:00 UTC
+
+ Lab pushed 6 commits between 04:40 and 07:17 UTC; H100 has merged them
+ into `mllm-integrate-server2` (HEAD `f97f054`). This report gives the
+ auto-mode summary and the user decisions blocked on confirmation.
+
+ ## Auto-progress on H100 (no user input needed)
+
+ | Item | State |
+ |---|---|
+ | **T3 zs_raw bench** (PID 139902) | running, ~02:55 elapsed; tqdm shows N/N (display tick only; actual `predictions.jsonl` not yet flushed) |
+ | **T3 zs_enriched bench** | queued; auto-fires when zs_raw exits |
+ | **Reaper PID 146066** | alive; auto-scores every fresh `predictions.jsonl` with `eval_t3_oracle.py` + `run_generation_eval.py` (`--device auto` → CPU now, GPU after vLLM frees) |
+ | **T1 zs_raw oracle metrics** | LANDED (genqual.json on HF; per-cell numbers in `results/zeroshot_results_table_20260427.md`) |
+ | **T1 zs_enriched oracle metrics** | scoring (CPU; ETA ~25 min more) |
+ | **T1 reasoning expansion 333/333** | DONE; HF mirrored |
+ | **Branch sync** | merged lab's 6 commits; pushed `f97f054`; **0 behind, 2 ahead** |
+
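The reaper behaviour in the table (score each fresh `predictions.jsonl` exactly once) can be sketched as a single polling pass. This is a minimal sketch under stated assumptions, not the actual reaper: the `.scored` marker file and the helper names `find_unscored` and `reap_once` are hypothetical, and the scorers are passed in as plain callables rather than invoking `eval_t3_oracle.py` or `run_generation_eval.py`, whose command-line flags are not shown in this report.

```python
from pathlib import Path
from typing import Callable, Iterable, List

# Hypothetical bookkeeping: a marker file next to each scored predictions file.
MARKER = ".scored"


def find_unscored(root: Path) -> List[Path]:
    """Return non-empty predictions.jsonl files whose directory has no marker."""
    fresh = []
    for pred in sorted(root.rglob("predictions.jsonl")):
        if pred.stat().st_size == 0:
            continue  # created but not flushed yet
        if (pred.parent / MARKER).exists():
            continue  # already scored on an earlier pass
        fresh.append(pred)
    return fresh


def reap_once(root: Path, scorers: Iterable[Callable[[Path], None]]) -> List[Path]:
    """Run every scorer on every unscored file, then mark the file as done."""
    scorers = list(scorers)
    done = []
    for pred in find_unscored(root):
        for score in scorers:
            score(pred)
        (pred.parent / MARKER).touch()
        done.append(pred)
    return done
```

A non-empty file may still be mid-flush, so a real reaper would also need a completion signal (for example, waiting for the bench process to exit); that part is left out here.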
+ ## Lab updates absorbed
+
+ | Commit | What |
+ |---|---|
+ | `bb9a5f1` | Lab merged H100's `4149b16 + 694181f + cb91c26 + bda9ee0` into `mllm-integrate` |
+ | `0b30598` | Per-job TRITON_CACHE_DIR fix; unblocks 226086 (NTv3-8m encoder) re-run |
+ | `3b40df9` | Lab snapshot 07:17 UTC + the explicit "scancel 226075/076/077?" decision |
+ | `ff3ab4d` | full_enriched data is the headline; prod_samples is iteration |
+ | `c4d8981` | second sync of `mllm-integrate-server2` into `mllm-integrate` |
+ | `db4ac99` | T2 regen v5 runbook: 192 shards (3× speedup) instead of 64 |
+
+ ## Lab full enriched JSONL: landed on HF
+
+ ```
+ explcre/celltype_conditioned_enhancer_generation/data/full_enriched/jsonl/
+ ├── train.enhancer_generation.jsonl   14.2 GB  (1,509,379 rows)  ← T1 train
+ ├── test.enhancer_generation.jsonl    3.50 GB  (372,210 rows)    ← T1 test (== H100's prod_full_test)
+ ├── train.enhancer_editing.jsonl      ~14 GB   (1,509,379 rows)  ← T3 train
+ └── test.enhancer_editing.jsonl       3.69 GB  (372,210 rows)    ← T3 test
+ ```
+
+ T2 still pending (galaxy regen v5 must succeed first; see "Decision 2"
+ below).
+
+ **Key match**: H100's `data/prod_full_test/jsonl/test.enhancer_generation.jsonl`
+ is **bit-identical** to lab's `data/full_enriched/test.enhancer_generation.jsonl`
+ (both 372,210 rows, 3.50 GB). H100's existing zs benches are already
+ on the headline test set.
+
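Matching row counts and sizes are consistent with bit-identity but do not prove it; comparing digests does. The following is a minimal sketch of such a check, assuming both files are available locally. The helper names `fingerprint` and `bit_identical` are illustrative and not part of the repository.

```python
import hashlib
from pathlib import Path
from typing import Tuple


def fingerprint(path: Path, chunk_size: int = 1 << 20) -> Tuple[str, int, int]:
    """Return (SHA-256 hex digest, size in bytes, newline count) for a file.

    The file is streamed in chunks so multi-GB JSONL files are never
    loaded into memory at once.
    """
    digest = hashlib.sha256()
    size = 0
    newlines = 0
    with open(path, "rb") as fh:
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                break
            digest.update(chunk)
            size += len(chunk)
            newlines += chunk.count(b"\n")
    return digest.hexdigest(), size, newlines


def bit_identical(a: Path, b: Path) -> bool:
    """True only if both files have the same digest, size, and newline count."""
    return fingerprint(a) == fingerprint(b)
```

The newline count equals the row count only when every row, including the last, ends with a newline.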
+ What changes when we use `full_enriched/train.*.jsonl` (1.5 M rows)
+ vs the legacy `prod_samples/strat7c.n35k.jsonl` (35 k rows): training
+ sees ~43× more data and takes ~43× longer (a Stage-1 fusion-SFT goes
+ from ~3 h on H100 to ~5 days). Tradeoff:
+
+ * **Lab side**: spending the GPU-days on full_enriched IS the headline.
+ * **H100 side**: stays on `prod_samples/n35k` for fast iteration cycles
+   (one-day end-to-end pipeline, multiple ablations / tier
+   comparisons). The two together fill the paper table: H100's
+   smaller-N runs serve as "controlled" ablation rows; lab's
+   larger-N run is the headline.
+
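The ~43× and ~5 day figures follow from a linear extrapolation on row count. As a quick check, assuming "35 k" means exactly 35,000 rows and that wall-clock scales in proportion to rows (both are assumptions, not measurements):

```python
FULL_ROWS = 1_509_379   # full_enriched train rows, from the listing above
SAMPLE_ROWS = 35_000    # "35 k" taken as exactly 35,000 (assumption)
SAMPLE_HOURS = 3.0      # ~3 h Stage-1 fusion-SFT on H100, from the text


def scaled_hours(full_rows: int, sample_rows: int, sample_hours: float) -> float:
    """Linear extrapolation: wall-clock grows in proportion to row count."""
    return sample_hours * full_rows / sample_rows


ratio = FULL_ROWS / SAMPLE_ROWS
hours = scaled_hours(FULL_ROWS, SAMPLE_ROWS, SAMPLE_HOURS)
days = hours / 24
print(f"{ratio:.1f}x data, ~{hours:.0f} h, ~{days:.1f} days")
# prints: 43.1x data, ~129 h, ~5.4 days
```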
+ ## ⚠️ Two decisions blocked on user (per auto-mode rule on destructive actions)
+
+ ### Decision 1: scancel 226075/076/077 arch ablation?
+
+ **Recommend: YES, scancel all three.** Lab is "awaiting user OK" per
+ their snapshot. Reasoning:
+
+ * The three jobs were submitted from `ff3ab4d`, which predates `cb91c26`
+   (the fix for the unified collator's training-time leak).
+ * Lab confirmed the leak signature in 226076 (eval=0.036 at step 1000
+   matches H100's pre-fix collapse pattern).
+ * 226075 hit `eval_loss=NaN @ 1500`, almost certainly the same
+   leak triggering numeric instability once the assistant span aligns
+   with the leaky prompt fragment.
+ * They're at 28–34 % of one epoch. Sunk cost ≈ 8 h × 3 GPUs.
+ * Continuing them produces a paper row that **will be invalid**
+   (training on cheat-able data), which a reviewer will flag if we
+   publish. Better to take the 8 h hit now than the rejection.
+ * Resubmit on `bb9a5f1` (which has `cb91c26` + `bda9ee0` + Triton
+   cache fix). Same recipe, ~3 days for the new epoch.
+
+ If the user confirms, lab can `scancel 226075 226076 226077` and
+ resubmit `slurm/run_unified_arch_ablation.sh` from the new HEAD.
+
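Whether a given HEAD already contains the collator fix can be checked mechanically before resubmitting, using `git merge-base --is-ancestor`, which exits 0 when the first commit is an ancestor of the second and 1 when it is not. The wrapper below is a minimal sketch; the function names `contains_fix` and `_git_exit_code` are hypothetical, and it assumes it is run inside a checkout that has both commits.

```python
import subprocess
from typing import Callable, Sequence


def _git_exit_code(args: Sequence[str]) -> int:
    """Run git with the given arguments and return its exit code."""
    return subprocess.run(["git", *args], check=False).returncode


def contains_fix(
    head: str,
    fix: str,
    run: Callable[[Sequence[str]], int] = _git_exit_code,
) -> bool:
    """True if commit `fix` is an ancestor of (i.e. included in) `head`."""
    code = run(["merge-base", "--is-ancestor", fix, head])
    if code not in (0, 1):
        # Any other exit code means git itself failed (e.g. unknown commit).
        raise RuntimeError(f"git merge-base failed with exit code {code}")
    return code == 0
```

If the reasoning above is right, `contains_fix("bb9a5f1", "cb91c26")` should be true and `contains_fix("ff3ab4d", "cb91c26")` false in the lab checkout; that is what this report implies, not a verified result.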
+ ### Decision 2: T2 regen v5 (192 shards on galaxy)
+
+ **Recommend: lab launches now.** Lab's runbook
+ [`docs/t2_regen_runbook.md`](docs/t2_regen_runbook.md) covers it
+ explicitly: `sbatch regureasoner_loop/slurm/run_t2_regen_enhscan_galaxy.sh`
+ on lab cluster. PYTHON_BIN is hardcoded in the wrapper, so the v4
+ silent-exit bug is fixed.
+
+ Expected wall-clock: **~2 days for full T2 train+test (~32 GB)**
+ sharded 192-way; cache-warmed on T1 promoter scans.
+
+ Once this lands, it unblocks H100 Stage 3e (T2 reasoning expansion)
+ and the Stage 1b T2 fusion-SFT bench-rerun.
+
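The 3× speedup quoted for 192 shards versus 64 assumes all shards run concurrently and that per-shard time is proportional to shard size. A minimal sketch of contiguous sharding illustrates the arithmetic; the row count of 1,000,000 is a hypothetical placeholder (the T2 row count is not given in this report), and `shard_bounds` is an illustrative helper, not the runbook's implementation.

```python
from typing import List, Tuple


def shard_bounds(n_rows: int, n_shards: int) -> List[Tuple[int, int]]:
    """Split range(n_rows) into contiguous [start, stop) slices.

    Shard sizes differ by at most one row; the first `n_rows % n_shards`
    shards get the extra row.
    """
    if n_shards <= 0:
        raise ValueError("n_shards must be positive")
    base, extra = divmod(n_rows, n_shards)
    bounds = []
    start = 0
    for i in range(n_shards):
        stop = start + base + (1 if i < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


N_ROWS = 1_000_000  # hypothetical row count, for illustration only
largest_64 = max(stop - start for start, stop in shard_bounds(N_ROWS, 64))
largest_192 = max(stop - start for start, stop in shard_bounds(N_ROWS, 192))
print(largest_64, largest_192)
# prints: 15625 5209
```

The largest shard shrinks by a factor of about 3, which bounds the wall-clock gain when the cluster can actually run all 192 shards at once.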
+ ## What's happening on lab cluster regardless of user decisions
+
+ | Job | What | Status |
+ |---|---|---|
+ | 226049 | T2 pair_aux=none (asym pair) | training 20+ h, **NOT affected** by `cb91c26` (asym pair trainer, not unified collator). Keep running. |
+ | 226050 | T2 pair_aux=supcon_pair | same |
+ | 226051 | T2 pair_aux=tier_aware_supcon | same |
+ | 226057 | SV-GSPO v5 | running 8+ h; the SV-GSPO outcome reward uses the simple `outcome_enhancer_*` scorer, which was BROKEN until `e133cf1`, and job 226057 was submitted before `e133cf1` landed. **Note for lab**: the next SV-GSPO run will be on the post-bench-fix code; the existing one will give a "before" baseline. |
+ | 226086 | NTv3-8m encoder grid | crashed; fix is in `0b30598`; lab to resubmit |
+ | 226090 | HF upload T1+T3 | DONE |
+ | 225956 | Enformer oracle (53 h hung) | open question; still in lab's "needs investigation" |
+
+ ## Total ETA: H100 side (auto-firing chain)
+
+ ```
+ Now      T3 zs_raw bench finalising flush                              ~30 min
+ +30 min  T3 zs_raw genqual + oracle (reaper)                           ~30 min CPU / 5 min GPU
+ +1 h     T3 zs_enriched bench                                          ~5 h
+ +6 h     T3 zs_enriched genqual + oracle                               ~30 min
+ +6.5 h   post_bench_pipeline.sh fires:
+            Stage 0/0c  score zs predictions (already done by reaper)
+            Stages 1-4  fusion-SFT × {T1,T2,T3,joint} + score_adapter   ~22 h
+            Stage 3b    T3 reasoning-only                               ~3 h
+            Stage 3c    T3 RFT (multi-turn)                             ~5 h
+            Stage 3d    T3 reasoning-expansion (333/day)
+            Stage 3e    T2 reasoning-expansion (gated on lab T2 regen)
+            Stage 3f    T1 reasoning-expansion (333/day, idempotent)
+            Stages 5-6  NTv3-only T1+T2 baselines                       ~4 h
+            Stage 7     aggregator + final HF push                      minutes
+ +45 h    headline T1 / T2 / T3 / joint multitask H100 numbers ready
+ ```
+
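The offsets in the schedule above are running sums of the per-step durations. The sketch below reproduces them from the stated estimates, assuming the steps run strictly back to back and using the CPU figure for the first scoring step; stages listed without a duration (3d, 3e, 3f, 7) are left out of the sum.

```python
from typing import List, Tuple

# (step, estimated hours) before post_bench_pipeline.sh fires
PRE_PIPELINE = [
    ("T3 zs_raw bench flush", 0.5),
    ("T3 zs_raw genqual + oracle (CPU)", 0.5),
    ("T3 zs_enriched bench", 5.0),
    ("T3 zs_enriched genqual + oracle", 0.5),
]

# Pipeline stages that have a stated duration
TIMED_STAGES = [
    ("Stages 1-4 fusion-SFT + score_adapter", 22.0),
    ("Stage 3b T3 reasoning-only", 3.0),
    ("Stage 3c T3 RFT (multi-turn)", 5.0),
    ("Stages 5-6 NTv3-only baselines", 4.0),
]


def start_offsets(steps: List[Tuple[str, float]]) -> Tuple[List[Tuple[str, float]], float]:
    """Return each step's start offset in hours, plus the total elapsed time."""
    offsets = []
    elapsed = 0.0
    for name, hours in steps:
        offsets.append((name, elapsed))
        elapsed += hours
    return offsets, elapsed


offsets, pipeline_start = start_offsets(PRE_PIPELINE)
timed_total = sum(hours for _, hours in TIMED_STAGES)
print(pipeline_start, timed_total, pipeline_start + timed_total)
# prints: 6.5 34.0 40.5
```

The timed stages end at about +40.5 h, so the +45 h headline estimate leaves roughly 4.5 h for the stages without a stated duration.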
+ Lab side, in parallel: Stage 4-arch (after scancel + restart, ~3
+ days), T2 regen v5 (~2 days), T2 pair_aux ablation (~30 h to
+ completion), SV-GSPO (open-ended training).
+
+ ## What I'm NOT doing without explicit OK (auto-mode rule)
+
+ * Do not unilaterally `scancel` lab's running jobs.
+ * Do not spend H100 GPU on duplicating lab's arch ablation (lab is
+   doing 3-way arch in parallel; H100 stays on the LLaVA headline
+   chain).
+ * Do not push to `mllm-integrate` (H100 only pushes to `server2`;
+   lab merges).
+
+ ## Next H100-side action (auto)
+
+ When the T3 zs_raw bench produces `predictions.jsonl`:
+ 1. Reaper auto-fires `eval_t3_oracle.py` → headline T3 zs_raw oracle
+    metrics on the full 372 k test set (the row in Table 1's T3
+    column for "Zero-shot LLM").
+ 2. Push to HF and commit to GitHub when the genqual JSON lands.
+
+ Standing by; Monitor `biuq50hlx` was timing out, so re-arm with
+ extended timeout below.