Upload results/h100_snapshot.md with huggingface_hub
# H100 results snapshot – 2026-04-27 04:10 UTC

Live as of this commit. All numbers are full-test (no smoke cuts).

## Bench grid status

| Task | Prompt | n | Status | Wall-clock |
|---|---|---|---|---|
| T1 enhancer_generation | raw | 372,210 | DONE | finished earlier today |
| T1 enhancer_generation | enriched | 372,210 | DONE | finished earlier today |
| T2 pair_prediction | raw | 744,420 | DONE | finished earlier today |
| **T2 pair_prediction** | **enriched** | **744,420** | **DONE** | **just landed (~3.5 h)** |
| T3 enhancer_editing | raw | ~372k | RUNNING (PID 139902, 6 min in) | ~5 h ETA |
| T3 enhancer_editing | enriched | ~372k | queued | ~5 h ETA |

## T1 – full 372k, 7-cell breakdown

`runs/exp_t1_grid_separatedQA_20260426_h100_vllm_full/zs_{raw,enriched}/metrics.json`

| Metric | zs_raw | zs_enriched |
|---|---|---|
| `parse_rate` | 0.9996 | 0.9997 |
| `mean_gc_abs_err` | 0.116 | 0.126 |
| `mean_length_ratio` | **1.64** | **1.67** |

Per-cell (n shown for context):

| Cell | n | zs_raw len_ratio | zs_enriched len_ratio | gc_err raw | gc_err enriched |
|---|---|---|---|---|---|
| Ex | 86,088 | 1.63 | 1.67 | 0.115 | 0.128 |
| Mic | 74,828 | 1.64 | 1.69 | 0.113 | 0.123 |
| Oli | 63,278 | 1.64 | 1.68 | 0.119 | 0.124 |
| In | 50,872 | 1.63 | 1.65 | 0.116 | 0.128 |
| Ast | 48,623 | 1.64 | 1.66 | 0.116 | 0.125 |
| OPC | 40,162 | 1.64 | 1.66 | 0.115 | 0.122 |
| End | 8,359 | 1.65 | 1.70 | 0.118 | 0.137 |

**Reading**: zero-shot Qwen3.5-2B over-generates by ~65% (length_ratio 1.64–1.67 vs target 1.00). Tool-enriched is slightly worse on both length (1.67 vs 1.64) and GC error (0.126 vs 0.116): adding the tool_context block confuses the small model rather than helping it.
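
For reference, the two headline T1 metrics can be sketched as below. This is a hypothetical reconstruction (the real definitions live in the eval code, not in this snapshot); `gc_fraction` and `t1_metrics` are illustrative names.

```python
# Hypothetical reconstruction of the two T1 metrics:
# length_ratio: len(generated) / len(target), so 1.00 means on-target length.
# gc_abs_err:   |GC(generated) - GC(target)|, GC as a fraction in [0, 1].

def gc_fraction(seq: str) -> float:
    """Fraction of G/C bases in a DNA sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def t1_metrics(generated: str, target: str) -> dict:
    return {
        "length_ratio": len(generated) / len(target),
        "gc_abs_err": abs(gc_fraction(generated) - gc_fraction(target)),
    }

# A 64 bp generation against a 40 bp target gives length_ratio 1.6,
# i.e. the ~60-65% over-generation pattern in the tables above.
m = t1_metrics("AT" * 16 + "GC" * 16, "ATGC" * 10)
```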

**Next number to land**: post-bench Stage 1 trains the T1 fusion-SFT adapter; Stage 1b then runs `predict_fusion.py` + `run_generation_eval.py` (for FBD/spec/argmax) on the trained adapter. The first oracle metrics for T1 fusion-SFT will land in `runs/exp_t1_fusion_sft_20260427_h100/predict_t1_{raw,enriched}/genqual/genqual.json`.

## T2 – full 744k, 7-cell breakdown

`runs/exp_t2_grid_separatedQA_20260426_h100_vllm_full/zs_{raw,enriched}/metrics.json`

| Metric | zs_raw | zs_enriched |
|---|---|---|
| `accuracy` | 0.500 | **0.500** |
| `f1` | 0.0001 | **0.002** |
| `precision` | 0.65 | 0.58 |
| `recall` | 0.00003 | **0.001** |
| `parse_rate` | 1.000 | 1.000 |

Per-cell precision and recall (zs_enriched):

| Cell | n | precision | recall |
|---|---|---|---|
| Oli | 126,556 | 0.68 | 0.0005 |
| Ex | 172,176 | 0.64 | 0.0015 |
| OPC | 80,324 | 0.64 | 0.0002 |
| End | 16,718 | 0.56 | 0.0011 |
| Mic | 149,656 | 0.55 | 0.0004 |
| In | 101,744 | 0.54 | 0.0011 |
| Ast | 97,246 | 0.51 | 0.0021 |

**Reading**: zero-shot is **degenerate**: it almost always predicts `not_paired`, giving recall ~0.001 across all cells. Tool-enriched is slightly better (F1 0.002 vs 0.0001) but still effectively useless without fine-tuning. Also, per the 04:00 lab message §2, the **T2 enhancer side has no TFBS scan in the prod tool_context**: the model is being asked to reason about pairing using only the promoter's TFBS. That is the bigger fix; it needs galaxy-side regen.
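
Why accuracy 0.500 alongside F1 near zero flags a degenerate classifier: on a balanced paired/not_paired set, predicting `not_paired` almost every time still scores ~50% accuracy while recall collapses. The toy counts below are illustrative, not the actual T2 confusion-matrix numbers.

```python
# Standard precision/recall/F1/accuracy from confusion-matrix counts,
# applied to a near-constant "not_paired" predictor on a balanced set.

def prf(tp: int, fp: int, fn: int, tn: int):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy

# 100k positives, 100k negatives; only 100 "paired" calls, 60 correct:
# decent precision (0.6), vanishing recall -- the T2 shape above.
p, r, f1, acc = prf(tp=60, fp=40, fn=99_940, tn=99_960)
```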

## T3 – pending

T3 zs_raw started at 04:04 UTC (running now). After both T3 zs benches finish, `post_bench_pipeline.sh` fires automatically and the v5 chain runs (Stages 0c, 1, 1b, 2, 2b, 3, 3b, 3c, 3d, 4, 4b ×3, 5, 6, 7).

## Side artefacts

* T1 reasoning expansion (Ling-2.6-1T): 146/333 done at 04:10 UTC, ~30 min remaining. Output: `data/reasoning_traces/train.enhancer_generation.reasoning.jsonl`.
* Multi-turn RFT + SV-GSPO reward fix (commits `25504fd`, `e133cf1`).
* SFT collator now sanitises before tokenisation (commit `bda9ee0`, pre-flighted before Stages 1–4 fire; without this, every fusion-SFT run would have trained on leaky data).
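
The shape of the collator fix can be sketched as below. This is a hypothetical illustration only: the field names (`prompt`, `target`, the leaky-key set) and helper names are assumptions, and the real collator lives in the repo at commit `bda9ee0`.

```python
# Hypothetical sketch of "sanitise before tokenisation": drop any
# answer-bearing keys from the prompt side of each example BEFORE the
# tokenizer sees the text, so the target cannot leak into the inputs.
# All field names here are illustrative, not the repo's actual schema.

LEAKY_KEYS = {"target_sequence", "answer", "oracle_label"}

def sanitise_example(example: dict) -> dict:
    """Return a copy of the example with leaky prompt fields removed."""
    prompt = {k: v for k, v in example["prompt"].items()
              if k not in LEAKY_KEYS}
    return {"prompt": prompt, "target": example["target"]}

def collate(batch, tokenize):
    """Sanitise first, then tokenise: the ordering is the whole fix."""
    return [tokenize(sanitise_example(ex)) for ex in batch]
```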

## What's NOT yet in the chain

* T3-RFT-from-joint ablation: see `t3_post_v5_followups.md` §1; lab GPU welcome.
* Loop-SFT on post-RFT JSONL: see `t3_post_v5_followups.md` §3.
* T2 enhancer TFBS scan regen: needs galaxy CPU; see lab message v2 §2.
* TACO + HyenaDNA external baselines: see lab message v2 §5.