Upload results/h100_snapshot.md with huggingface_hub
# H100 results snapshot – 2026-04-27 04:10 UTC

Live as of this commit. All numbers are full-test (no smoke cuts).

## Bench grid status

| Task | Prompt | n | Status | Wall-clock |
|---|---|---|---|---|
| T1 enhancer_generation | raw | 372,210 | DONE | finished earlier today |
| T1 enhancer_generation | enriched | 372,210 | DONE | finished earlier today |
| T2 pair_prediction | raw | 744,420 | DONE | finished earlier today |
| **T2 pair_prediction** | **enriched** | **744,420** | **DONE** | **just landed (~3.5 h)** |
| T3 enhancer_editing | raw | ~372k | RUNNING (PID 139902, 6 min in) | ~5 h ETA |
| T3 enhancer_editing | enriched | ~372k | queued | ~5 h ETA |

## T1 – full 372k, 7-cell breakdown

`runs/exp_t1_grid_separatedQA_20260426_h100_vllm_full/zs_{raw,enriched}/metrics.json`

| Metric | zs_raw | zs_enriched |
|---|---|---|
| `parse_rate` | 0.9996 | 0.9997 |
| `mean_gc_abs_err` | 0.116 | 0.126 |
| `mean_length_ratio` | **1.64** | **1.67** |

Per-cell (n shown for context):

| Cell | n | zs_raw len_ratio | zs_enriched len_ratio | gc_err raw | gc_err enriched |
|---|---|---|---|---|---|
| Ex | 86,088 | 1.63 | 1.67 | 0.115 | 0.128 |
| Mic | 74,828 | 1.64 | 1.69 | 0.113 | 0.123 |
| Oli | 63,278 | 1.64 | 1.68 | 0.119 | 0.124 |
| In | 50,872 | 1.63 | 1.65 | 0.116 | 0.128 |
| Ast | 48,623 | 1.64 | 1.66 | 0.116 | 0.125 |
| OPC | 40,162 | 1.64 | 1.66 | 0.115 | 0.122 |
| End | 8,359 | 1.65 | 1.70 | 0.118 | 0.137 |

**Reading**: zero-shot Qwen3.5-2B over-generates by ~65% (length_ratio 1.64–1.67 vs target 1.00). Tool-enriched is slightly worse on both length (1.67 vs 1.64) and GC error (0.126 vs 0.116): adding the tool_context block confuses the small model rather than helping it.
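
For reference, the two headline T1 metrics can be sketched as below. This is a hypothetical reconstruction (the real definitions live in the eval code, not in this snapshot); `gc_fraction` and `t1_metrics` are illustrative names.

```python
# Hypothetical reconstruction of the two T1 metrics:
# length_ratio: len(generated) / len(target), so 1.00 means on-target length.
# gc_abs_err:   |GC(generated) - GC(target)|, GC as a fraction in [0, 1].

def gc_fraction(seq: str) -> float:
    """Fraction of G/C bases in a DNA sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def t1_metrics(generated: str, target: str) -> dict:
    return {
        "length_ratio": len(generated) / len(target),
        "gc_abs_err": abs(gc_fraction(generated) - gc_fraction(target)),
    }

# A 64 bp generation against a 40 bp target gives length_ratio 1.6,
# i.e. the ~60-65% over-generation pattern in the tables above.
m = t1_metrics("AT" * 16 + "GC" * 16, "ATGC" * 10)
```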

**Next number to land**: post-bench Stage 1 trains the T1 fusion-SFT adapter; Stage 1b then runs `predict_fusion.py` + `run_generation_eval.py` (for FBD/spec/argmax) on the trained adapter. The first oracle metrics for T1 fusion-SFT will land in `runs/exp_t1_fusion_sft_20260427_h100/predict_t1_{raw,enriched}/genqual/genqual.json`.

## T2 – full 744k, 7-cell breakdown

`runs/exp_t2_grid_separatedQA_20260426_h100_vllm_full/zs_{raw,enriched}/metrics.json`

| Metric | zs_raw | zs_enriched |
|---|---|---|
| `accuracy` | 0.500 | **0.500** |
| `f1` | 0.0001 | **0.002** |
| `precision` | 0.65 | 0.58 |
| `recall` | 0.00003 | **0.001** |
| `parse_rate` | 1.000 | 1.000 |

Per-cell precision and recall (zs_enriched):

| Cell | n | precision | recall |
|---|---|---|---|
| Oli | 126,556 | 0.68 | 0.0005 |
| Ex | 172,176 | 0.64 | 0.0015 |
| OPC | 80,324 | 0.64 | 0.0002 |
| End | 16,718 | 0.56 | 0.0011 |
| Mic | 149,656 | 0.55 | 0.0004 |
| In | 101,744 | 0.54 | 0.0011 |
| Ast | 97,246 | 0.51 | 0.0021 |

**Reading**: zero-shot is **degenerate**: it almost always predicts `not_paired`, giving recall ~0.001 across all cells. Tool-enriched is slightly better (F1 0.002 vs 0.0001) but still effectively useless without fine-tuning. Also, per the 04:00 lab message §2, the **T2 enhancer side has no TFBS scan in the prod tool_context**: the model is being asked to reason about pairing using only the promoter's TFBS. That is the bigger fix; it needs galaxy-side regen.
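
Why accuracy 0.500 alongside F1 near zero flags a degenerate classifier: on a balanced paired/not_paired set, predicting `not_paired` almost every time still scores ~50% accuracy while recall collapses. The toy counts below are illustrative, not the actual T2 confusion-matrix numbers.

```python
# Standard precision/recall/F1/accuracy from confusion-matrix counts,
# applied to a near-constant "not_paired" predictor on a balanced set.

def prf(tp: int, fp: int, fn: int, tn: int):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy

# 100k positives, 100k negatives; only 100 "paired" calls, 60 correct:
# decent precision (0.6), vanishing recall -- the T2 shape above.
p, r, f1, acc = prf(tp=60, fp=40, fn=99_940, tn=99_960)
```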

## T3 – pending

T3 zs_raw started at 04:04 UTC (running now). After both T3 zs benches finish, `post_bench_pipeline.sh` fires automatically and the v5 chain runs (Stages 0c, 1, 1b, 2, 2b, 3, 3b, 3c, 3d, 4, 4b ×3, 5, 6, 7).

## Side artefacts

* T1 reasoning expansion (Ling-2.6-1T): 146/333 done at 04:10 UTC, ~30 min remaining. Output: `data/reasoning_traces/train.enhancer_generation.reasoning.jsonl`.
* Multi-turn RFT + SV-GSPO reward fix (commits `25504fd`, `e133cf1`).
* SFT collator now sanitises before tokenisation (commit `bda9ee0`, pre-flighted before Stages 1–4 fire; without this, every fusion-SFT run would have trained on leaky data).
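
The shape of the collator fix can be sketched as below. This is a hypothetical illustration only: the field names (`prompt`, `target`, the leaky-key set) and helper names are assumptions, and the real collator lives in the repo at commit `bda9ee0`.

```python
# Hypothetical sketch of "sanitise before tokenisation": drop any
# answer-bearing keys from the prompt side of each example BEFORE the
# tokenizer sees the text, so the target cannot leak into the inputs.
# All field names here are illustrative, not the repo's actual schema.

LEAKY_KEYS = {"target_sequence", "answer", "oracle_label"}

def sanitise_example(example: dict) -> dict:
    """Return a copy of the example with leaky prompt fields removed."""
    prompt = {k: v for k, v in example["prompt"].items()
              if k not in LEAKY_KEYS}
    return {"prompt": prompt, "target": example["target"]}

def collate(batch, tokenize):
    """Sanitise first, then tokenise: the ordering is the whole fix."""
    return [tokenize(sanitise_example(ex)) for ex in batch]
```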

## What's NOT yet in the chain

* T3-RFT-from-joint ablation: see `t3_post_v5_followups.md` §1; lab GPU welcome.
* Loop-SFT on post-RFT JSONL: see `t3_post_v5_followups.md` §3.
* T2 enhancer TFBS scan regen: needs galaxy CPU; see lab message v2 §2.
* TACO + HyenaDNA external baselines: see lab message v2 §5.