Commit 9e147e3 by explcre · verified · 1 Parent(s): 9e7e687

Upload results/h100_snapshot.md with huggingface_hub

  1. results/h100_snapshot.md +104 -0
results/h100_snapshot.md ADDED
@@ -0,0 +1,104 @@
+ # H100 results snapshot: 2026-04-27 04:10 UTC
+
+ Live as of this commit. All numbers come from full test sets (no smoke-test cuts).
+
+ ## Bench grid status
+
+ | Task | Prompt | n | Status | Wall-clock / ETA |
+ |---|---|---|---|---|
+ | T1 enhancer_generation | raw | 372,210 | DONE | finished earlier today |
+ | T1 enhancer_generation | enriched | 372,210 | DONE | finished earlier today |
+ | T2 pair_prediction | raw | 744,420 | DONE | finished earlier today |
+ | **T2 pair_prediction** | **enriched** | **744,420** | **DONE** | **just landed (~3.5 h)** |
+ | T3 enhancer_editing | raw | (~372k) | RUNNING (PID 139902, 6 min in) | ~5 h ETA |
+ | T3 enhancer_editing | enriched | (~372k) | queued | ~5 h ETA |
+
+ ## T1: full 372k, 7-cell breakdown
+
+ `runs/exp_t1_grid_separatedQA_20260426_h100_vllm_full/zs_{raw,enriched}/metrics.json`
+
+ | Metric | zs_raw | zs_enriched |
+ |---|---|---|
+ | `parse_rate` | 0.9996 | 0.9997 |
+ | `mean_gc_abs_err` | 0.116 | 0.126 |
+ | `mean_length_ratio` | **1.64** | **1.67** |
+
+ Per-cell (n shown for context):
+
+ | Cell | n | len_ratio (zs_raw) | len_ratio (zs_enriched) | gc_abs_err (zs_raw) | gc_abs_err (zs_enriched) |
+ |---|---|---|---|---|---|
+ | Ex | 86,088 | 1.63 | 1.67 | 0.115 | 0.128 |
+ | Mic | 74,828 | 1.64 | 1.69 | 0.113 | 0.123 |
+ | Oli | 63,278 | 1.64 | 1.68 | 0.119 | 0.124 |
+ | In | 50,872 | 1.63 | 1.65 | 0.116 | 0.128 |
+ | Ast | 48,623 | 1.64 | 1.66 | 0.116 | 0.125 |
+ | OPC | 40,162 | 1.64 | 1.66 | 0.115 | 0.122 |
+ | End | 8,359 | 1.65 | 1.70 | 0.118 | 0.137 |
+
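As a sanity check on the per-cell table, the sketch below recomputes the headline T1 numbers as n-weighted means of the per-cell values. It assumes the headline metrics are plain per-sample means, so that weighting each cell by its n should reproduce them up to rounding. The names `T1_CELLS`, `weighted_mean`, and `aggregate` are illustrative and not part of the repo.

```python
# Per-cell rows copied from the T1 table:
# cell -> (n, len_ratio raw, len_ratio enriched, gc_err raw, gc_err enriched)
T1_CELLS = {
    "Ex": (86_088, 1.63, 1.67, 0.115, 0.128),
    "Mic": (74_828, 1.64, 1.69, 0.113, 0.123),
    "Oli": (63_278, 1.64, 1.68, 0.119, 0.124),
    "In": (50_872, 1.63, 1.65, 0.116, 0.128),
    "Ast": (48_623, 1.64, 1.66, 0.116, 0.125),
    "OPC": (40_162, 1.64, 1.66, 0.115, 0.122),
    "End": (8_359, 1.65, 1.70, 0.118, 0.137),
}
COLUMNS = ("len_ratio_raw", "len_ratio_enriched", "gc_err_raw", "gc_err_enriched")


def weighted_mean(values, weights):
    """Mean of `values` weighted by `weights` (here: per-cell sample counts)."""
    total = sum(weights)
    if total == 0:
        raise ValueError("weights sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total


def aggregate(cells):
    """Return the n-weighted mean of every metric column across cells."""
    rows = list(cells.values())
    ns = [row[0] for row in rows]
    return {
        name: weighted_mean([row[i + 1] for row in rows], ns)
        for i, name in enumerate(COLUMNS)
    }


if __name__ == "__main__":
    print("total n:", sum(row[0] for row in T1_CELLS.values()))
    for name, value in aggregate(T1_CELLS).items():
        print(f"{name}: {value:.4f}")
```

The per-cell n values sum to the 372,210 reported in the grid. Because the per-cell inputs are already rounded, the recomputed means can differ from the headline values in the last reported digit.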
+ **Reading**: zero-shot Qwen3.5-2B over-generates by ~65% (length_ratio
+ 1.64–1.67 vs target 1.00). Tool-enriched is slightly worse on both length
+ (1.67 vs 1.64) and GC error (0.126 vs 0.116), which suggests that
+ adding the tool_context block confuses the small model rather than
+ helping it.
+
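For readers unfamiliar with the two T1 metrics, here is a minimal sketch of how they could be computed per sample. The definitions are assumptions inferred from the metric names (`length_ratio` as generated length over target length, `gc_abs_err` as the absolute difference in GC fraction); the actual evaluation code may differ, and the function names are hypothetical.

```python
def gc_fraction(seq):
    """Fraction of G/C among the A/C/G/T bases of `seq` (0.0 if there are none)."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def length_ratio(generated, target):
    """Assumed definition: len(generated) / len(target); 1.00 is a perfect match."""
    if not target:
        raise ValueError("target sequence is empty")
    return len(generated) / len(target)


def gc_abs_err(generated, target):
    """Assumed definition: |GC(generated) - GC(target)|."""
    return abs(gc_fraction(generated) - gc_fraction(target))


if __name__ == "__main__":
    target = "ACGTACGTAC"
    generated = "ACGTACGTACGGCCGG"  # 16 bases against a 10-base target
    print(length_ratio(generated, target))
    print(gc_abs_err(generated, target))
```

Under these assumed definitions, a mean length ratio of 1.64 means generated sequences are on average 64% longer than their targets.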
+ **Next number to land**: post-bench Stage 1 will train the T1
+ fusion-SFT adapter, then Stage 1b runs `predict_fusion.py` and
+ `run_generation_eval.py` (FBD/spec/argmax; not `eval_t3_oracle.py`,
+ which is for T3) on the trained adapter. The first oracle metrics
+ for T1 fusion-SFT will be in
+ `runs/exp_t1_fusion_sft_20260427_h100/predict_t1_{raw,enriched}/genqual/genqual.json`.
+
+ ## T2: full 744k, 7-cell breakdown
+
+ `runs/exp_t2_grid_separatedQA_20260426_h100_vllm_full/zs_{raw,enriched}/metrics.json`
+
+ | Metric | zs_raw | zs_enriched |
+ |---|---|---|
+ | `accuracy` | 0.500 | **0.500** |
+ | `f1` | 0.0001 | **0.002** |
+ | `precision` | 0.65 | 0.58 |
+ | `recall` | 0.00003 | **0.001** |
+ | `parse_rate` | 1.000 | 1.000 |
+
+ Per-cell precision and recall (zs_enriched), sorted by precision:
+
+ | Cell | n | precision | recall |
+ |---|---|---|---|
+ | Oli | 126,556 | 0.68 | 0.0005 |
+ | Ex | 172,176 | 0.64 | 0.0015 |
+ | OPC | 80,324 | 0.64 | 0.0002 |
+ | End | 16,718 | 0.56 | 0.0011 |
+ | Mic | 149,656 | 0.55 | 0.0004 |
+ | In | 101,744 | 0.54 | 0.0011 |
+ | Ast | 97,246 | 0.51 | 0.0021 |
+
+ **Reading**: zero-shot is **degenerate**: it almost always predicts
+ `not_paired`, so recall is ~0.001 overall and at most 0.0021 in any
+ cell (zs_enriched). Tool-enriched is slightly better (F1 0.002 vs
+ 0.0001) but still effectively useless without fine-tuning. Also, per
+ the 04:00 lab message §2, the **T2 enhancer side has no TFBS scan in
+ the prod tool_context**, so the model is being asked to reason about
+ pairing using only the promoter's TFBS. That is the bigger fix, and it
+ needs a galaxy-side regen.
+
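The T2 numbers can be cross-checked from precision and recall alone. The sketch below uses the standard F1 formula and then reconstructs accuracy under the assumption that the test set is balanced (half paired, half not paired); that balance is inferred from accuracy sitting at 0.500 while almost every prediction is negative, and is not stated in the source. Function names are illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def accuracy_from_pr(precision, recall, n_pos, n_neg):
    """Accuracy implied by precision/recall for given class sizes."""
    tp = recall * n_pos                     # true positives recovered
    fp = tp * (1 - precision) / precision   # false positives implied by precision
    tn = n_neg - fp                         # remaining negatives are predicted correctly
    return (tp + tn) / (n_pos + n_neg)


if __name__ == "__main__":
    # (precision, recall) pairs copied from the T2 metrics table.
    runs = {"zs_raw": (0.65, 0.00003), "zs_enriched": (0.58, 0.001)}
    half = 744_420 // 2  # ASSUMPTION: balanced positives and negatives
    for name, (p, r) in runs.items():
        print(name, f"f1={f1_score(p, r):.5f}", f"acc={accuracy_from_pr(p, r, half, half):.4f}")
```

With recall this close to zero, F1 is driven almost entirely by recall, which is why a precision of 0.58 to 0.65 still yields an F1 near zero.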
+ ## T3: pending
+
+ T3 zs_raw started at 04:04 UTC (running now). After both T3 zs benches
+ finish, `post_bench_pipeline.sh` fires automatically and the v5 chain
+ runs (Stages 0c, 1, 1b, 2, 2b, 3, 3b, 3c, 3d, 4, 4b ×3, 5, 6, 7).
+
+ ## Side artefacts
+
+ * T1 reasoning expansion (Ling-2.6-1T): 146/333 done at 04:10 UTC,
+   ~30 min remaining. `data/reasoning_traces/train.enhancer_generation.reasoning.jsonl`.
+ * Multi-turn RFT + SV-GSPO reward fix (commits `25504fd`, `e133cf1`).
+ * SFT collator now sanitises before tokenisation (commit `bda9ee0`,
+   pre-flighted before Stages 1–4 fire; without this, every fusion-SFT
+   run would have trained on leaky data).
+
+ ## What's NOT yet in the chain
+
+ * T3-RFT-from-joint ablation: see `t3_post_v5_followups.md` §1; lab
+   GPU welcome.
+ * Loop-SFT on post-RFT JSONL: see `t3_post_v5_followups.md` §3.
+ * T2 enhancer TFBS scan regen: needs galaxy CPU; see lab message v2 §2.
+ * TACO + HyenaDNA external baselines: see lab message v2 §5.