explcre committed on
Commit 9bd3c19 · verified · 1 Parent(s): 9dc753a

Upload docs/t3_metrics_quickref.md with huggingface_hub

  1. docs/t3_metrics_quickref.md +131 -0
docs/t3_metrics_quickref.md ADDED
@@ -0,0 +1,131 @@
1
+ # T3 metrics: quick reference
2
+
3
+ This is the one-pager you read when looking at a T3 results file.
4
+ Authoritative implementation: `scripts/eval_t3_oracle.py`. Design
5
+ rationale (why these and not heuristic-overlap): `t3_evaluation_design.md`.
6
+
7
+ ## Per-row binary outcomes
8
+
9
+ For each test row the scorer emits these flags:
10
+
11
+ | Flag | Definition | Notes |
12
+ |---|---|---|
13
+ | `within_budget` | `hamming(pred, ref) ≤ row.metadata.edit_budget` | Row's own assigned budget, typically 10 bp absolute. |
14
+ | `length_preserved` | `len(pred) == len(ref)` | Reference is usually 500 bp. |
15
+ | `target_motif_present` | IUPAC regex match for `row.metadata.target_motif` in `pred` | Forward + reverse-complement scan. |
16
+ | `objective_success` | Depends on `row.metadata.edit_type` (see below). | The headline "did this edit do its job" flag. |
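The forward + reverse-complement IUPAC scan behind `target_motif_present` can be sketched as below. This is a minimal illustration, not the code in `scripts/eval_t3_oracle.py`: the function names `reverse_complement` and `motif_present` are hypothetical, and the sketch assumes the motif is a plain IUPAC string with no gaps or repeat syntax.

```python
import re

# Standard IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
# Complement table, including the ambiguity codes.
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def motif_present(pred: str, motif: str) -> bool:
    """True if the IUPAC motif matches pred on either strand (hypothetical helper)."""
    pattern = re.compile("".join(IUPAC[base] for base in motif.upper()))
    pred = pred.upper()
    return bool(pattern.search(pred) or pattern.search(reverse_complement(pred)))
```

Scanning the reverse complement of `pred` with the forward pattern is equivalent to scanning `pred` with the reverse-complemented motif, so one compiled pattern covers both strands.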
17
+
18
+ `objective_success` per `edit_type`:
19
+
20
+ | `edit_type` | `objective_success` true iff … |
21
+ |---|---|
22
+ | `activity_boost` | `pred_activity_src > ref_activity_src` (oracle activity in source cell type went up) |
23
+ | `cell_type_transfer` | `(pred_tgt − pred_src) − (ref_tgt − ref_src) > 0` (relative shift toward target cell type increased) |
24
+ | `promoter_retarget` | `target_motif_present` (the new TF motif landed in the sequence) |
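The dispatch in the table above could look roughly like this. It is a sketch under assumptions: the function name, the flat `scores` dict, and the error on unknown edit types are all hypothetical; only the three comparison rules come from the table.

```python
def objective_success(edit_type: str, scores: dict, target_motif_present: bool) -> bool:
    """Apply the per-edit_type success rule (hypothetical helper, not the real scorer)."""
    if edit_type == "activity_boost":
        # Oracle activity in the source cell type went up.
        return scores["pred_activity_src"] > scores["ref_activity_src"]
    if edit_type == "cell_type_transfer":
        # Relative shift toward the target cell type increased.
        pred_shift = scores["pred_activity_tgt"] - scores["pred_activity_src"]
        ref_shift = scores["ref_activity_tgt"] - scores["ref_activity_src"]
        return (pred_shift - ref_shift) > 0
    if edit_type == "promoter_retarget":
        # The new TF motif landed in the sequence.
        return bool(target_motif_present)
    raise ValueError(f"unknown edit_type: {edit_type!r}")
```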
25
+
26
+ ## Per-row continuous values
27
+
28
+ | Field | What it measures |
29
+ |---|---|
30
+ | `edit_distance` | Absolute Hamming distance pred ↔ ref, in bp. |
31
+ | `edit_distance_pct` | `edit_distance / len(ref)`: fraction of bases changed. |
32
+ | `pred_activity_src`, `pred_activity_tgt` | Oracle activity scores for `pred` in source / target cell type. |
33
+ | `ref_activity_src`, `ref_activity_tgt` | Oracle activity scores for `ref` in source / target cell type. |
34
+ | `activity_delta_src` | `pred_activity_src − ref_activity_src` (used for `activity_boost`). |
35
+ | `activity_relative_shift` | `(pred_tgt − pred_src) − (ref_tgt − ref_src)` (used for `cell_type_transfer`). |
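As a worked illustration of the derived fields, the sketch below computes them from a `pred`/`ref` pair and four oracle scores. The helper names and the flat `scores` dict are assumptions. How the real scorer treats unequal lengths is not stated here, so this sketch simply refuses them.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        # Assumption: unequal lengths are rejected rather than padded or aligned.
        raise ValueError("hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def continuous_values(pred: str, ref: str, scores: dict) -> dict:
    """Derive the continuous per-row fields (hypothetical helper)."""
    dist = hamming(pred, ref)
    pred_shift = scores["pred_activity_tgt"] - scores["pred_activity_src"]
    ref_shift = scores["ref_activity_tgt"] - scores["ref_activity_src"]
    return {
        "edit_distance": dist,
        "edit_distance_pct": dist / len(ref),
        "activity_delta_src": scores["pred_activity_src"] - scores["ref_activity_src"],
        "activity_relative_shift": pred_shift - ref_shift,
    }
```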
36
+
37
+ ## Aggregate metrics (paper-table column candidates)
38
+
39
+ `mean_*` for each binary flag is the obvious "fraction of rows where the
40
+ flag fired". The interesting non-trivial aggregates are:
41
+
42
+ ### `in_budget_at_5pct` / `in_budget_at_10pct` / `in_budget_at_20pct`
43
+
44
+ **This is the "percentage of edit distance" you asked about.**
45
+
46
+ Definition: fraction of rows where `edit_distance ≤ X% of len(ref)`,
47
+ **ignoring** the row's own `edit_budget`. Lets us compare across rows
48
+ with different assigned budgets.
49
+
50
+ For a 500 bp reference enhancer:
51
+
52
+ | Threshold | Bp budget | Interpretation |
53
+ |---|---|---|
54
+ | `in_budget_at_5pct` | ≤ 25 bp | Minimal, near-surgical edit. |
55
+ | `in_budget_at_10pct` | ≤ 50 bp | Moderate edit. |
56
+ | `in_budget_at_20pct` | ≤ 100 bp | Substantial edit (model is rewriting). |
57
+
58
+ So `in_budget_at_5pct = 0.85` means **85% of the model's edits change ≤
59
+ 5% of the sequence**, i.e. the model is making minimal, focused changes.
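A minimal sketch of the aggregate, assuming each row is a dict carrying `edit_distance` and a hypothetical `ref_len` field (the real results file may store the reference length differently). The threshold is passed as an integer percentage so the comparison stays in integer arithmetic.

```python
def in_budget_at(rows: list, pct: int) -> float:
    """Fraction of rows whose edit distance is at most pct% of the reference length."""
    if not rows:
        raise ValueError("need at least one row")
    # edit_distance <= (pct / 100) * ref_len, rearranged to avoid floats.
    hits = [row["edit_distance"] * 100 <= pct * row["ref_len"] for row in rows]
    return sum(hits) / len(hits)


# Five illustrative rows, all against a 500 bp reference.
rows = [{"edit_distance": d, "ref_len": 500} for d in (12, 25, 26, 60, 120)]
summary = {f"in_budget_at_{p}pct": in_budget_at(rows, p) for p in (5, 10, 20)}
```

The row's own `edit_budget` never appears in this computation, which is what makes the number comparable across rows with different assigned budgets.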
60
+
61
+ Why three thresholds? Different downstream applications care about
62
+ different "small": a SNP-style retarget is OK with 5%, a CRE-style
63
+ rewrite might allow 20%. Reporting all three lets reviewers pick the
64
+ threshold that matches their bias. Paper precedent: Lin et al.,
65
+ NeurIPS 2024.
66
+
67
+ `within_budget` (no `_at_pct` suffix) is **distinct**: it uses the
68
+ row's own assigned `edit_budget` (typically 10 bp absolute).
69
+ `within_budget` and `in_budget_at_5pct` can disagree:
70
+
71
+ * `within_budget=False, in_budget_at_5pct=True`: edit was 12 bp; row's
72
+ budget was 10 (fail) but 5% of 500 = 25 (pass).
73
+ * `within_budget=True, in_budget_at_5pct=False`: only happens when the
74
+ row's budget exceeds 5% of len(ref); rare for our prod data.
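The two disagreement cases can be reproduced with a small helper. The function name and signature are hypothetical, and the second case uses an invented 40 bp budget purely to show the rare direction.

```python
def budget_flags(edit_distance: int, edit_budget: int, ref_len: int) -> tuple:
    """Return (within_budget, in_budget_at_5pct) for one row (hypothetical helper)."""
    within_budget = edit_distance <= edit_budget
    in_budget_at_5pct = edit_distance * 100 <= 5 * ref_len
    return within_budget, in_budget_at_5pct


# 12 bp edit, 10 bp budget, 500 bp reference: fails the row budget, passes 5%.
case_a = budget_flags(12, 10, 500)
# 30 bp edit, 40 bp budget (above 5% of 500 = 25): passes the row budget, fails 5%.
case_b = budget_flags(30, 40, 500)
```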
75
+
76
+ ### `kmer6_diversity`
77
+
78
+ Fraction of unique 6-mers across the predicted sequences for the cohort
79
+ (across rows). Catches the "model collapsed to one motif" failure mode.
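One plausible reading of this definition is distinct 6-mers divided by total 6-mer occurrences, pooled over all predictions. The sketch below implements that reading; the denominator and the function name are assumptions, and the authoritative version is whatever `scripts/eval_t3_oracle.py` does.

```python
def kmer_diversity(seqs: list, k: int = 6) -> float:
    """Distinct k-mers / total k-mer occurrences, pooled across sequences."""
    unique = set()
    total = 0
    for seq in seqs:
        # Slide a window of width k; sequences shorter than k contribute nothing.
        for i in range(len(seq) - k + 1):
            unique.add(seq[i:i + k])
            total += 1
    return len(unique) / total if total else 0.0
```

Under this reading, a cohort that emits the same sequence for every row scores close to zero as the cohort grows, while fully distinct 6-mers score 1.0.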
80
+
81
+ ### `transfer_specificity` (cell_type_transfer rows only)
82
+
83
+ Fraction of `cell_type_transfer` rows where the prediction is **both**:
84
+ * more active in target than in source, **and**
85
+ * more active in target than the reference was in target.
86
+
87
+ Both required because activating the target alone could still fail the
88
+ "transfer" intent (the original could already have been more active in
89
+ target).
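The two-part condition can be sketched as follows, assuming each row is a flat dict with `edit_type` and the oracle activity fields. The layout, the function name, and returning `None` when there are no transfer rows are all assumptions.

```python
def transfer_specificity(rows: list):
    """Fraction of cell_type_transfer rows meeting both conditions (hypothetical helper)."""
    transfer = [r for r in rows if r["edit_type"] == "cell_type_transfer"]
    if not transfer:
        return None  # Assumption: undefined when the cohort has no transfer rows.
    hits = sum(
        # More active in target than in source ...
        r["pred_activity_tgt"] > r["pred_activity_src"]
        # ... and more active in target than the reference was in target.
        and r["pred_activity_tgt"] > r["ref_activity_tgt"]
        for r in transfer
    )
    return hits / len(transfer)
```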
90
+
91
+ ### `mean_target_motif_pwm_present`, `pwm_n_observed`
92
+
93
+ Optional supplementary check using a real PWM scan against
94
+ `--meme-file`. Falls back to None when the meme database isn't on
95
+ disk. Confirms the IUPAC regex match isn't a false positive on a
96
+ random A/C/G/T match.
97
+
98
+ ## Per-cell breakdown
99
+
100
+ `per_cell_type` repeats the aggregates above with rows bucketed by
101
+ `row.metadata.cell_type` (Ex / In / OPC / Ast / Oli / Mic / End). Lets
102
+ us see whether the model is uniformly OK across cells or biased toward
103
+ the over-represented Ex cell type.
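Bucketing a binary flag by cell type could look like the sketch below. It assumes each row stores the flag at the top level and the cell type under `metadata.cell_type`. The function name is hypothetical, and the real `per_cell_type` output repeats every aggregate, not just one flag.

```python
from collections import defaultdict


def per_cell_type(rows: list, flag: str) -> dict:
    """Mean of one binary flag per cell type (hypothetical helper)."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[row["metadata"]["cell_type"]].append(bool(row[flag]))
    return {cell: sum(values) / len(values) for cell, values in buckets.items()}
```

Reporting the bucket sizes alongside these means helps when judging whether a gap between, say, Ex and Mic is real or just a small-sample effect.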
104
+
105
+ ## RFT-specific multi-turn metadata
106
+
107
+ `scripts/rft_t3.py` (with `--rounds R > 1`) adds these fields to each
108
+ output row's `metadata`:
109
+
110
+ | Field | Meaning |
111
+ |---|---|
112
+ | `rft_rounds_used` | How many sampling rounds actually ran for this row (early-stop trims this when an early round already yielded a winner). |
113
+ | `rft_total_candidates` | Total candidates the model produced for this row across all rounds. |
114
+ | `rft_winner_round` | Round index (0-based) that produced the chosen candidate. |
115
+ | `rft_winner_margin` | The objective margin of the chosen candidate (per `edit_type`: `activity_delta_src` for boost, `activity_relative_shift` for transfer, 1.0 for retarget). |
116
+ | `rft_winner_edit_distance` | Hamming distance of the chosen candidate to ref (bp). |
117
+ | `rft_winner_edit_distance_pct` | Same as a fraction of `len(ref)`. |
118
+ | `rft_source` | `"candidate"` if oracle picked a winner; `"heuristic_fallback"` if no candidate satisfied all constraints and we kept the heuristic gold. |
119
+
120
+ Read these to track:
121
+ * keep-rate = fraction with `rft_source=="candidate"`
122
+ * mean rounds-to-success = `mean(rft_rounds_used | rft_source=="candidate")`
123
+ * margin distribution = `rft_winner_margin` histogram
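A rough way to compute these three quantities from the post-RFT JSONL is sketched below. It takes an iterable of JSON lines (for example an open file handle) and assumes each record has a `metadata` object with the `rft_*` fields. The function name and the returned dict keys are hypothetical.

```python
import json
from statistics import mean


def rft_summary(lines) -> dict:
    """Keep-rate, mean rounds-to-success, and winner margins from JSONL lines."""
    metas = [json.loads(line)["metadata"] for line in lines if line.strip()]
    # Rows where the oracle picked a sampled candidate rather than the fallback.
    winners = [m for m in metas if m["rft_source"] == "candidate"]
    return {
        "keep_rate": len(winners) / len(metas) if metas else None,
        "mean_rounds_to_success": (
            mean(m["rft_rounds_used"] for m in winners) if winners else None
        ),
        # Raw values, ready to feed into any histogram tool.
        "margins": [m["rft_winner_margin"] for m in winners],
    }
```

Typical usage would be `with open(path) as fh: summary = rft_summary(fh)`, where `path` points at the `rft_filtered_train.jsonl` for the run of interest.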
124
+
125
+ ## Where to look
126
+
127
+ | File | What's in it |
128
+ |---|---|
129
+ | `runs/exp_t3_*/predict_t3_{raw,enriched}/genqual/genqual_t3_oracle.json` | Aggregate + per-cell oracle metrics for the trained adapter on the test set. |
130
+ | `runs/exp_t3_grid_*/zs_{raw,enriched}/genqual/genqual_t3_oracle.json` | Same metrics for the zero-shot LLM baseline. |
131
+ | `runs/exp_t3_fusion_sft_${STAMP}/rft_filtered_train.jsonl` | Post-RFT training JSONL. Inspect `metadata.rft_*` fields per row to see how many rounds each row needed and the winner margins. |