Upload docs/t3_metrics_quickref.md with huggingface_hub
# T3 metrics: quick reference

This is the one-pager you read when looking at a T3 results file.
Authoritative implementation: `scripts/eval_t3_oracle.py`. Design
rationale (why these and not heuristic-overlap): `t3_evaluation_design.md`.

## Per-row binary outcomes

For each test row the scorer emits these flags:

| Flag | Definition | Notes |
|---|---|---|
| `within_budget` | `hamming(pred, ref) ≤ row.metadata.edit_budget` | Row's own assigned budget, typically 10 bp absolute. |
| `length_preserved` | `len(pred) == len(ref)` | Reference is usually 500 bp. |
| `target_motif_present` | IUPAC regex match for `row.metadata.target_motif` in `pred` | Forward + reverse-complement scan. |
| `objective_success` | Depends on `row.metadata.edit_type` (see below). | The headline "did this edit do its job" flag. |

`objective_success` per `edit_type`:

| `edit_type` | `objective_success` true iff … |
|---|---|
| `activity_boost` | `pred_activity_src > ref_activity_src` (oracle activity in source cell type went up) |
| `cell_type_transfer` | `(pred_tgt - pred_src) - (ref_tgt - ref_src) > 0` (relative shift toward target cell type increased) |
| `promoter_retarget` | `target_motif_present` (the new TF motif landed in the sequence) |

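As a rough illustration of how the motif flag and the per-`edit_type` objective fit together, here is a minimal sketch. It is not the code in `scripts/eval_t3_oracle.py`: the helper names (`motif_present`, `objective_success`) and the argument layout are assumptions made for this example, and only the comparisons themselves come from the tables above.

```python
import re

# IUPAC nucleotide codes expanded to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an A/C/G/T sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def motif_present(motif: str, seq: str) -> bool:
    """Search an IUPAC motif on the forward strand and the reverse complement."""
    pattern = re.compile("".join(IUPAC[c] for c in motif.upper()))
    seq = seq.upper()
    return bool(pattern.search(seq) or pattern.search(reverse_complement(seq)))


def objective_success(edit_type, pred_src, pred_tgt, ref_src, ref_tgt, motif_hit):
    """Dispatch on edit_type, mirroring the table above (hypothetical helper)."""
    if edit_type == "activity_boost":
        return pred_src > ref_src
    if edit_type == "cell_type_transfer":
        return (pred_tgt - pred_src) - (ref_tgt - ref_src) > 0
    if edit_type == "promoter_retarget":
        return motif_hit
    raise ValueError(f"unknown edit_type: {edit_type}")


# E-box motif CANNTG found on the forward strand.
forward_hit = motif_present("CANNTG", "TTCAGCTGTT")
# GATA is only visible after reverse-complementing AATATCAA (-> TTGATATT).
reverse_hit = motif_present("GATA", "AATATCAA")
```

Compiling the motif once per row and scanning both strands keeps the check strand-agnostic, which matches the "forward + reverse-complement" note in the table.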
## Per-row continuous values

| Field | What it measures |
|---|---|
| `edit_distance` | Absolute Hamming distance between `pred` and `ref`, in bp. |
| `edit_distance_pct` | `edit_distance / len(ref)`: fraction of bases changed. |
| `pred_activity_src`, `pred_activity_tgt` | Oracle activity scores for `pred` in source / target cell type. |
| `ref_activity_src`, `ref_activity_tgt` | Oracle activity scores for `ref` in source / target cell type. |
| `activity_delta_src` | `pred_activity_src - ref_activity_src` (used for `activity_boost`). |
| `activity_relative_shift` | `(pred_tgt - pred_src) - (ref_tgt - ref_src)` (used for `cell_type_transfer`). |

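The derived fields follow directly from the sequences and the four oracle scores. The sketch below shows the arithmetic only; the function name `continuous_values` is hypothetical, and it assumes `pred` and `ref` have equal length (how the real scorer treats a length mismatch is not specified here).

```python
def continuous_values(pred, ref, pred_src, pred_tgt, ref_src, ref_tgt):
    """Compute the derived per-row fields from sequences and oracle scores.

    Assumes len(pred) == len(ref); Hamming distance is undefined otherwise.
    """
    if len(pred) != len(ref):
        raise ValueError("this sketch only handles equal-length sequences")
    edit_distance = sum(p != r for p, r in zip(pred, ref))
    return {
        "edit_distance": edit_distance,
        "edit_distance_pct": edit_distance / len(ref),
        "activity_delta_src": pred_src - ref_src,
        "activity_relative_shift": (pred_tgt - pred_src) - (ref_tgt - ref_src),
    }


# Toy 10 bp example with two substitutions (positions 4 and 9).
values = continuous_values(
    pred="ACGTTCGTAA", ref="ACGTACGTAC",
    pred_src=0.5, pred_tgt=1.0, ref_src=0.25, ref_tgt=0.5,
)
```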
## Aggregate metrics (paper-table column candidates)

`mean_*` for each binary flag is the obvious "fraction of rows where the
flag fired". The interesting non-trivial aggregates are:

### `in_budget_at_5pct` / `in_budget_at_10pct` / `in_budget_at_20pct`

**This is the "percentage of edit distance" you asked about.**

Definition: fraction of rows where `edit_distance ≤ X% of len(ref)`,
**ignoring** the row's own `edit_budget`. Lets us compare across rows
with different assigned budgets.

For a 500 bp reference enhancer:

| Threshold | Bp budget | Interpretation |
|---|---|---|
| `in_budget_at_5pct` | ≤ 25 bp | Minimal, near-surgical edit. |
| `in_budget_at_10pct` | ≤ 50 bp | Moderate edit. |
| `in_budget_at_20pct` | ≤ 100 bp | Substantial edit (model is rewriting). |

So `in_budget_at_5pct = 0.85` means **85% of the model's edits change ≤
5% of the sequence**, i.e. the model is making minimal, focused changes.

Why three thresholds? Different downstream applications care about
different notions of "small": a SNP-style retarget is OK with 5%, a CRE-style
rewrite might allow 20%. Reporting all three lets reviewers pick the
threshold that matches their bias. Paper precedent: Lin et al.,
NeurIPS 2024.

`within_budget` (no `_at_pct` suffix) is **distinct**: it uses the
row's own assigned `edit_budget` (typically 10 bp absolute).
`within_budget` and `in_budget_at_5pct` can disagree:

* `within_budget=False, in_budget_at_5pct=True`: edit was 12 bp; row's
  budget was 10 (fail) but 5% of 500 = 25 (pass).
* `within_budget=True, in_budget_at_5pct=False`: only happens when the
  row's budget exceeds 5% of `len(ref)`; rare for our prod data.

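To make the difference between the percentage thresholds and the per-row budget concrete, here is a small sketch. The row layout (`edit_distance`, `ref_len`, `edit_budget` keys) and the function name `in_budget_at` are assumptions for illustration, not the scorer's actual data model.

```python
def in_budget_at(rows, pct):
    """Fraction of rows whose edit distance is at most pct% of len(ref).

    Uses integer arithmetic (distance * 100 <= pct * length) so that the
    boundary case, e.g. exactly 25 bp out of 500 at 5%, is not subject to
    floating-point rounding.
    """
    hits = [r["edit_distance"] * 100 <= pct * r["ref_len"] for r in rows]
    return sum(hits) / len(hits)


# Four toy rows on a 500 bp reference, each with a 10 bp assigned budget.
rows = [
    {"edit_distance": 12, "ref_len": 500, "edit_budget": 10},
    {"edit_distance": 25, "ref_len": 500, "edit_budget": 10},
    {"edit_distance": 26, "ref_len": 500, "edit_budget": 10},
    {"edit_distance": 60, "ref_len": 500, "edit_budget": 10},
]

at_5 = in_budget_at(rows, 5)    # 12 and 25 pass
at_10 = in_budget_at(rows, 10)  # 12, 25 and 26 pass
at_20 = in_budget_at(rows, 20)  # all four pass

# The first row reproduces the disagreement case: over its own 10 bp
# budget, but inside the 5% threshold.
first = rows[0]
within_budget = first["edit_distance"] <= first["edit_budget"]
```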
### `kmer6_diversity`

Fraction of unique 6-mers across the predicted sequences for the cohort
(across rows). Catches the "model collapsed to one motif" failure mode.

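One plausible reading of this metric is "distinct 6-mers divided by total 6-mer windows, pooled over all predictions". The sketch below implements that reading; the exact normalization used by `scripts/eval_t3_oracle.py` may differ, so treat the denominator as an assumption.

```python
def kmer_diversity(seqs, k=6):
    """Distinct k-mers / total k-mer windows, pooled across sequences.

    Assumed definition for illustration; returns 0.0 when no sequence is
    long enough to contain a k-mer.
    """
    seen = set()
    total = 0
    for seq in seqs:
        for i in range(len(seq) - k + 1):
            seen.add(seq[i:i + k])
            total += 1
    return len(seen) / total if total else 0.0


# A homopolymer yields one distinct 6-mer over three windows.
collapsed = kmer_diversity(["AAAAAAAA"])
# Two identical predictions halve the score relative to a single copy.
duplicated = kmer_diversity(["ACGTACGT", "ACGTACGT"])
```

Under this definition a cohort that keeps emitting the same sequence or motif scores close to 0, while a cohort of varied sequences scores closer to 1.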
### `transfer_specificity` (cell_type_transfer rows only)

Fraction of `cell_type_transfer` rows where the prediction is **both**:

* more active in target than in source, **and**
* more active in target than the reference was in target.

Both are required because activating the target alone could still fail the
"transfer" intent (the original could already have been more active in
target).

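The two conditions can be expressed as a single conjunction per row. This is a sketch under assumptions: the flat row dictionaries, the strict `>` comparisons, and returning `None` when there are no transfer rows are illustrative choices, not confirmed behaviour of the scorer.

```python
def transfer_specificity(rows):
    """Fraction of cell_type_transfer rows satisfying both conditions."""
    transfer_rows = [r for r in rows if r["edit_type"] == "cell_type_transfer"]
    if not transfer_rows:
        return None  # assumption: undefined when no transfer rows exist
    hits = [
        r["pred_activity_tgt"] > r["pred_activity_src"]      # target beats source
        and r["pred_activity_tgt"] > r["ref_activity_tgt"]   # target beats reference
        for r in transfer_rows
    ]
    return sum(hits) / len(hits)


rows = [
    # Passes both conditions.
    {"edit_type": "cell_type_transfer",
     "pred_activity_src": 0.25, "pred_activity_tgt": 1.0, "ref_activity_tgt": 0.5},
    # Target beats source, but the reference was already higher in target.
    {"edit_type": "cell_type_transfer",
     "pred_activity_src": 0.25, "pred_activity_tgt": 0.5, "ref_activity_tgt": 0.75},
    # Ignored: not a transfer row.
    {"edit_type": "activity_boost",
     "pred_activity_src": 1.0, "pred_activity_tgt": 0.0, "ref_activity_tgt": 0.0},
]
specificity = transfer_specificity(rows)
```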
### `mean_target_motif_pwm_present`, `pwm_n_observed`

Optional supplementary check using a real PWM scan against
`--meme-file`. Falls back to `None` when the MEME database isn't on
disk. Confirms the IUPAC regex match isn't a false positive on a
random A/C/G/T match.

## Per-cell breakdown

`per_cell_type` repeats the aggregates above with rows bucketed by
`row.metadata.cell_type` (Ex / In / OPC / Ast / Oli / Mic / End). Lets
us see whether the model is uniformly OK across cell types or biased toward
the over-represented Ex cell type.

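The bucketing itself is a plain group-by on the cell type. The sketch below shows it for a single binary flag; the row layout (flag at the top level, `cell_type` under `metadata`) and the name `per_cell_mean` are assumptions for this example.

```python
from collections import defaultdict


def per_cell_mean(rows, flag):
    """Mean of a binary flag per cell type (hypothetical helper)."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[row["metadata"]["cell_type"]].append(bool(row[flag]))
    return {cell: sum(vals) / len(vals) for cell, vals in buckets.items()}


rows = [
    {"metadata": {"cell_type": "Ex"}, "objective_success": True},
    {"metadata": {"cell_type": "Ex"}, "objective_success": False},
    {"metadata": {"cell_type": "Mic"}, "objective_success": True},
]
by_cell = per_cell_mean(rows, "objective_success")
```

Comparing the per-cell values against the bucket sizes is what reveals a bias toward the over-represented type: a high global mean can hide a weak score on a small bucket.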
## RFT-specific multi-turn metadata

`scripts/rft_t3.py` (with `--rounds R > 1`) adds these fields to each
output row's `metadata`:

| Field | Meaning |
|---|---|
| `rft_rounds_used` | How many sampling rounds actually ran for this row (early-stop trims this when an early round already yielded a winner). |
| `rft_total_candidates` | Total candidates the model produced for this row across all rounds. |
| `rft_winner_round` | Round index (0-based) that produced the chosen candidate. |
| `rft_winner_margin` | The objective margin of the chosen candidate (per `edit_type`: `activity_delta_src` for boost, `activity_relative_shift` for transfer, 1.0 for retarget). |
| `rft_winner_edit_distance` | Hamming distance of the chosen candidate to ref (bp). |
| `rft_winner_edit_distance_pct` | Same as a fraction of `len(ref)`. |
| `rft_source` | `"candidate"` if oracle picked a winner; `"heuristic_fallback"` if no candidate satisfied all constraints and we kept the heuristic gold. |

Read these to track:

* keep-rate = fraction with `rft_source=="candidate"`
* mean rounds-to-success = `mean(rft_rounds_used | rft_source=="candidate")`
* margin distribution = `rft_winner_margin` histogram

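The three tracking quantities can be pulled out of the row metadata in a few lines. This sketch assumes the rows are already loaded as dictionaries with a `metadata` key; `rft_summary` is a hypothetical helper, and it only reads `rft_winner_margin` on rows whose `rft_source` is `"candidate"`, since it is not stated here what fallback rows carry in that field.

```python
from statistics import mean


def rft_summary(rows):
    """Keep-rate, mean rounds-to-success, and winner margins for RFT rows."""
    metas = [row["metadata"] for row in rows]
    kept = [m for m in metas if m["rft_source"] == "candidate"]
    return {
        "keep_rate": len(kept) / len(metas) if metas else None,
        "mean_rounds_to_success": (
            mean(m["rft_rounds_used"] for m in kept) if kept else None
        ),
        # Raw margins; bin these to get the histogram.
        "margins": [m["rft_winner_margin"] for m in kept],
    }


rows = [
    {"metadata": {"rft_source": "candidate", "rft_rounds_used": 1, "rft_winner_margin": 0.5}},
    {"metadata": {"rft_source": "candidate", "rft_rounds_used": 2, "rft_winner_margin": 0.25}},
    {"metadata": {"rft_source": "candidate", "rft_rounds_used": 3, "rft_winner_margin": 1.0}},
    {"metadata": {"rft_source": "heuristic_fallback", "rft_rounds_used": 3}},
]
summary = rft_summary(rows)
```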
## Where to look

| File | What's in it |
|---|---|
| `runs/exp_t3_*/predict_t3_{raw,enriched}/genqual/genqual_t3_oracle.json` | Aggregate + per-cell oracle metrics for the trained adapter on the test set. |
| `runs/exp_t3_grid_*/zs_{raw,enriched}/genqual/genqual_t3_oracle.json` | Same metrics for the zero-shot LLM baseline. |
| `runs/exp_t3_fusion_sft_${STAMP}/rft_filtered_train.jsonl` | Post-RFT training JSONL. Inspect `metadata.rft_*` fields per row to see how many rounds each row needed and the winner margins. |