# T3 metrics: quick reference
This is the one-pager you read when looking at a T3 results file.
Authoritative implementation: `scripts/eval_t3_oracle.py`. Design
rationale (why these and not heuristic-overlap): `t3_evaluation_design.md`.
## Per-row binary outcomes
For each test row the scorer emits these flags:
| Flag | Definition | Notes |
|---|---|---|
| `within_budget` | `hamming(pred, ref) ≤ row.metadata.edit_budget` | Row's own assigned budget, typically 10 bp absolute. |
| `length_preserved` | `len(pred) == len(ref)` | Reference is usually 500 bp. |
| `target_motif_present` | IUPAC regex match for `row.metadata.target_motif` in `pred` | Forward + reverse-complement scan. |
| `objective_success` | Depends on `row.metadata.edit_type` (see below). | The headline "did this edit do its job" flag. |
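
As a rough illustration of the first three flags, here is a minimal Python sketch. It is not the code in `scripts/eval_t3_oracle.py`: the helper names (`hamming`, `motif_present`, `row_flags`) are hypothetical, and the handling of unequal lengths in `hamming` (mismatches over the shared prefix plus the length difference) is an assumption, since the table does not say how the scorer treats them.

```python
import re

# IUPAC nucleotide codes expanded to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
# Complement table covering the ambiguity codes as well as A/C/G/T.
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def hamming(a: str, b: str) -> int:
    """Mismatches over the shared prefix plus the length difference (assumed convention)."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


def reverse_complement(seq: str) -> str:
    return seq.upper().translate(COMPLEMENT)[::-1]


def motif_present(motif: str, seq: str) -> bool:
    """True if the IUPAC motif matches seq on the forward or reverse-complement strand."""
    pattern = re.compile("".join(IUPAC[c] for c in motif.upper()))
    seq = seq.upper()
    return bool(pattern.search(seq) or pattern.search(reverse_complement(seq)))


def row_flags(pred: str, ref: str, edit_budget: int, target_motif: str) -> dict:
    """Hypothetical helper: the three flags that do not need the oracle."""
    return {
        "within_budget": hamming(pred, ref) <= edit_budget,
        "length_preserved": len(pred) == len(ref),
        "target_motif_present": motif_present(target_motif, pred),
    }
```

For example, `motif_present("TTCGN", "AAACGAA")` is true only because of the reverse-complement scan: the forward strand has no `TTCG`, but the reverse complement `TTCGTTT` does.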
`objective_success` per `edit_type`:
| `edit_type` | `objective_success` true iff … |
|---|---|
| `activity_boost` | `pred_activity_src > ref_activity_src` (oracle activity in source cell type went up) |
| `cell_type_transfer` | `(pred_tgt - pred_src) - (ref_tgt - ref_src) > 0` (relative shift toward target cell type increased) |
| `promoter_retarget` | `target_motif_present` (the new TF motif landed in the sequence) |
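
The dispatch on `edit_type` can be sketched as follows. This is an illustrative reading of the table, not the scorer itself: the function takes a flat dict keyed by the per-row field names used in this document, which is an assumption about the row layout.

```python
def objective_success(edit_type: str, row: dict) -> bool:
    """Sketch of the per-edit_type success rule; `row` is an assumed flat dict."""
    if edit_type == "activity_boost":
        # Oracle activity in the source cell type went up.
        return row["pred_activity_src"] > row["ref_activity_src"]
    if edit_type == "cell_type_transfer":
        # Target-minus-source gap of the prediction vs. that of the reference.
        pred_gap = row["pred_activity_tgt"] - row["pred_activity_src"]
        ref_gap = row["ref_activity_tgt"] - row["ref_activity_src"]
        return (pred_gap - ref_gap) > 0
    if edit_type == "promoter_retarget":
        # Success is simply the motif flag.
        return bool(row["target_motif_present"])
    raise ValueError(f"unknown edit_type: {edit_type!r}")
```

Note that a `cell_type_transfer` row can succeed even when target activity drops, as long as source activity drops by more; `transfer_specificity` is the stricter check.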
## Per-row continuous values
| Field | What it measures |
|---|---|
| `edit_distance` | Absolute Hamming distance between `pred` and `ref`, in bp. |
| `edit_distance_pct` | `edit_distance / len(ref)`: fraction of bases changed. |
| `pred_activity_src`, `pred_activity_tgt` | Oracle activity scores for `pred` in source / target cell type. |
| `ref_activity_src`, `ref_activity_tgt` | Oracle activity scores for `ref` in source / target cell type. |
| `activity_delta_src` | `pred_activity_src - ref_activity_src` (used for `activity_boost`). |
| `activity_relative_shift` | `(pred_tgt - pred_src) - (ref_tgt - ref_src)` (used for `cell_type_transfer`). |
## Aggregate metrics (paper-table column candidates)
`mean_*` for each binary flag is the obvious "fraction of rows where the
flag fired". The interesting non-trivial aggregates are:
### `in_budget_at_5pct` / `in_budget_at_10pct` / `in_budget_at_20pct`
**This is the "percentage of edit distance" you asked about.**
Definition: fraction of rows where `edit_distance ≤ X% of len(ref)`,
**ignoring** the row's own `edit_budget`. Lets us compare across rows
with different assigned budgets.
For a 500 bp reference enhancer:
| Threshold | Bp budget | Interpretation |
|---|---|---|
| `in_budget_at_5pct` | ≤ 25 bp | Minimal, near-surgical edit. |
| `in_budget_at_10pct` | ≤ 50 bp | Moderate edit. |
| `in_budget_at_20pct` | ≤ 100 bp | Substantial edit (model is rewriting). |
So `in_budget_at_5pct = 0.85` means **85% of the model's edits change ≤
5% of the sequence**, i.e. the model is making minimal, focused changes.
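
A small worked example may help here. The sketch below computes the percentage-threshold aggregate over a toy cohort. The function name `in_budget_at_pct` and the `ref_len` key are hypothetical, and the comparison is done in integers to sidestep floating-point edge cases at the threshold.

```python
def in_budget_at_pct(rows: list[dict], pct: int) -> float:
    """Fraction of rows whose edit_distance is at most pct% of the reference length.

    Deliberately ignores each row's own edit_budget.
    """
    if not rows:
        return float("nan")
    hits = sum(r["edit_distance"] * 100 <= pct * r["ref_len"] for r in rows)
    return hits / len(rows)


# Toy cohort: four 500 bp references, each with an assigned budget of 10 bp.
rows = [
    {"edit_distance": 8, "ref_len": 500, "edit_budget": 10},
    {"edit_distance": 12, "ref_len": 500, "edit_budget": 10},
    {"edit_distance": 40, "ref_len": 500, "edit_budget": 10},
    {"edit_distance": 90, "ref_len": 500, "edit_budget": 10},
]

at_5 = in_budget_at_pct(rows, 5)    # 8 and 12 bp pass the 25 bp cutoff
at_10 = in_budget_at_pct(rows, 10)  # 40 bp also passes the 50 bp cutoff
at_20 = in_budget_at_pct(rows, 20)  # all four pass the 100 bp cutoff
mean_within_budget = sum(r["edit_distance"] <= r["edit_budget"] for r in rows) / len(rows)
```

On this cohort the three thresholds give 0.5, 0.75 and 1.0, while the mean of the per-row `within_budget` flag is only 0.25: the 12 bp row misses its own 10 bp budget but still counts at the 5% threshold.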
Why three thresholds? Different downstream applications care about
different "small": a SNP-style retarget is OK with 5%, a CRE-style
rewrite might allow 20%. Reporting all three lets reviewers pick the
threshold that matches their bias. Paper precedent: Lin et al.,
NeurIPS 2024.
`within_budget` (no `_at_pct` suffix) is **distinct**: it uses the
row's own assigned `edit_budget` (typically 10 bp absolute).
`within_budget` and `in_budget_at_5pct` can disagree:
* `within_budget=False, in_budget_at_5pct=True`: edit was 12 bp; row's
budget was 10 (fail) but 5% of 500 = 25 (pass).
* `within_budget=True, in_budget_at_5pct=False`: only happens when the
row's budget exceeds 5% of len(ref); rare for our prod data.
### `kmer6_diversity`
Fraction of unique 6-mers across the predicted sequences for the cohort
(across rows). Catches the "model collapsed to one motif" failure mode.
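
One plausible reading of this metric, sketched in Python: count distinct 6-mers across all predictions and divide by the total number of 6-mer windows. The normalisation is an assumption; the scorer may divide by something else (for instance the 4^6 possible 6-mers), so check `scripts/eval_t3_oracle.py` before comparing numbers. The function name `kmer_diversity` is hypothetical.

```python
def kmer_diversity(seqs: list[str], k: int = 6) -> float:
    """Distinct k-mers divided by total k-mer windows, pooled over all sequences."""
    unique = set()
    total = 0
    for seq in seqs:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            unique.add(seq[i:i + k])
            total += 1
    return len(unique) / total if total else 0.0
```

Under this definition a cohort of near-identical or highly repetitive predictions scores close to 0, which is the collapse signal the metric is meant to catch.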
### `transfer_specificity` (cell_type_transfer rows only)
Fraction of `cell_type_transfer` rows where the prediction is **both**:
* more active in target than in source, **and**
* more active in target than the reference was in target.
Both required because activating the target alone could still fail the
"transfer" intent (the original could already have been more active in
target).
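
The two conditions can be written out as a short sketch. It assumes each row is a flat dict carrying `edit_type` and the oracle activity fields under the names used in this document; the real rows may nest `edit_type` under `metadata`.

```python
from typing import Optional


def transfer_specificity(rows: list[dict]) -> Optional[float]:
    """Fraction of cell_type_transfer rows meeting both conditions; None if there are none."""
    transfer = [r for r in rows if r["edit_type"] == "cell_type_transfer"]
    if not transfer:
        return None
    hits = sum(
        r["pred_activity_tgt"] > r["pred_activity_src"]      # target beats source
        and r["pred_activity_tgt"] > r["ref_activity_tgt"]   # and beats the reference in target
        for r in transfer
    )
    return hits / len(transfer)
```

Rows of other edit types are filtered out before the denominator is computed, so the value is a fraction of transfer rows only.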
### `mean_target_motif_pwm_present`, `pwm_n_observed`
Optional supplementary check using a real PWM scan against
`--meme-file`. Falls back to None when the meme database isn't on
disk. Confirms the IUPAC regex match isn't a false positive on a
random A/C/G/T match.
## Per-cell breakdown
`per_cell_type` repeats the aggregates above with rows bucketed by
`row.metadata.cell_type` (Ex / In / OPC / Ast / Oli / Mic / End). Lets
us see whether the model is uniformly OK across cells or biased toward
the over-represented Ex cell type.
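
The bucketing itself is simple. A minimal sketch, assuming flat row dicts with a `cell_type` key and boolean flags (the helper name `per_cell_mean` is hypothetical):

```python
from collections import defaultdict


def per_cell_mean(rows: list[dict], flag: str) -> dict[str, float]:
    """Mean of a binary flag, computed separately for each cell type."""
    buckets = defaultdict(list)
    for r in rows:
        buckets[r["cell_type"]].append(bool(r[flag]))
    return {cell: sum(vals) / len(vals) for cell, vals in buckets.items()}
```

Comparing, say, `per_cell_mean(rows, "objective_success")["Ex"]` against the rarer cell types is a quick way to spot the bias described above; keep the per-bucket row counts in view, since small buckets give noisy means.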
## RFT-specific multi-turn metadata
`scripts/rft_t3.py` (with `--rounds R > 1`) adds these fields to each
output row's `metadata`:
| Field | Meaning |
|---|---|
| `rft_rounds_used` | How many sampling rounds actually ran for this row (early-stop trims this when an early round already yielded a winner). |
| `rft_total_candidates` | Total candidates the model produced for this row across all rounds. |
| `rft_winner_round` | Round index (0-based) that produced the chosen candidate. |
| `rft_winner_margin` | The objective margin of the chosen candidate (per `edit_type`: `activity_delta_src` for boost, `activity_relative_shift` for transfer, 1.0 for retarget). |
| `rft_winner_edit_distance` | Hamming distance of the chosen candidate to ref (bp). |
| `rft_winner_edit_distance_pct` | Same as a fraction of `len(ref)`. |
| `rft_source` | `"candidate"` if oracle picked a winner; `"heuristic_fallback"` if no candidate satisfied all constraints and we kept the heuristic gold. |
Read these to track:
* keep-rate = fraction with `rft_source=="candidate"`
* mean rounds-to-success = `mean(rft_rounds_used | rft_source=="candidate")`
* margin distribution = `rft_winner_margin` histogram
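
These three summaries can be pulled from the post-RFT JSONL with a few lines of Python. The sketch below assumes one JSON object per line with a `metadata` dict holding the `rft_*` fields. The helper names are hypothetical, and restricting the margins to rows with `rft_source == "candidate"` is a choice made here, since fallback rows may not carry a meaningful margin.

```python
import json
from statistics import mean


def load_metadata(path: str) -> list[dict]:
    """Read the `metadata` dict from every non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line)["metadata"] for line in fh if line.strip()]


def rft_summary(metadatas: list[dict]) -> dict:
    """Keep-rate, mean rounds-to-success, and the list of winner margins."""
    kept = [m for m in metadatas if m.get("rft_source") == "candidate"]
    return {
        "keep_rate": len(kept) / len(metadatas) if metadatas else None,
        "mean_rounds_to_success": mean(m["rft_rounds_used"] for m in kept) if kept else None,
        "margins": [m["rft_winner_margin"] for m in kept],
    }
```

The `margins` list can be passed straight to any histogram routine.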
## Where to look
| File | What's in it |
|---|---|
| `runs/exp_t3_*/predict_t3_{raw,enriched}/genqual/genqual_t3_oracle.json` | Aggregate + per-cell oracle metrics for the trained adapter on the test set. |
| `runs/exp_t3_grid_*/zs_{raw,enriched}/genqual/genqual_t3_oracle.json` | Same metrics for the zero-shot LLM baseline. |
| `runs/exp_t3_fusion_sft_${STAMP}/rft_filtered_train.jsonl` | Post-RFT training JSONL. Inspect `metadata.rft_*` fields per row to see how many rounds each row needed and the winner margins. |