# T3 metrics: quick reference
|
|
This is the one-pager you read when looking at a T3 results file.
Authoritative implementation: `scripts/eval_t3_oracle.py`. Design
rationale (why these and not heuristic-overlap): `t3_evaluation_design.md`.
|
|
## Per-row binary outcomes
|
|
For each test row the scorer emits these flags:
|
|
| Flag | Definition | Notes |
|---|---|---|
| `within_budget` | `hamming(pred, ref) <= row.metadata.edit_budget` | Row's own assigned budget, typically 10 bp absolute. |
| `length_preserved` | `len(pred) == len(ref)` | Reference is usually 500 bp. |
| `target_motif_present` | IUPAC regex match for `row.metadata.target_motif` in `pred` | Forward + reverse-complement scan. |
| `objective_success` | Depends on `row.metadata.edit_type` (see below). | The headline "did this edit do its job" flag. |
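A minimal sketch of how the first three flags can be computed. The helper names (`hamming`, `motif_present`, `row_flags`) are illustrative, not the actual API of `scripts/eval_t3_oracle.py`:

```python
import re

# IUPAC nucleotide codes -> regex character classes
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "[AG]", "Y": "[CT]",
         "S": "[CG]", "W": "[AT]", "K": "[GT]", "M": "[AC]", "B": "[CGT]",
         "D": "[AGT]", "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]"}
COMP = str.maketrans("ACGT", "TGCA")

def hamming(a: str, b: str) -> int:
    # Hamming distance is only defined for equal-length strings.
    return sum(x != y for x, y in zip(a, b))

def motif_present(seq: str, motif: str) -> bool:
    # Scan both the forward strand and the reverse complement.
    pattern = "".join(IUPAC[c] for c in motif.upper())
    rc = seq.translate(COMP)[::-1]
    return bool(re.search(pattern, seq) or re.search(pattern, rc))

def row_flags(pred: str, ref: str, edit_budget: int, target_motif: str) -> dict:
    return {
        "length_preserved": len(pred) == len(ref),
        "within_budget": len(pred) == len(ref) and hamming(pred, ref) <= edit_budget,
        "target_motif_present": motif_present(pred, target_motif),
    }
```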
|
|
`objective_success` per `edit_type`:
|
|
| `edit_type` | `objective_success` true iff … |
|---|---|
| `activity_boost` | `pred_activity_src > ref_activity_src` (oracle activity in source cell type went up) |
| `cell_type_transfer` | `(pred_tgt - pred_src) - (ref_tgt - ref_src) > 0` (relative shift toward target cell type increased) |
| `promoter_retarget` | `target_motif_present` (the new TF motif landed in the sequence) |
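The dispatch in the table above can be sketched as a single function. This assumes the four oracle scores have already been computed; the function name is illustrative:

```python
def objective_success(edit_type: str,
                      pred_src: float, pred_tgt: float,
                      ref_src: float, ref_tgt: float,
                      target_motif_present: bool) -> bool:
    # One rule per edit_type, mirroring the table above.
    if edit_type == "activity_boost":
        return pred_src > ref_src
    if edit_type == "cell_type_transfer":
        return (pred_tgt - pred_src) - (ref_tgt - ref_src) > 0
    if edit_type == "promoter_retarget":
        return target_motif_present
    raise ValueError(f"unknown edit_type: {edit_type}")
```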
|
|
## Per-row continuous values
|
|
| Field | What it measures |
|---|---|
| `edit_distance` | Absolute Hamming distance between `pred` and `ref`, in bp. |
| `edit_distance_pct` | `edit_distance / len(ref)`, i.e. the fraction of bases changed. |
| `pred_activity_src`, `pred_activity_tgt` | Oracle activity scores for `pred` in source / target cell type. |
| `ref_activity_src`, `ref_activity_tgt` | Oracle activity scores for `ref` in source / target cell type. |
| `activity_delta_src` | `pred_activity_src - ref_activity_src` (used for `activity_boost`). |
| `activity_relative_shift` | `(pred_tgt - pred_src) - (ref_tgt - ref_src)` (used for `cell_type_transfer`). |
|
|
## Aggregate metrics (paper-table column candidates)
|
|
`mean_*` for each binary flag is the obvious "fraction of rows where the
flag fired". The interesting non-trivial aggregates are:
|
|
### `in_budget_at_5pct` / `in_budget_at_10pct` / `in_budget_at_20pct`

**This is the "percentage of edit distance" you asked about.**

Definition: fraction of rows where `edit_distance <= X% of len(ref)`,
**ignoring** the row's own `edit_budget`. Lets us compare across rows
with different assigned budgets.
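A sketch of the aggregate, assuming rows are dicts carrying `edit_distance` and a `ref_len` field (field names are illustrative):

```python
def in_budget_at(rows, pct: float) -> float:
    """Fraction of rows whose edit distance is <= pct of the reference length.

    The row's own `edit_budget` is deliberately ignored, so rows with
    different assigned budgets are comparable on one scale.
    """
    hits = sum(r["edit_distance"] <= pct * r["ref_len"] for r in rows)
    return hits / len(rows) if rows else 0.0
```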
|
|
For a 500 bp reference enhancer:
|
|
| Threshold | bp budget | Interpretation |
|---|---|---|
| `in_budget_at_5pct` | ≤ 25 bp | Minimal, near-surgical edit. |
| `in_budget_at_10pct` | ≤ 50 bp | Moderate edit. |
| `in_budget_at_20pct` | ≤ 100 bp | Substantial edit (model is rewriting). |
|
|
So `in_budget_at_5pct = 0.85` means **85% of the model's edits change ≤
5% of the sequence**, i.e. the model is making minimal, focused changes.
|
|
Why three thresholds? Different downstream applications care about
different "small": a SNP-style retarget is OK with 5%, a CRE-style
rewrite might allow 20%. Reporting all three lets reviewers pick the
threshold that matches their bias. Paper precedent: Lin et al.,
NeurIPS 2024.
|
|
`within_budget` (no `_at_pct` suffix) is **distinct**: it uses the
row's own assigned `edit_budget` (typically 10 bp absolute).
`within_budget` and `in_budget_at_5pct` can disagree:
|
|
* `within_budget=False, in_budget_at_5pct=True`: the edit was 12 bp; the
  row's budget was 10 (fail) but 5% of 500 = 25 (pass).
* `within_budget=True, in_budget_at_5pct=False`: only happens when the
  row's budget exceeds 5% of `len(ref)`; rare for our prod data.
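The first disagreement case can be checked with the numbers from the bullet directly:

```python
edit_distance, ref_len = 12, 500
edit_budget = 10  # the row's own absolute budget

within_budget = edit_distance <= edit_budget          # 12 <= 10 -> fails
in_budget_at_5pct = edit_distance <= 0.05 * ref_len   # 12 <= 25 -> passes
```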
|
|
### `kmer6_diversity`

Fraction of unique 6-mers across the predicted sequences for the cohort
(pooled across rows). Catches the "model collapsed to one motif"
failure mode.
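One plausible reading of the definition, shown as a sketch: pool every 6-mer window across the cohort's predictions and take unique over total. The exact denominator the script uses is an assumption here:

```python
def kmer_diversity(seqs, k: int = 6) -> float:
    # Pool all k-mer windows across the cohort, then take unique / total.
    # 1.0 = every window distinct; values near 0 signal motif collapse.
    kmers = [s[i:i + k] for s in seqs for i in range(len(s) - k + 1)]
    return len(set(kmers)) / len(kmers) if kmers else 0.0
```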
### `transfer_specificity` (cell_type_transfer rows only)
|
|
Fraction of `cell_type_transfer` rows where the prediction is **both**:
* more active in target than in source, **and**
* more active in target than the reference was in target.
|
|
Both conditions are required because activating the target alone could
still fail the "transfer" intent (the original could already have been
more active in target).
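A sketch of the aggregate, assuming each row is a dict carrying the `edit_type` and the oracle score fields named in the tables above:

```python
def transfer_specificity(rows) -> float:
    # Restrict to cell_type_transfer rows, then require both conditions.
    xfer = [r for r in rows if r["edit_type"] == "cell_type_transfer"]
    ok = sum(r["pred_activity_tgt"] > r["pred_activity_src"]    # target > source
             and r["pred_activity_tgt"] > r["ref_activity_tgt"]  # beats ref in target
             for r in xfer)
    return ok / len(xfer) if xfer else 0.0
```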
|
|
### `mean_target_motif_pwm_present`, `pwm_n_observed`
|
|
Optional supplementary check using a real PWM scan against
`--meme-file`. Falls back to `None` when the MEME database isn't on
disk. Confirms the IUPAC regex match isn't a false positive on a
random A/C/G/T match.
|
|
## Per-cell breakdown
|
|
`per_cell_type` repeats the aggregates above with rows bucketed by
`row.metadata.cell_type` (Ex / In / OPC / Ast / Oli / Mic / End). Lets
us see whether the model is uniformly OK across cells or biased toward
the over-represented Ex cell type.
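The bucketing can be sketched as a simple group-by over row dicts (field names are illustrative):

```python
from collections import defaultdict

def per_cell_type(rows, flag: str) -> dict:
    # Bucket rows by cell type, then take the mean of a binary flag per bucket.
    buckets = defaultdict(list)
    for r in rows:
        buckets[r["cell_type"]].append(r[flag])
    return {ct: sum(v) / len(v) for ct, v in buckets.items()}
```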
|
|
## RFT-specific multi-turn metadata
|
|
`scripts/rft_t3.py` (with `--rounds R > 1`) adds these fields to each
output row's `metadata`:
|
|
| Field | Meaning |
|---|---|
| `rft_rounds_used` | How many sampling rounds actually ran for this row (early-stop trims this when an early round already yielded a winner). |
| `rft_total_candidates` | Total candidates the model produced for this row across all rounds. |
| `rft_winner_round` | Round index (0-based) that produced the chosen candidate. |
| `rft_winner_margin` | The objective margin of the chosen candidate (per `edit_type`: `activity_delta_src` for boost, `activity_relative_shift` for transfer, 1.0 for retarget). |
| `rft_winner_edit_distance` | Hamming distance of the chosen candidate to ref (bp). |
| `rft_winner_edit_distance_pct` | Same as a fraction of `len(ref)`. |
| `rft_source` | `"candidate"` if oracle picked a winner; `"heuristic_fallback"` if no candidate satisfied all constraints and we kept the heuristic gold. |
|
|
Read these to track:
* keep-rate = fraction with `rft_source == "candidate"`
* mean rounds-to-success = `mean(rft_rounds_used | rft_source == "candidate")`
* margin distribution = `rft_winner_margin` histogram
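The first two tracking stats can be sketched over a list of per-row `metadata` dicts (the `rft_summary` name is illustrative):

```python
def rft_summary(rows) -> dict:
    # rows: per-row metadata dicts carrying the rft_* fields above.
    kept = [r for r in rows if r["rft_source"] == "candidate"]
    return {
        "keep_rate": len(kept) / len(rows) if rows else 0.0,
        # Mean rounds conditioned on the oracle having picked a winner.
        "mean_rounds_to_success": (sum(r["rft_rounds_used"] for r in kept) / len(kept)
                                   if kept else None),
    }
```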
|
|
## Where to look
|
|
| File | What's in it |
|---|---|
| `runs/exp_t3_*/predict_t3_{raw,enriched}/genqual/genqual_t3_oracle.json` | Aggregate + per-cell oracle metrics for the trained adapter on the test set. |
| `runs/exp_t3_grid_*/zs_{raw,enriched}/genqual/genqual_t3_oracle.json` | Same metrics for the zero-shot LLM baseline. |
| `runs/exp_t3_fusion_sft_${STAMP}/rft_filtered_train.jsonl` | Post-RFT training JSONL. Inspect `metadata.rft_*` fields per row to see how many rounds each row needed and the winner margins. |
|
|