T3 metrics: quick reference
This is the one-pager you read when looking at a T3 results file.
Authoritative implementation: scripts/eval_t3_oracle.py. Design
rationale (why these and not heuristic-overlap): t3_evaluation_design.md.
Per-row binary outcomes
For each test row the scorer emits these flags:
| Flag | Definition | Notes |
|---|---|---|
| within_budget | hamming(pred, ref) ≤ row.metadata.edit_budget | Row's own assigned budget, typically 10 bp absolute. |
| length_preserved | len(pred) == len(ref) | Reference is usually 500 bp. |
| target_motif_present | IUPAC regex match for row.metadata.target_motif in pred | Forward + reverse-complement scan. |
| objective_success | Depends on row.metadata.edit_type (see below). | The headline "did this edit do its job" flag. |
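The first three flags can be sketched as follows. This is a minimal illustration of the definitions above, not the actual scripts/eval_t3_oracle.py code; the function names and the IUPAC table are illustrative.

```python
import re

# IUPAC nucleotide code -> regex character class.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "[AG]", "Y": "[CT]",
         "S": "[GC]", "W": "[AT]", "K": "[GT]", "M": "[AC]",
         "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]"}

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def hamming(a: str, b: str) -> int:
    # Only meaningful for equal-length strings; a length mismatch already
    # fails length_preserved.
    return sum(x != y for x, y in zip(a, b))

def motif_present(seq: str, motif: str) -> bool:
    # Forward + reverse-complement scan, as described in the table above.
    pattern = "".join(IUPAC[c] for c in motif.upper())
    rc = seq.translate(COMPLEMENT)[::-1]
    return re.search(pattern, seq) is not None or re.search(pattern, rc) is not None

def row_flags(pred: str, ref: str, edit_budget: int, target_motif: str) -> dict:
    return {
        "length_preserved": len(pred) == len(ref),
        "within_budget": len(pred) == len(ref) and hamming(pred, ref) <= edit_budget,
        "target_motif_present": motif_present(pred, target_motif),
    }
```

The reverse-complement scan means a motif is counted even when it sits on the opposite strand of the predicted sequence.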
objective_success per edit_type:
| edit_type | objective_success true iff … |
|---|---|
| activity_boost | pred_activity_src > ref_activity_src (oracle activity in source cell type went up) |
| cell_type_transfer | (pred_tgt - pred_src) - (ref_tgt - ref_src) > 0 (relative shift toward target cell type increased) |
| promoter_retarget | target_motif_present (the new TF motif landed in the sequence) |
Per-row continuous values
| Field | What it measures |
|---|---|
| edit_distance | Absolute Hamming distance between pred and ref, in bp. |
| edit_distance_pct | edit_distance / len(ref), the fraction of bases changed. |
| pred_activity_src, pred_activity_tgt | Oracle activity scores for pred in source / target cell type. |
| ref_activity_src, ref_activity_tgt | Oracle activity scores for ref in source / target cell type. |
| activity_delta_src | pred_activity_src - ref_activity_src (used for activity_boost). |
| activity_relative_shift | (pred_tgt - pred_src) - (ref_tgt - ref_src) (used for cell_type_transfer). |
Aggregate metrics (paper-table column candidates)
mean_* for each binary flag is the obvious "fraction of rows where the
flag fired". The interesting non-trivial aggregates are:
in_budget_at_5pct / in_budget_at_10pct / in_budget_at_20pct
This is the "percentage of edit distance" you asked about.
Definition: fraction of rows where edit_distance ≤ X% of len(ref),
ignoring the row's own edit_budget. Lets us compare across rows
with different assigned budgets.
For a 500 bp reference enhancer:
| Threshold | Bp budget | Interpretation |
|---|---|---|
| in_budget_at_5pct | ≤ 25 bp | Minimal, near-surgical edit. |
| in_budget_at_10pct | ≤ 50 bp | Moderate edit. |
| in_budget_at_20pct | ≤ 100 bp | Substantial edit (model is rewriting). |
So in_budget_at_5pct = 0.85 means 85% of the model's edits change ≤
5% of the sequence, i.e. the model is making minimal, focused changes.
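The aggregate itself is a one-liner over the per-row records. A minimal sketch, assuming each row record exposes `edit_distance` and `ref_len` (illustrative key names):

```python
def in_budget_at_pct(rows: list[dict], pct: float) -> float:
    # Fraction of rows whose edit distance is within pct% of the reference
    # length. Deliberately ignores each row's own edit_budget so rows with
    # different assigned budgets stay comparable.
    hits = sum(r["edit_distance"] <= pct / 100 * r["ref_len"] for r in rows)
    return hits / len(rows)
```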
Why three thresholds? Different downstream applications care about different "small": a SNP-style retarget is OK with 5%, a CRE-style rewrite might allow 20%. Reporting all three lets reviewers pick the threshold that matches their bias. Paper precedent: Lin et al., NeurIPS 2024.
within_budget (no _at_pct suffix) is distinct β it uses the
row's own assigned edit_budget (typically 10 bp absolute).
within_budget and in_budget_at_5pct can disagree:
- within_budget=False, in_budget_at_5pct=True → edit was 12 bp; the row's budget was 10 (fail) but 5% of 500 = 25 (pass).
- within_budget=True, in_budget_at_5pct=False → only happens when the row's budget exceeds 5% of len(ref); rare for our prod data.
kmer6_diversity
Fraction of unique 6-mers across the predicted sequences for the cohort (across rows). Catches the "model collapsed to one motif" failure mode.
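One plausible reading of that definition, as a sketch: count distinct k-mers over the total number of k-mer windows emitted across the cohort. This is an assumption about the denominator (total windows, not all 4^6 possible 6-mers); check scripts/eval_t3_oracle.py for the authoritative version.

```python
def kmer_diversity(seqs: list[str], k: int = 6) -> float:
    # Distinct k-mers observed across all predictions, divided by the total
    # number of k-mer windows. A collapsed model re-emits the same few
    # k-mers everywhere and scores near zero.
    seen, total = set(), 0
    for s in seqs:
        for i in range(len(s) - k + 1):
            seen.add(s[i:i + k])
            total += 1
    return len(seen) / total if total else 0.0
```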
transfer_specificity (cell_type_transfer rows only)
Fraction of cell_type_transfer rows where the prediction is both:
- more active in target than in source, and
- more active in target than the reference was in target.
Both required because activating the target alone could still fail the "transfer" intent (the original could already have been more active in target).
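A sketch of that two-condition filter, assuming row records carry the per-row fields from the continuous-values table plus an `edit_type` key (illustrative layout, not the scorer's actual schema):

```python
def transfer_specificity(rows: list[dict]) -> float:
    # Only cell_type_transfer rows count; both conditions must hold.
    xfer = [r for r in rows if r["edit_type"] == "cell_type_transfer"]
    ok = sum(
        r["pred_activity_tgt"] > r["pred_activity_src"]   # more active in target than source
        and r["pred_activity_tgt"] > r["ref_activity_tgt"]  # and beats the reference in target
        for r in xfer
    )
    return ok / len(xfer) if xfer else 0.0
```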
mean_target_motif_pwm_present, pwm_n_observed
Optional supplementary check using a real PWM scan against
--meme-file. Falls back to None when the meme database isn't on
disk. Confirms the IUPAC regex match isn't a false positive on a
random A/C/G/T match.
Per-cell breakdown
per_cell_type repeats the aggregates above with rows bucketed by
row.metadata.cell_type (Ex / In / OPC / Ast / Oli / Mic / End). Lets
us see whether the model is uniformly OK across cells or biased toward
the over-represented Ex cell type.
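The bucketing is a plain group-by over cell_type. A minimal sketch (key names illustrative):

```python
from collections import defaultdict

def per_cell_type(rows: list[dict], flag: str) -> dict[str, float]:
    # Bucket rows by their cell_type, then take the mean of one binary flag
    # per bucket -- the same idea as the global mean_* aggregates.
    buckets: dict[str, list[bool]] = defaultdict(list)
    for r in rows:
        buckets[r["cell_type"]].append(r[flag])
    return {ct: sum(vals) / len(vals) for ct, vals in buckets.items()}
```

Comparing the per-bucket means against the global mean is what reveals a model that only performs well on the over-represented Ex rows.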
RFT-specific multi-turn metadata
scripts/rft_t3.py (with --rounds R > 1) adds these fields to each
output row's metadata:
| Field | Meaning |
|---|---|
| rft_rounds_used | How many sampling rounds actually ran for this row (early-stop trims this when an early round already yielded a winner). |
| rft_total_candidates | Total candidates the model produced for this row across all rounds. |
| rft_winner_round | Round index (0-based) that produced the chosen candidate. |
| rft_winner_margin | The objective margin of the chosen candidate (per edit_type: activity_delta_src for boost, activity_relative_shift for transfer, 1.0 for retarget). |
| rft_winner_edit_distance | Hamming distance of the chosen candidate to ref (bp). |
| rft_winner_edit_distance_pct | Same, as a fraction of len(ref). |
| rft_source | "candidate" if the oracle picked a winner; "heuristic_fallback" if no candidate satisfied all constraints and we kept the heuristic gold. |
Read these to track:
- keep-rate = fraction with rft_source == "candidate"
- mean rounds-to-success = mean(rft_rounds_used | rft_source == "candidate")
- margin distribution = rft_winner_margin histogram
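Those three summaries can be pulled from the metadata dicts along these lines (a sketch over the rft_* fields listed above; the function name is illustrative):

```python
def rft_summary(rows: list[dict]) -> dict:
    # rows: per-row metadata dicts carrying the rft_* fields.
    kept = [r for r in rows if r["rft_source"] == "candidate"]
    return {
        # keep-rate: how often the oracle found a real winner.
        "keep_rate": len(kept) / len(rows) if rows else 0.0,
        # mean rounds-to-success, conditioned on having kept a candidate.
        "mean_rounds_to_success":
            sum(r["rft_rounds_used"] for r in kept) / len(kept) if kept else None,
        # sorted margins; feed these to a histogram.
        "winner_margins": sorted(r["rft_winner_margin"] for r in kept),
    }
```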
Where to look
| File | What's in it |
|---|---|
| runs/exp_t3_*/predict_t3_{raw,enriched}/genqual/genqual_t3_oracle.json | Aggregate + per-cell oracle metrics for the trained adapter on the test set. |
| runs/exp_t3_grid_*/zs_{raw,enriched}/genqual/genqual_t3_oracle.json | Same metrics for the zero-shot LLM baseline. |
| runs/exp_t3_fusion_sft_${STAMP}/rft_filtered_train.jsonl | Post-RFT training JSONL. Inspect metadata.rft_* fields per row to see how many rounds each row needed and the winner margins. |