explcre/dnathinker-checkpoints

explcre committed on Apr 27

Commit

9bd3c19

1 Parent(s): 9dc753a

Upload docs/t3_metrics_quickref.md with huggingface_hub

docs/t3_metrics_quickref.md +131 -0

	@@ -0,0 +1,131 @@

+# T3 metrics — quick reference
+This is the one-pager you read when looking at a T3 results file.
+Authoritative implementation: `scripts/eval_t3_oracle.py`. Design
+rationale (why these and not heuristic-overlap): `t3_evaluation_design.md`.
+## Per-row binary outcomes
+For each test row the scorer emits these flags:
+| Flag | Definition | Notes |
+|---|---|---|
+| `within_budget` | `hamming(pred, ref) ≤ row.metadata.edit_budget` | Row's own assigned budget, typically 10 bp absolute. |
+| `length_preserved` | `len(pred) == len(ref)` | Reference is usually 500 bp. |
+| `target_motif_present` | IUPAC regex match for `row.metadata.target_motif` in `pred` | Forward + reverse-complement scan. |
+| `objective_success` | Depends on `row.metadata.edit_type` (see below). | The headline "did this edit do its job" flag. |
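The first three flags in the table can be sketched in a few lines of Python. This is a minimal illustration, not the code in `scripts/eval_t3_oracle.py`: the helper names (`hamming`, `motif_present`, `row_flags`) are hypothetical, and treating a length mismatch as over budget is an assumption, since Hamming distance is only defined for equal-length sequences.

```python
import re

# IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def hamming(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def reverse_complement(seq: str) -> str:
    return seq.upper().translate(COMPLEMENT)[::-1]


def motif_present(motif: str, seq: str) -> bool:
    """IUPAC regex scan of seq on the forward and reverse-complement strand."""
    pattern = re.compile("".join(IUPAC[c] for c in motif.upper()))
    seq = seq.upper()
    return bool(pattern.search(seq) or pattern.search(reverse_complement(seq)))


def row_flags(pred: str, ref: str, edit_budget: int, target_motif: str) -> dict:
    same_len = len(pred) == len(ref)
    return {
        "length_preserved": same_len,
        # Assumption: a length mismatch is scored as over budget.
        "within_budget": same_len and hamming(pred, ref) <= edit_budget,
        "target_motif_present": motif_present(target_motif, pred),
    }


print(row_flags("ACGTTCGTAC", "ACGTACGTAC", edit_budget=2, target_motif="GTWC"))
```

Scanning the reverse complement of `pred` for the motif is equivalent to scanning `pred` for the reverse-complemented motif, so a single compiled pattern covers both strands.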
+`objective_success` per `edit_type`:
+| `edit_type` | `objective_success` true iff … |
+|---|---|
+| `activity_boost` | `pred_activity_src > ref_activity_src` (oracle activity in source cell type went up) |
+| `cell_type_transfer` | `(pred_tgt − pred_src) − (ref_tgt − ref_src) > 0` (relative shift toward target cell type increased) |
+| `promoter_retarget` | `target_motif_present` (the new TF motif landed in the sequence) |
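The dispatch in the table above could look roughly like this. It is a sketch under assumptions: the function takes a flat dict `v` holding the per-row values (the real scorer may structure rows differently), and raising on an unknown `edit_type` is a choice made here, not documented behavior.

```python
def objective_success(edit_type: str, v: dict) -> bool:
    """Return the headline flag for one row, following the table above."""
    if edit_type == "activity_boost":
        # Oracle activity in the source cell type went up.
        return v["pred_activity_src"] > v["ref_activity_src"]
    if edit_type == "cell_type_transfer":
        # Relative shift toward the target cell type increased.
        pred_gap = v["pred_activity_tgt"] - v["pred_activity_src"]
        ref_gap = v["ref_activity_tgt"] - v["ref_activity_src"]
        return pred_gap - ref_gap > 0
    if edit_type == "promoter_retarget":
        # The new TF motif landed in the predicted sequence.
        return bool(v["target_motif_present"])
    raise ValueError(f"unknown edit_type: {edit_type!r}")
```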
+## Per-row continuous values
+| Field | What it measures |
+|---|---|
+| `edit_distance` | Absolute Hamming distance pred ↔ ref, in bp. |
+| `edit_distance_pct` | `edit_distance / len(ref)` — fraction of bases changed. |
+| `pred_activity_src`, `pred_activity_tgt` | Oracle activity scores for `pred` in source / target cell type. |
+| `ref_activity_src`, `ref_activity_tgt` | Oracle activity scores for `ref` in source / target cell type. |
+| `activity_delta_src` | `pred_activity_src − ref_activity_src` (used for `activity_boost`). |
+| `activity_relative_shift` | `(pred_tgt − pred_src) − (ref_tgt − ref_src)` (used for `cell_type_transfer`). |
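As a worked example of the derived fields, the sketch below computes them from the raw oracle scores. The function name `derived_fields` and its flat argument list are hypothetical, and the numbers are made up for illustration.

```python
def derived_fields(edit_distance: int, ref_len: int,
                   pred_src: float, pred_tgt: float,
                   ref_src: float, ref_tgt: float) -> dict:
    """Compute the derived continuous values from the raw per-row scores."""
    return {
        # Fraction of bases changed.
        "edit_distance_pct": edit_distance / ref_len,
        # Change in source-cell activity (used for activity_boost).
        "activity_delta_src": pred_src - ref_src,
        # Change in the target-minus-source gap (used for cell_type_transfer).
        "activity_relative_shift": (pred_tgt - pred_src) - (ref_tgt - ref_src),
    }


# Illustrative row: a 12 bp edit on a 500 bp reference.
d = derived_fields(12, 500, pred_src=0.25, pred_tgt=0.75, ref_src=0.5, ref_tgt=0.5)
print(d)
```

In this example the edit lowers source activity by 0.25 and widens the target-minus-source gap by 0.5, so it would fail `activity_boost` but pass `cell_type_transfer`.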
+## Aggregate metrics (paper-table column candidates)
+`mean_*` for each binary flag is the obvious "fraction of rows where the
+flag fired". The interesting non-trivial aggregates are:
+### `in_budget_at_5pct` / `in_budget_at_10pct` / `in_budget_at_20pct`
+**This is the "percentage of edit distance" you asked about.**
+Definition: fraction of rows where `edit_distance ≤ X% of len(ref)`,
+**ignoring** the row's own `edit_budget`. Lets us compare across rows
+with different assigned budgets.
+For a 500 bp reference enhancer:
+| Threshold | Bp budget | Interpretation |
+|---|---|---|
+| `in_budget_at_5pct` | ≤ 25 bp | Minimal, near-surgical edit. |
+| `in_budget_at_10pct` | ≤ 50 bp | Moderate edit. |
+| `in_budget_at_20pct` | ≤ 100 bp | Substantial edit (model is rewriting). |
+So `in_budget_at_5pct = 0.85` means **85% of the model's edits change ≤
+5% of the sequence**, i.e. the model is making minimal, focused changes.
+Why three thresholds? Different downstream applications care about
+different "small": a SNP-style retarget is OK with 5%, a CRE-style
+rewrite might allow 20%. Reporting all three lets reviewers pick the
+threshold that matches their bias. Paper precedent: Lin et al.,
+NeurIPS 2024.
+`within_budget` (no `_at_pct` suffix) is **distinct** — it uses the
+row's own assigned `edit_budget` (typically 10 bp absolute).
+`within_budget` and `in_budget_at_5pct` can disagree:
+* `within_budget=False, in_budget_at_5pct=True` — edit was 12 bp; row's
+  budget was 10 (fail) but 5% of 500 = 25 (pass).
+* `within_budget=True, in_budget_at_5pct=False` — only happens when the
+  row's budget exceeds 5% of len(ref); rare for our prod data.
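The threshold definition and both disagreement cases can be checked with a small sketch. The flat row dicts and the `ref_len` key are assumptions for illustration; the real rows keep the budget under `row.metadata.edit_budget`.

```python
def in_budget_at_pct(rows: list, pct: int) -> float:
    """Fraction of rows whose edit_distance is at most pct% of len(ref)."""
    # Integer comparison avoids float rounding exactly at the threshold.
    hits = [100 * r["edit_distance"] <= pct * r["ref_len"] for r in rows]
    return sum(hits) / len(hits)


def mean_within_budget(rows: list) -> float:
    """Fraction of rows that respect their own assigned edit_budget."""
    hits = [r["edit_distance"] <= r["edit_budget"] for r in rows]
    return sum(hits) / len(hits)


# Made-up rows, all with a 500 bp reference.
rows = [
    {"edit_distance": 8, "edit_budget": 10, "ref_len": 500},
    # within_budget=False, in_budget_at_5pct=True (12 > 10 but 12 <= 25).
    {"edit_distance": 12, "edit_budget": 10, "ref_len": 500},
    # within_budget=True, in_budget_at_5pct=False (budget 40 exceeds 25).
    {"edit_distance": 30, "edit_budget": 40, "ref_len": 500},
    {"edit_distance": 60, "edit_budget": 10, "ref_len": 500},
    {"edit_distance": 120, "edit_budget": 10, "ref_len": 500},
]

for pct in (5, 10, 20):
    print(f"in_budget_at_{pct}pct =", in_budget_at_pct(rows, pct))
print("mean_within_budget =", mean_within_budget(rows))
```

Because the thresholds are nested, `in_budget_at_5pct ≤ in_budget_at_10pct ≤ in_budget_at_20pct` always holds for a given set of rows.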
+### `kmer6_diversity`
+Fraction of unique 6-mers across the predicted sequences for the cohort
+(across rows). Catches the "model collapsed to one motif" failure mode.
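One plausible reading of this definition is distinct 6-mers divided by total 6-mer windows, pooled over every predicted sequence. The sketch below implements that reading; the denominator is an assumption, and the scorer script remains the authority on the exact formula.

```python
def kmer_diversity(seqs: list, k: int = 6) -> float:
    """Distinct k-mers divided by total k-mer windows, pooled over all seqs."""
    seen = set()
    total = 0
    for s in seqs:
        # Sequences shorter than k contribute no windows.
        for i in range(len(s) - k + 1):
            seen.add(s[i:i + k])
            total += 1
    return len(seen) / total if total else 0.0


# Two identical predictions: 3 distinct 6-mers over 6 windows.
print(kmer_diversity(["ACGTACGT", "ACGTACGT"]))
```

Under this reading, a cohort where every prediction repeats the same few motifs scores near 0, while a cohort with little 6-mer reuse scores near 1.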
+### `transfer_specificity` (cell_type_transfer rows only)
+Fraction of `cell_type_transfer` rows where the prediction is **both**:
+* more active in target than in source, **and**
+* more active in target than the reference was in target.
+Both conditions are required because either one alone can miss the
+"transfer" intent: a prediction can beat the reference in target while
+still being more active in source, and it can favor target over source
+only because the original already did.
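The two-part condition can be sketched as follows. The flat row dicts are an assumption for illustration (in the real rows `edit_type` sits under `row.metadata`), and returning `None` when there are no transfer rows is a choice made here.

```python
def transfer_specificity(rows: list):
    """Fraction of cell_type_transfer rows that satisfy both conditions."""
    transfer = [r for r in rows if r["edit_type"] == "cell_type_transfer"]
    if not transfer:
        return None  # Metric is undefined without transfer rows.
    hits = sum(
        # More active in target than in source ...
        r["pred_activity_tgt"] > r["pred_activity_src"]
        # ... and more active in target than the reference was.
        and r["pred_activity_tgt"] > r["ref_activity_tgt"]
        for r in transfer
    )
    return hits / len(transfer)
```

A row with `pred_activity_tgt=0.6`, `pred_activity_src=0.2`, `ref_activity_tgt=0.7` passes the first check but fails the second, so it does not count toward the metric.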
+### `mean_target_motif_pwm_present`, `pwm_n_observed`
+Optional supplementary check using a real PWM scan against
+`--meme-file`. Falls back to `None` when the MEME database isn't on
+disk. Confirms that the IUPAC regex match isn't a false positive from a
+chance A/C/G/T match.
+## Per-cell breakdown
+`per_cell_type` repeats the aggregates above with rows bucketed by
+`row.metadata.cell_type` (Ex / In / OPC / Ast / Oli / Mic / End). Lets
+us see whether the model performs uniformly across cell types or is
+biased toward the over-represented Ex cell type.
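The bucketing amounts to a group-by on cell type. A minimal sketch for one binary flag, assuming flat row dicts with a `cell_type` key (hypothetical layout; the real rows keep it under `row.metadata`):

```python
from collections import defaultdict


def per_cell_mean(rows: list, flag: str) -> dict:
    """Mean of one binary flag, bucketed by cell type."""
    buckets = defaultdict(list)
    for r in rows:
        buckets[r["cell_type"]].append(bool(r[flag]))
    return {cell: sum(vals) / len(vals) for cell, vals in buckets.items()}


rows = [
    {"cell_type": "Ex", "objective_success": True},
    {"cell_type": "Ex", "objective_success": False},
    {"cell_type": "Mic", "objective_success": True},
]
print(per_cell_mean(rows, "objective_success"))
```

Reporting the per-bucket row count next to each mean helps when a rare cell type has only a handful of rows.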
+## RFT-specific multi-turn metadata
+`scripts/rft_t3.py` (with `--rounds R > 1`) adds these fields to each
+output row's `metadata`:
+| Field | Meaning |
+|---|---|
+| `rft_rounds_used` | How many sampling rounds actually ran for this row (early-stop trims this when an early round already yielded a winner). |
+| `rft_total_candidates` | Total candidates the model produced for this row across all rounds. |
+| `rft_winner_round` | Round index (0-based) that produced the chosen candidate. |
+| `rft_winner_margin` | The objective margin of the chosen candidate (per `edit_type`: `activity_delta_src` for boost, `activity_relative_shift` for transfer, 1.0 for retarget). |
+| `rft_winner_edit_distance` | Hamming distance of the chosen candidate to ref (bp). |
+| `rft_winner_edit_distance_pct` | Same as a fraction of `len(ref)`. |
+| `rft_source` | `"candidate"` if oracle picked a winner; `"heuristic_fallback"` if no candidate satisfied all constraints and we kept the heuristic gold. |
+Read these to track:
+* keep-rate = fraction with `rft_source=="candidate"`
+* mean rounds-to-success = `mean(rft_rounds_used | rft_source=="candidate")`
+* margin distribution = `rft_winner_margin` histogram
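These three summaries can be computed from the per-row metadata along the following lines. This is a sketch: `rft_summary` is a hypothetical helper, rows are assumed to be dicts already parsed from the JSONL, and restricting the margins to `"candidate"` rows is an assumption about which rows carry a meaningful margin.

```python
def rft_summary(rows: list) -> dict:
    """Keep-rate, mean rounds-to-success, and winner margins for RFT rows."""
    kept = [r["metadata"] for r in rows
            if r["metadata"]["rft_source"] == "candidate"]
    return {
        # Fraction of rows where the oracle picked a candidate.
        "keep_rate": len(kept) / len(rows) if rows else None,
        # Mean rounds used, over kept rows only.
        "mean_rounds_to_success": (
            sum(m["rft_rounds_used"] for m in kept) / len(kept) if kept else None
        ),
        # Raw margins, ready to be binned into a histogram.
        "margins": [m["rft_winner_margin"] for m in kept],
    }
```

To apply it to the post-RFT training file, each line can be parsed with `json.loads` and the resulting list passed in.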
+## Where to look
+| File | What's in it |
+|---|---|
+| `runs/exp_t3_*/predict_t3_{raw,enriched}/genqual/genqual_t3_oracle.json` | Aggregate + per-cell oracle metrics for the trained adapter on the test set. |
+| `runs/exp_t3_grid_*/zs_{raw,enriched}/genqual/genqual_t3_oracle.json` | Same metrics for the zero-shot LLM baseline. |
+| `runs/exp_t3_fusion_sft_${STAMP}/rft_filtered_train.jsonl` | Post-RFT training JSONL. Inspect `metadata.rft_*` fields per row to see how many rounds each row needed and the winner margins. |