# T3 metrics: quick reference
|
|
This is the one-pager you read when looking at a T3 results file.
Authoritative implementation: `scripts/eval_t3_oracle.py`. Design
rationale (why these and not heuristic-overlap): `t3_evaluation_design.md`.
|
|
## Per-row binary outcomes
|
|
For each test row the scorer emits these flags:
|
|
| Flag | Definition | Notes |
|---|---|---|
| `within_budget` | `hamming(pred, ref) <= row.metadata.edit_budget` | Row's own assigned budget, typically 10 bp absolute. |
| `length_preserved` | `len(pred) == len(ref)` | Reference is usually 500 bp. |
| `target_motif_present` | IUPAC regex match for `row.metadata.target_motif` in `pred` | Forward + reverse-complement scan. |
| `objective_success` | Depends on `row.metadata.edit_type` (see below). | The headline "did this edit do its job" flag. |
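A minimal sketch of how the first three flags can be computed. The helper names (`hamming`, `motif_present`, `row_flags`) are illustrative, not the actual API of `scripts/eval_t3_oracle.py`:

```python
import re

# IUPAC nucleotide codes -> regex character classes
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "[AG]", "Y": "[CT]",
         "S": "[CG]", "W": "[AT]", "K": "[GT]", "M": "[AC]", "B": "[CGT]",
         "D": "[AGT]", "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]"}
COMP = str.maketrans("ACGT", "TGCA")

def hamming(a: str, b: str) -> int:
    # Hamming distance is only defined for equal-length strings.
    return sum(x != y for x, y in zip(a, b))

def motif_present(seq: str, motif: str) -> bool:
    # Scan both the forward strand and the reverse complement.
    pattern = "".join(IUPAC[c] for c in motif.upper())
    rc = seq.translate(COMP)[::-1]
    return bool(re.search(pattern, seq) or re.search(pattern, rc))

def row_flags(pred: str, ref: str, edit_budget: int, target_motif: str) -> dict:
    return {
        "length_preserved": len(pred) == len(ref),
        "within_budget": len(pred) == len(ref) and hamming(pred, ref) <= edit_budget,
        "target_motif_present": motif_present(pred, target_motif),
    }
```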
|
|
`objective_success` per `edit_type`:
|
|
| `edit_type` | `objective_success` true iff … |
|---|---|
| `activity_boost` | `pred_activity_src > ref_activity_src` (oracle activity in source cell type went up) |
| `cell_type_transfer` | `(pred_tgt - pred_src) - (ref_tgt - ref_src) > 0` (relative shift toward target cell type increased) |
| `promoter_retarget` | `target_motif_present` (the new TF motif landed in the sequence) |
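The dispatch in the table above can be sketched as a single function. This assumes the four oracle scores have already been computed; the function name is illustrative:

```python
def objective_success(edit_type: str,
                      pred_src: float, pred_tgt: float,
                      ref_src: float, ref_tgt: float,
                      target_motif_present: bool) -> bool:
    # One rule per edit_type, mirroring the table above.
    if edit_type == "activity_boost":
        return pred_src > ref_src
    if edit_type == "cell_type_transfer":
        return (pred_tgt - pred_src) - (ref_tgt - ref_src) > 0
    if edit_type == "promoter_retarget":
        return target_motif_present
    raise ValueError(f"unknown edit_type: {edit_type}")
```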
|
|
## Per-row continuous values
|
|
| Field | What it measures |
|---|---|
| `edit_distance` | Absolute Hamming distance between `pred` and `ref`, in bp. |
| `edit_distance_pct` | `edit_distance / len(ref)`, i.e. the fraction of bases changed. |
| `pred_activity_src`, `pred_activity_tgt` | Oracle activity scores for `pred` in source / target cell type. |
| `ref_activity_src`, `ref_activity_tgt` | Oracle activity scores for `ref` in source / target cell type. |
| `activity_delta_src` | `pred_activity_src - ref_activity_src` (used for `activity_boost`). |
| `activity_relative_shift` | `(pred_tgt - pred_src) - (ref_tgt - ref_src)` (used for `cell_type_transfer`). |
|
|
## Aggregate metrics (paper-table column candidates)
|
|
`mean_*` for each binary flag is the obvious "fraction of rows where the
flag fired". The interesting non-trivial aggregates are:
|
|
### `in_budget_at_5pct` / `in_budget_at_10pct` / `in_budget_at_20pct`

**This is the "percentage of edit distance" you asked about.**

Definition: fraction of rows where `edit_distance <= X% of len(ref)`,
**ignoring** the row's own `edit_budget`. Lets us compare across rows
with different assigned budgets.
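A sketch of the aggregate, assuming rows are dicts carrying `edit_distance` and a `ref_len` field (field names are illustrative):

```python
def in_budget_at(rows, pct: float) -> float:
    """Fraction of rows whose edit distance is <= pct of the reference length.

    The row's own `edit_budget` is deliberately ignored, so rows with
    different assigned budgets are comparable on one scale.
    """
    hits = sum(r["edit_distance"] <= pct * r["ref_len"] for r in rows)
    return hits / len(rows) if rows else 0.0
```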
|
|
For a 500 bp reference enhancer:
|
|
| Threshold | bp budget | Interpretation |
|---|---|---|
| `in_budget_at_5pct` | ≤ 25 bp | Minimal, near-surgical edit. |
| `in_budget_at_10pct` | ≤ 50 bp | Moderate edit. |
| `in_budget_at_20pct` | ≤ 100 bp | Substantial edit (model is rewriting). |
|
|
So `in_budget_at_5pct = 0.85` means **85% of the model's edits change ≤
5% of the sequence**, i.e. the model is making minimal, focused changes.
|
|
Why three thresholds? Different downstream applications care about
different "small": a SNP-style retarget is OK with 5%, a CRE-style
rewrite might allow 20%. Reporting all three lets reviewers pick the
threshold that matches their bias. Paper precedent: Lin et al.,
NeurIPS 2024.
|
|
`within_budget` (no `_at_pct` suffix) is **distinct**: it uses the
row's own assigned `edit_budget` (typically 10 bp absolute).
`within_budget` and `in_budget_at_5pct` can disagree:
|
|
* `within_budget=False, in_budget_at_5pct=True`: the edit was 12 bp; the
  row's budget was 10 (fail) but 5% of 500 = 25 (pass).
* `within_budget=True, in_budget_at_5pct=False`: only happens when the
  row's budget exceeds 5% of `len(ref)`; rare for our prod data.
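The first disagreement case can be checked with the numbers from the bullet directly:

```python
edit_distance, ref_len = 12, 500
edit_budget = 10  # the row's own absolute budget

within_budget = edit_distance <= edit_budget          # 12 <= 10 -> fails
in_budget_at_5pct = edit_distance <= 0.05 * ref_len   # 12 <= 25 -> passes
```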
|
|
### `kmer6_diversity`

Fraction of unique 6-mers across the predicted sequences for the cohort
(pooled across rows). Catches the "model collapsed to one motif"
failure mode.
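One plausible reading of the definition, shown as a sketch: pool every 6-mer window across the cohort's predictions and take unique over total. The exact denominator the script uses is an assumption here:

```python
def kmer_diversity(seqs, k: int = 6) -> float:
    # Pool all k-mer windows across the cohort, then take unique / total.
    # 1.0 = every window distinct; values near 0 signal motif collapse.
    kmers = [s[i:i + k] for s in seqs for i in range(len(s) - k + 1)]
    return len(set(kmers)) / len(kmers) if kmers else 0.0
```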
### `transfer_specificity` (cell_type_transfer rows only)
|
|
Fraction of `cell_type_transfer` rows where the prediction is **both**:
* more active in target than in source, **and**
* more active in target than the reference was in target.
|
|
Both conditions are required because activating the target alone could
still fail the "transfer" intent (the original could already have been
more active in target).
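A sketch of the aggregate, assuming each row is a dict carrying the `edit_type` and the oracle score fields named in the tables above:

```python
def transfer_specificity(rows) -> float:
    # Restrict to cell_type_transfer rows, then require both conditions.
    xfer = [r for r in rows if r["edit_type"] == "cell_type_transfer"]
    ok = sum(r["pred_activity_tgt"] > r["pred_activity_src"]    # target > source
             and r["pred_activity_tgt"] > r["ref_activity_tgt"]  # beats ref in target
             for r in xfer)
    return ok / len(xfer) if xfer else 0.0
```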
|
|
### `mean_target_motif_pwm_present`, `pwm_n_observed`
|
|
Optional supplementary check using a real PWM scan against
`--meme-file`. Falls back to `None` when the MEME database isn't on
disk. Confirms the IUPAC regex match isn't a false positive on a
random A/C/G/T match.
|
|
## Per-cell breakdown
|
|
`per_cell_type` repeats the aggregates above with rows bucketed by
`row.metadata.cell_type` (Ex / In / OPC / Ast / Oli / Mic / End). Lets
us see whether the model is uniformly OK across cells or biased toward
the over-represented Ex cell type.
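The bucketing can be sketched as a simple group-by over row dicts (field names are illustrative):

```python
from collections import defaultdict

def per_cell_type(rows, flag: str) -> dict:
    # Bucket rows by cell type, then take the mean of a binary flag per bucket.
    buckets = defaultdict(list)
    for r in rows:
        buckets[r["cell_type"]].append(r[flag])
    return {ct: sum(v) / len(v) for ct, v in buckets.items()}
```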
|
|
## RFT-specific multi-turn metadata
|
|
`scripts/rft_t3.py` (with `--rounds R > 1`) adds these fields to each
output row's `metadata`:
|
|
| Field | Meaning |
|---|---|
| `rft_rounds_used` | How many sampling rounds actually ran for this row (early-stop trims this when an early round already yielded a winner). |
| `rft_total_candidates` | Total candidates the model produced for this row across all rounds. |
| `rft_winner_round` | Round index (0-based) that produced the chosen candidate. |
| `rft_winner_margin` | The objective margin of the chosen candidate (per `edit_type`: `activity_delta_src` for boost, `activity_relative_shift` for transfer, 1.0 for retarget). |
| `rft_winner_edit_distance` | Hamming distance of the chosen candidate to ref (bp). |
| `rft_winner_edit_distance_pct` | Same as a fraction of `len(ref)`. |
| `rft_source` | `"candidate"` if oracle picked a winner; `"heuristic_fallback"` if no candidate satisfied all constraints and we kept the heuristic gold. |
|
|
Read these to track:
* keep-rate = fraction with `rft_source == "candidate"`
* mean rounds-to-success = `mean(rft_rounds_used | rft_source == "candidate")`
* margin distribution = `rft_winner_margin` histogram
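The first two tracking stats can be sketched over a list of per-row `metadata` dicts (the `rft_summary` name is illustrative):

```python
def rft_summary(rows) -> dict:
    # rows: per-row metadata dicts carrying the rft_* fields above.
    kept = [r for r in rows if r["rft_source"] == "candidate"]
    return {
        "keep_rate": len(kept) / len(rows) if rows else 0.0,
        # Mean rounds conditioned on the oracle having picked a winner.
        "mean_rounds_to_success": (sum(r["rft_rounds_used"] for r in kept) / len(kept)
                                   if kept else None),
    }
```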
|
|
## Where to look
|
|
| File | What's in it |
|---|---|
| `runs/exp_t3_*/predict_t3_{raw,enriched}/genqual/genqual_t3_oracle.json` | Aggregate + per-cell oracle metrics for the trained adapter on the test set. |
| `runs/exp_t3_grid_*/zs_{raw,enriched}/genqual/genqual_t3_oracle.json` | Same metrics for the zero-shot LLM baseline. |
| `runs/exp_t3_fusion_sft_${STAMP}/rft_filtered_train.jsonl` | Post-RFT training JSONL. Inspect `metadata.rft_*` fields per row to see how many rounds each row needed and the winner margins. |
|
|