name: T1 evaluation metrics must include ALL categories
description: >-
  Whenever showing T1 (enhancer_generation) evaluation results, always report
  oracle + LEONINE motif + basic metrics together — never just one slice
type: feedback
originSessionId: 4037f43b-2133-46c6-84bd-02f7d454ec8b

When reporting T1 evaluation metrics, ALWAYS include all of these categories together in the same table/summary:

  1. Oracle metrics (cell-type specificity via the v3 separate-7 classifier):

    • gen_top1 (joint argmax over 7 sigmoid outputs)
    • gen_mean_auroc (average per-cell AUROC)
    • gen_target_ce (cross-entropy of softmax(logits) against target cell)
    • gold_top1 (sanity baseline — should be ~0.97 with v3 separate-7)
    • per-cell recall + AUROC breakdown
  2. LEONINE-style motif metrics (cell-type-specific motif distribution):

    • JS divergence heatmap (matched cell vs unmatched cells)
    • diagonal-vs-offdiag specificity ratio
    • Frobenius norm of heatmap difference vs gold
    • per-cell motif distribution similarity
  3. CtrlDNA-style motif metrics:

    • per-cell motif correlation R² (gen vs gold motif counts)
    • aggregate correlation across cells
  4. TF-program-filtered specificity (lab's TF panel):

    • tf_program_specificity per cell
  5. Basic validity / fluency:

    • parse_rate (ACGT-only, length-valid sequences)
    • mean_length_ratio (gen length / gold length)
    • mean_gc_abs_err (|GC_gen − GC_gold|)
    • kmer_shannon_entropy, kmer_unique_frac (diversity)
    • homopolymer + tandem repeat fractions
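
As a rough illustration of what several of the metrics above measure, here is a minimal standard-library sketch. The exact definitions used by the evaluation scripts are not shown in this note, so everything below is an assumption: the function names, the k-mer size `k=4`, the length bounds, and the base-2 logarithms are all hypothetical choices made for the example.

```python
import math
from collections import Counter


def top1_and_target_ce(logits, targets):
    """Sketch of gen_top1 and gen_target_ce.

    logits: one list of per-cell-type logits per sequence.
    targets: index of the target cell type per sequence.
    """
    hits, ce = 0, 0.0
    for row, t in zip(logits, targets):
        # Sigmoid is monotonic, so argmax over sigmoid outputs == argmax over logits.
        hits += int(max(range(len(row)), key=row.__getitem__) == t)
        # Numerically stable -log softmax(row)[t].
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        ce += log_z - row[t]
    n = len(targets)
    return hits / n, ce / n


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two motif count vectors."""
    sp, sq = sum(p), sum(q)
    p = [x / sp for x in p]
    q = [x / sq for x in q]
    mix = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    return 0.5 * kl(p, mix) + 0.5 * kl(q, mix)


def gc_fraction(seq):
    return sum(c in "GC" for c in seq) / len(seq)


def basic_metrics(gen, gold, k=4, min_len=1, max_len=10_000):
    """Sketch of parse_rate, mean_length_ratio, mean_gc_abs_err and k-mer diversity.

    gen and gold are paired lists of sequences; gold sequences are assumed non-empty.
    """
    pairs = [(g, r) for g, r in zip(gen, gold)
             if set(g) <= set("ACGT") and min_len <= len(g) <= max_len]
    out = {"parse_rate": len(pairs) / len(gen)}
    if not pairs:
        return out
    out["mean_length_ratio"] = sum(len(g) / len(r) for g, r in pairs) / len(pairs)
    out["mean_gc_abs_err"] = sum(
        abs(gc_fraction(g) - gc_fraction(r)) for g, r in pairs) / len(pairs)
    # Pool k-mers over all valid generated sequences.
    kmers = Counter(g[i:i + k] for g, _ in pairs for i in range(len(g) - k + 1))
    total = sum(kmers.values())
    if total:
        out["kmer_shannon_entropy"] = -sum(
            c / total * math.log2(c / total) for c in kmers.values())
        out["kmer_unique_frac"] = len(kmers) / total
    return out
```

The sketch covers only the metrics whose definitions are reasonably unambiguous from their names. AUROC, the specificity ratio, motif R², and `tf_program_specificity` depend on scanner output and definitions not given here, so they are left out rather than guessed.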

Why: the user is working to paper-grade standards and explicitly told me "next time anytime you show evaluation metrics for t1 you should include them all" (with three exclamation marks). A partial metrics table (e.g., only oracle, or only motif) is a shortcut that hides failure modes the user needs to see.

How to apply: after running score_with_classifier_oracle.py on a prediction file, also run run_t1_eval.py --scanner moods (or fimo) to get the motif + basic metrics. Combine into one table per variant. If MOODS/FIMO is still running, say so explicitly and note which categories are pending — don't quietly omit them.
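
One way to make the "say so explicitly" rule hard to skip is to build the summary from a fixed list of required categories, so a missing category shows up as a visible PENDING row instead of disappearing. The sketch below is illustrative only: the category keys, the `render_t1_summary` name, and the table layout are hypothetical and do not reflect the output format of `score_with_classifier_oracle.py` or `run_t1_eval.py`.

```python
# Hypothetical category keys mirroring the five categories listed in this note.
REQUIRED_CATEGORIES = [
    "oracle",
    "leonine_motif",
    "ctrldna_motif",
    "tf_program",
    "basic",
]


def render_t1_summary(variant, results):
    """Render one table per variant; categories without results are marked PENDING.

    results maps category -> {metric_name: value}. A missing key, None, or an
    empty dict all mean "not available yet" (e.g. the motif scanner is still running).
    Returns the table text and the list of pending categories.
    """
    lines = [
        f"T1 metrics for variant: {variant}",
        "| category | metric | value |",
        "|---|---|---|",
    ]
    pending = []
    for cat in REQUIRED_CATEGORIES:
        metrics = results.get(cat)
        if not metrics:
            pending.append(cat)
            lines.append(f"| {cat} | (all) | PENDING |")
            continue
        for name, value in metrics.items():
            shown = f"{value:.4f}" if isinstance(value, float) else str(value)
            lines.append(f"| {cat} | {name} | {shown} |")
    if pending:
        lines.append("Pending categories: " + ", ".join(pending))
    return "\n".join(lines), pending


# Example: oracle and basic metrics are ready, motif scanning is still running.
table, pending = render_t1_summary(
    "example_variant",
    {"oracle": {"gen_top1": 0.5}, "basic": {"parse_rate": 1.0}},
)
print(table)
```

Because the required list is fixed in one place, a partial table cannot be produced silently: any category that has not been filled in is named both in its own row and in the trailing "Pending categories" line.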