explcre/phase8_rl

explcre committed on May 7

Commit 05264d1 (verified) · 1 parent: 28eaef2

Upload _claude_memory/project_t1_no_enhancer_scan.md with huggingface_hub

Files changed (1)

_claude_memory/project_t1_no_enhancer_scan.md ADDED

+---
+name: T1 deliberately does NOT use enhancer-side scan (label leakage)
+description: T1 task = generate enhancer DNA conditional on cell type + promoter; including enhancer-side scan in training would leak the answer
+type: project
+originSessionId: 4037f43b-2133-46c6-84bd-02f7d454ec8b
+---
+T1 (`enhancer_generation`) deliberately uses **promoter-only TFBS scan**
+in its training data (`data/full_enriched/jsonl/train.enhancer_generation.jsonl`,
+V1 schema). It does NOT use the V2 enhancer-side scan (enh-scan) that T2 and T3 use.
+**Why:** T1's task is to GENERATE an enhancer DNA sequence from
+{cell type, promoter, prompt}. If the input includes a TFBS scan of
+the *enhancer*, that scan reveals the motif content the model is supposed
+to invent, which is direct label leakage. So T1 is intentionally V1.
+**How to apply:**
+- Plan B 200k T1: subsample from `data/full_enriched/jsonl/train.enhancer_generation.jsonl`
+  (V1, 14.20 GB, ~1.5M rows). Don't try to "upgrade" T1 to V2.
+- T2 (pair_prediction) and T3 (enhancer_editing) DO use V2 enh-scan
+  because they need the enhancer's motifs as input context (not as label).
+- If the lab proposes a "V2 T1" with enh-scan, push back: that breaks the
+  experimental design.
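
The "How to apply" steps can be sketched in Python. This is not part of the commit and is not the project's actual tooling; it is a minimal sketch that assumes the V1 file is plain JSONL with one record per line. The function names, the output path, and the `forbidden_keys` parameter are hypothetical. The memory file does not name the fields the V2 enh-scan adds, so the caller must supply them. Reservoir sampling is used here as one way to draw a uniform 200k subsample in a single pass without loading the whole 14.20 GB file.

```python
import json
import random
from typing import Iterable, List, Sequence


def reservoir_sample(lines: Iterable[str], k: int, seed: int = 0) -> List[str]:
    """Uniformly sample up to k lines in one pass (Algorithm R)."""
    if k <= 0:
        return []
    rng = random.Random(seed)
    sample: List[str] = []
    for i, line in enumerate(lines):
        if i < k:
            # Fill the reservoir with the first k lines.
            sample.append(line)
        else:
            # Keep line i with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = line
    return sample


def find_forbidden_keys(line: str, forbidden_keys: Sequence[str]) -> List[str]:
    """Return the forbidden keys present at the top level of one JSONL record.

    Only top-level keys are checked; nested fields are not inspected.
    """
    record = json.loads(line)
    return [key for key in forbidden_keys if key in record]


def subsample_jsonl(
    src_path: str,
    dst_path: str,
    k: int,
    seed: int = 0,
    forbidden_keys: Sequence[str] = (),
) -> int:
    """Write a k-row subsample of src_path to dst_path; return rows written.

    Raises ValueError if any sampled record carries a forbidden key, which
    would suggest enhancer-side scan data leaked into the T1 inputs.
    """
    with open(src_path, "r", encoding="utf-8") as src:
        sample = reservoir_sample(src, k, seed)
    for line in sample:
        leaked = find_forbidden_keys(line, forbidden_keys)
        if leaked:
            raise ValueError(f"possible label leakage, found keys: {leaked}")
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.writelines(sample)
    return len(sample)
```

A call might look like `subsample_jsonl("data/full_enriched/jsonl/train.enhancer_generation.jsonl", "train.t1_200k.jsonl", k=200_000, forbidden_keys=[...])`, where the output filename is made up and the key list must come from the real V2 schema. Only the k sampled lines are held in memory, and a fixed `seed` keeps the subsample reproducible.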