explcre committed on
Commit 05264d1 · verified · 1 Parent(s): 28eaef2

Upload _claude_memory/project_t1_no_enhancer_scan.md with huggingface_hub

_claude_memory/project_t1_no_enhancer_scan.md ADDED
@@ -0,0 +1,22 @@
+ ---
+ name: T1 deliberately does NOT use enhancer-side scan (label leakage)
+ description: T1 task = generate enhancer DNA conditional on cell type + promoter; including enhancer-side scan in training would leak the answer
+ type: project
+ originSessionId: 4037f43b-2133-46c6-84bd-02f7d454ec8b
+ ---
+ T1 (`enhancer_generation`) deliberately uses a **promoter-only TFBS scan**
+ in its training data (`data/full_enriched/jsonl/train.enhancer_generation.jsonl`,
+ V1 schema). It does NOT use the V2 enhancer-side scan (enh-scan) that T2 and T3 use.
+
+ **Why:** T1's task is to GENERATE an enhancer DNA sequence from
+ {cell type, promoter, prompt}. If the input includes a TFBS scan of
+ the *enhancer*, that scan reveals motif content the model is supposed
+ to invent, which is direct label leakage. So T1 is intentionally V1.
+
+ **How to apply:**
+ - Plan B 200k T1: subsample from `data/full_enriched/jsonl/train.enhancer_generation.jsonl`
+ (V1, 14.20 GB, ~1.5M rows). Don't try to "upgrade" T1 to V2.
+ - T2 (`pair_prediction`) and T3 (`enhancer_editing`) DO use the V2 enh-scan
+ because they need the enhancer's motifs as input context (not as the label).
+ - If the lab proposes a "V2 T1" with enh-scan, push back: that breaks the
+ experimental design.
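
The Plan B subsampling step could be sketched roughly as below: a one-pass reservoir sample over the V1 JSONL file that also refuses any record carrying an enhancer-side scan. This is a minimal sketch, not the project's actual pipeline. The key name `enhancer_tfbs_scan` and the function names are hypothetical, since the note does not spell out the V2 schema field or any existing tooling; only the file path and the 200k target come from the note.

```python
import json
import random
from typing import Iterable, List

# HYPOTHETICAL key name: the real V2 field for the enhancer-side scan
# is not given in the note, so adjust this to the actual schema.
ENH_SCAN_KEY = "enhancer_tfbs_scan"


def reservoir_sample_lines(lines: Iterable[str], k: int, seed: int = 0) -> List[str]:
    """Uniformly sample up to k raw JSONL lines in a single pass (Algorithm R).

    Raises ValueError if any record contains ENH_SCAN_KEY, because an
    enhancer-side scan in a T1 input would leak the generation target.
    """
    rng = random.Random(seed)
    sample: List[str] = []
    index = 0  # counts non-empty records seen so far
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        if ENH_SCAN_KEY in record:
            raise ValueError(
                f"record {index} carries {ENH_SCAN_KEY!r}: label leakage for T1"
            )
        if index < k:
            sample.append(line)  # fill the reservoir first
        else:
            j = rng.randint(0, index)  # inclusive on both ends
            if j < k:
                sample[j] = line  # replace with probability k / (index + 1)
        index += 1
    return sample


def subsample_file(src: str, dst: str, k: int = 200_000, seed: int = 0) -> int:
    """Write a k-row subsample of src to dst and return the number of rows written."""
    with open(src, "r", encoding="utf-8") as handle:
        sample = reservoir_sample_lines(handle, k, seed)
    with open(dst, "w", encoding="utf-8") as out:
        for line in sample:
            out.write(line if line.endswith("\n") else line + "\n")
    return len(sample)
```

A call such as `subsample_file("data/full_enriched/jsonl/train.enhancer_generation.jsonl", "t1_200k.jsonl")` would stream the source once and hold only the sampled raw lines in memory; the output filename here is illustrative. Keeping raw lines instead of parsed dicts avoids re-serializing records and so leaves the V1 rows byte-for-byte unchanged.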