Upload _claude_memory/project_t1_no_enhancer_scan.md with huggingface_hub
_claude_memory/project_t1_no_enhancer_scan.md
ADDED
@@ -0,0 +1,22 @@
+---
+name: T1 deliberately does NOT use enhancer-side scan (label leakage)
+description: T1 task = generate enhancer DNA conditional on cell type + promoter; including enhancer-side scan in training would leak the answer
+type: project
+originSessionId: 4037f43b-2133-46c6-84bd-02f7d454ec8b
+---
+T1 (`enhancer_generation`) deliberately uses **promoter-only TFBS scan**
+in its training data (`data/full_enriched/jsonl/train.enhancer_generation.jsonl`,
+V1 schema). It does NOT use the V2 enh-scan that T2 and T3 use.
+
+**Why:** T1's task is to GENERATE an enhancer DNA sequence from
+{cell type, promoter, prompt}. If the input includes a TFBS scan of
+the *enhancer*, that scan reveals motif content the model is supposed
+to invent: direct label leakage. So T1 is intentionally V1.
+
+**How to apply:**
+- Plan B 200k T1: subsample from `data/full_enriched/jsonl/train.enhancer_generation.jsonl`
+  (V1, 14.20 GB, ~1.5M rows). Don't try to "upgrade" T1 to V2.
+- T2 (pair_prediction) and T3 (enhancer_editing) DO use V2 enh-scan
+  because they need the enhancer's motifs as input context (not as label).
+- If the lab proposes a "V2 T1" with enh-scan, push back: that breaks the
+  experimental design.
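The subsampling step in "How to apply" could be sketched as below. This is a minimal illustration, not the project's actual pipeline. It assumes "200k" means 200,000 rows, that the JSONL file holds one JSON object per line, and that an enhancer-side scan would appear as a top-level key. The key names in `ENHANCER_SCAN_KEYS` and the function names are hypothetical, since the note does not give the V1 or V2 field names. The sample is drawn by single-pass reservoir sampling so the 14.20 GB file never has to be loaded whole. Each sampled row is then checked so that a V2-style record cannot slip into T1.

```python
import json
import random
from typing import Iterable

# Hypothetical key names: the note does not list the real V2 schema fields.
ENHANCER_SCAN_KEYS = {"enhancer_tfbs_scan", "enh_scan"}


def reservoir_sample(lines: Iterable[str], k: int, seed: int = 0) -> list[str]:
    """Uniformly sample up to k lines in a single pass (Algorithm R)."""
    rng = random.Random(seed)
    sample: list[str] = []
    for i, line in enumerate(lines):
        if i < k:
            sample.append(line)
        else:
            # Keep the new line with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = line
    return sample


def assert_no_enhancer_scan(record: dict) -> None:
    """Raise if a T1 record carries an enhancer-side scan (top-level keys only)."""
    leaked = ENHANCER_SCAN_KEYS.intersection(record)
    if leaked:
        raise ValueError(f"T1 record leaks enhancer scan fields: {sorted(leaked)}")


def subsample_t1(src_path: str, dst_path: str, k: int = 200_000, seed: int = 0) -> int:
    """Write a k-row sample of the V1 T1 file to dst_path; return rows written."""
    with open(src_path, encoding="utf-8") as src:
        sample = reservoir_sample(src, k, seed)
    with open(dst_path, "w", encoding="utf-8") as dst:
        for line in sample:
            assert_no_enhancer_scan(json.loads(line))
            dst.write(line if line.endswith("\n") else line + "\n")
    return len(sample)
```

At 14.20 GB over ~1.5M rows, a row averages roughly 9.5 KB, so holding 200,000 sampled lines in memory takes on the order of 2 GB. If that is too much, sampling line offsets first and reading the rows in a second pass is one alternative.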