explcre committed on
Commit
086e622
·
verified ·
1 Parent(s): 7a2d79b

Upload motif_targets/merged_full/README.md with huggingface_hub

Files changed (1)
  1. motif_targets/merged_full/README.md +67 -0
motif_targets/merged_full/README.md ADDED
@@ -0,0 +1,67 @@
+ # Motif extraction dataset for cell-type-specific enhancer generation
+
+ Per-row TF motif counts extracted from the lab's existing FIMO scans
+ (embedded in the tool-context blocks of `pair_prediction.jsonl`). No MOODS
+ re-scan is needed: the counts are recovered by pure regex extraction.
+
+ **Total: 1,867,057 unique enhancer/promoter sequences × 700
+ case-normalized TFs.**
+
+ ## Files
+
+ * `merged_full/extracted.jsonl` (1.4 GB, **the main file**): the merged dataset.
+   Each row:
+   ```json
+   {
+     "id": "train:pair_pos:OPC:LINC02593:chr1_942043_942543:57::enh",
+     "cell_type": "OPC",
+     "region": "enhancer", // or "promoter"
+     "source": "from_tools_uncapped", // or "from_tools_tier1"
+     "sequence": "CGGCAATTAGCGGAGGCGGCGGGGGAGGGGCGCCGGGGCC...",
+     "motif_counts_vec": [0, 12, 0, 9, 5, 0, 0, ...] // length 700, integer counts
+   }
+   ```
+ * `merged_full/motif_vocab.json`: TF name → vector index (700 entries, sorted alphabetically by name)
+ * `merged_full/summary.json`: per-cell-type and per-source row counts
+ * `from_tools_uncapped/`: extract from the main `pair_prediction.jsonl`, 358,694 rows
+ * `from_tools_tier1/`: `tier1` extract, 1,508,363 rows (much less duplication)
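As a hedged illustration of how `motif_vocab.json` relates to `motif_counts_vec`, the sketch below turns a count vector back into named counts. It assumes the vocab file is a flat JSON object mapping TF name to integer index, as the file list describes. The helper names `load_vocab` and `named_counts` and the three-entry toy vocab are made up for this example and are not part of the dataset.

```python
import json


def load_vocab(path):
    """Read a TF-name -> index mapping (assumed to be a flat JSON object)."""
    with open(path) as f:
        return json.load(f)


def named_counts(vec, vocab):
    """Return {TF name: count} for the nonzero entries of a count vector."""
    index_to_name = {idx: name for name, idx in vocab.items()}
    return {index_to_name[i]: c for i, c in enumerate(vec) if c > 0}


# Toy stand-in for the real 700-entry vocab (hypothetical values).
toy_vocab = {"AR": 0, "ATF1": 1, "CTCF": 2}
print(named_counts([0, 12, 5], toy_vocab))
```

With the real files, `named_counts(row["motif_counts_vec"], load_vocab("merged_full/motif_vocab.json"))` would list the TFs with at least one hit in that row, provided the vocab has the assumed layout.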
+
+ ## Per-cell distribution
+
+ | Cell type | Sequences |
+ |---|---|
+ | Ex | 439,121 |
+ | Mic | 362,755 |
+ | Oli | 311,423 |
+ | Ast | 252,907 |
+ | In | 249,892 |
+ | OPC | 207,407 |
+ | End | 43,552 |
+
+ ## Provenance
+
+ * Source files: `data/full_enriched_v2_with_enh_scan{,_tier1}/jsonl/train.pair_prediction.jsonl`
+ * Each `pair_prediction` row contributes up to 2 records (one enhancer, one promoter)
+ * Deduplicated by sequence hash across both sources
+ * TF names are case-normalized to JASPAR uppercase (e.g. `Ar`/`AR` → `AR`)
+ * The motif vocab matches ~80% of JASPAR2024 9606/CORE (the database used by the eval-MOODS scanner)
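The deduplication and name-normalization steps in this list can be sketched roughly as follows. This is an illustration under assumptions, not the extraction script itself: the README does not say which hash function was used or which duplicate is kept, so SHA-256 and a first-occurrence-wins policy are stand-ins, and the function names are hypothetical.

```python
import hashlib


def normalize_tf_name(name):
    """Case-normalize a TF name to uppercase, e.g. 'Ar' -> 'AR'."""
    return name.upper()


def dedup_by_sequence(records):
    """Keep the first record seen for each distinct sequence.

    Assumption: SHA-256 of the raw sequence string is the dedup key;
    the actual pipeline may hash differently.
    """
    seen = set()
    unique = []
    for rec in records:
        digest = hashlib.sha256(rec["sequence"].encode("ascii")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(rec)
    return unique


# Toy records: the same sequence appears in both sources.
records = [
    {"source": "from_tools_uncapped", "sequence": "ACGT"},
    {"source": "from_tools_tier1", "sequence": "ACGT"},
    {"source": "from_tools_tier1", "sequence": "TTGA"},
]
unique = dedup_by_sequence(records)
print(len(unique), normalize_tf_name("Ar"))
```

Hashing the sequence rather than storing it keeps the `seen` set small, which matters at the scale of millions of records.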
+
+ ## Usage
+
+ Load the rows, as the first step toward training a motif-count regressor:
+ ```python
+ import json
+ import torch  # not used here; needed for the training step
+
+ rows = []
+ with open("merged_full/extracted.jsonl") as f:
+     for line in f:  # one JSON object per line
+         rows.append(json.loads(line))
+ print(f"{len(rows)} rows, vec dim = {len(rows[0]['motif_counts_vec'])}")
+ ```
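Because `extracted.jsonl` is 1.4 GB, holding every parsed row in a list may use a lot of memory. A possible alternative, sketched here under the assumption that each line is one JSON object with a `cell_type` field as in the example row, is to stream the file and filter or tally on the fly. `iter_rows` and `count_by_cell_type` are hypothetical helper names.

```python
import json
from collections import Counter


def iter_rows(path, cell_type=None):
    """Yield parsed rows one at a time, optionally keeping one cell type."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            row = json.loads(line)
            if cell_type is None or row["cell_type"] == cell_type:
                yield row


def count_by_cell_type(path):
    """Tally rows per cell type without keeping the rows in memory."""
    return Counter(row["cell_type"] for row in iter_rows(path))
```

For example, `count_by_cell_type("merged_full/extracted.jsonl")` should give per-cell-type totals comparable to the distribution table, and `iter_rows(path, cell_type="OPC")` would visit only the OPC rows.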
+
+ Targets are integer hit counts. We recommend training on `torch.log1p(counts)`
+ with an MSE loss; the log transform compresses the long tail of high counts.
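To make the recommended target transform concrete, here is a small standard-library sketch. It uses `math.log1p`, the scalar analogue of `torch.log1p`, so it runs without PyTorch; in actual training one would presumably apply `torch.log1p` to the target tensor and use `torch.nn.MSELoss`. The helper names are illustrative, and clamping back-transformed predictions at zero is an assumption of this sketch rather than something the README prescribes.

```python
import math


def to_log_targets(counts):
    """Map raw hit counts to log(1 + count) regression targets."""
    return [math.log1p(c) for c in counts]


def from_log_predictions(preds):
    """Invert the transform; clamp at 0 since counts cannot be negative."""
    return [max(0.0, math.expm1(p)) for p in preds]


def mse(preds, targets):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets)


counts = [0, 12, 0, 9, 5]  # leading entries of the README's example row
targets = to_log_targets(counts)
print([round(t, 3) for t in targets])
```

In log space a count of 12 sits about 2.56 units from a count of 0 instead of 12 units, so a few motif-dense sequences contribute less to the squared error than they would on raw counts.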
+
+ Companion code: `regureasoner_loop/scripts/train_motif_classifier_production.py`
+ in the GitHub repo `explcre/biomodel_reasoning_calling_study2`. It
+ trains a `GroupedSeparateBank` (700 separate CNNs vectorized via grouped
+ Conv1d) on this data.
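The idea of running many separate CNNs through one grouped `Conv1d` can be illustrated with the sketch below. It is not the companion repo's `GroupedSeparateBank`, whose layers and hyperparameters are not described here; `GroupedBankSketch`, its single conv layer, the `hidden` width, and the kernel size are all assumptions chosen to keep the example short. The only technique shown is `groups=n_tfs`, which gives each TF its own filters with no weight sharing.

```python
import torch
from torch import nn


class GroupedBankSketch(nn.Module):
    """n_tfs independent one-layer CNNs evaluated as one grouped Conv1d."""

    def __init__(self, n_tfs=700, hidden=8, kernel_size=12):
        super().__init__()
        self.n_tfs = n_tfs
        # groups=n_tfs: each TF owns 4 input channels and `hidden` filters.
        self.conv = nn.Conv1d(4 * n_tfs, hidden * n_tfs, kernel_size, groups=n_tfs)
        # Per-TF linear head, also grouped so nothing is shared across TFs.
        self.head = nn.Conv1d(hidden * n_tfs, n_tfs, kernel_size=1, groups=n_tfs)

    def forward(self, x):
        # x: (batch, 4, length) one-hot DNA; tile so every group sees a full copy.
        x = x.repeat(1, self.n_tfs, 1)
        h = torch.relu(self.conv(x))      # (batch, hidden * n_tfs, length - k + 1)
        h = h.amax(dim=-1, keepdim=True)  # global max pool over positions
        return self.head(h).squeeze(-1)   # (batch, n_tfs)


# Tiny configuration so the example runs quickly.
model = GroupedBankSketch(n_tfs=3, hidden=2, kernel_size=5)
x = torch.zeros(2, 4, 20)
x[:, 0, :] = 1.0  # toy poly-A input
out = model(x)
print(out.shape)
```

Because the groups never mix channels, changing one TF's filters leaves every other TF's output untouched, which is what makes this equivalent to separate per-TF models while still using a single batched convolution.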