Upload motif_targets/merged_full/README.md with huggingface_hub
motif_targets/merged_full/README.md
ADDED
@@ -0,0 +1,67 @@
# Motif extraction dataset for cell-type-specific enhancer generation

Per-row TF motif counts extracted from the lab's existing FIMO scans
(embedded in tool-context blocks of `pair_prediction.jsonl`). No MOODS
re-scan is needed: the counts are recovered by pure regex extraction.

**Total: 1,867,057 unique enhancer/promoter sequences × 700
case-normalized TFs.**

## Files

* `merged_full/extracted.jsonl` (1.4 GB, **the main file**): the merged dataset.
  Each row:
  ```json
  {
    "id": "train:pair_pos:OPC:LINC02593:chr1_942043_942543:57::enh",
    "cell_type": "OPC",
    "region": "enhancer", // or "promoter"
    "source": "from_tools_uncapped", // or "from_tools_tier1"
    "sequence": "CGGCAATTAGCGGAGGCGGCGGGGGAGGGGCGCCGGGGCC...",
    "motif_counts_vec": [0, 12, 0, 9, 5, 0, 0, ...] // length 700, integer counts
  }
  ```
* `merged_full/motif_vocab.json`: TF name → vec index (700 entries, sorted alphabetically by name)
* `merged_full/summary.json`: per-cell and per-source counts
* `from_tools_uncapped/`: main `pair_prediction.jsonl` extract, 358,694 rows
* `from_tools_tier1/`: `tier1` extract, 1,508,363 rows (much less duplication)
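
To make the row layout concrete, here is a minimal sketch of turning a `motif_counts_vec` back into named TF counts. It assumes `motif_vocab.json` is a flat JSON object mapping TF name to integer index, as the description above suggests; `load_vocab` and `top_motifs` are hypothetical helper names, and the toy vocabulary stands in for the real 700-entry file.

```python
import json


def load_vocab(path):
    """Load the TF-name -> index mapping (assumed to be a flat JSON object)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def top_motifs(row, vocab, k=5):
    """Return up to k (tf_name, count) pairs with the highest non-zero counts."""
    index_to_tf = {idx: name for name, idx in vocab.items()}
    hits = [
        (index_to_tf[i], count)
        for i, count in enumerate(row["motif_counts_vec"])
        if count > 0
    ]
    # Highest count first; ties broken alphabetically by TF name.
    return sorted(hits, key=lambda pair: (-pair[1], pair[0]))[:k]


# Toy stand-ins for illustration only; the real vocab has 700 entries.
toy_vocab = {"AR": 0, "CTCF": 1, "SP1": 2}
toy_row = {"motif_counts_vec": [0, 12, 9]}
print(top_motifs(toy_row, toy_vocab))
```

With the real files, `load_vocab("merged_full/motif_vocab.json")` would replace `toy_vocab`, provided the file has the assumed shape.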

## Per-cell distribution

| Cell | Count |
|---|---|
| Ex | 439,121 |
| Mic | 362,755 |
| Oli | 311,423 |
| Ast | 252,907 |
| In | 249,892 |
| OPC | 207,407 |
| End | 43,552 |
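
The table can be recomputed from the merged file by tallying the `cell_type` field. The sketch below streams the JSONL line by line, so the 1.4 GB file never has to sit in memory; `count_by_cell_type` is a hypothetical helper and the toy lines are illustrative, not real records.

```python
import json
from collections import Counter


def count_by_cell_type(lines):
    """Tally rows per cell type from an iterable of JSONL lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if line:  # tolerate blank lines
            counts[json.loads(line)["cell_type"]] += 1
    return counts


# Toy input for illustration only.
toy_lines = [
    '{"cell_type": "OPC"}',
    '{"cell_type": "Ex"}',
    '{"cell_type": "OPC"}',
]
print(count_by_cell_type(toy_lines).most_common())

# On the real file (path taken from the Files section):
# with open("merged_full/extracted.jsonl", encoding="utf-8") as f:
#     print(count_by_cell_type(f).most_common())
```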

## Provenance

* Source files: `data/full_enriched_v2_with_enh_scan{,_tier1}/jsonl/train.pair_prediction.jsonl`
* Each `pair_prediction` row contributes up to 2 records (enhancer + promoter)
* Deduplicated by sequence hash (cross-source dedup)
* TF names case-normalized to JASPAR uppercase (e.g. `Ar`/`AR` → `AR`)
* Motif vocab matches ~80% of JASPAR2024 9606/CORE (the eval-MOODS scanner's database)
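
The dedup and case-normalization steps above can be sketched roughly as follows. This is not the extraction script itself: the hash function (SHA-256), the first-seen-wins policy, and summing the counts of case variants are all assumptions made for illustration, and both helper names are hypothetical.

```python
import hashlib
from collections import Counter


def normalize_tf_counts(raw_counts):
    """Fold TF names to uppercase, e.g. Ar/AR -> AR.

    Assumption: counts of case variants are summed; the real pipeline
    may combine them differently.
    """
    merged = Counter()
    for name, count in raw_counts.items():
        merged[name.upper()] += count
    return dict(merged)


def dedup_by_sequence(records):
    """Yield the first record seen for each distinct sequence.

    Assumption: SHA-256 of the raw sequence string is the dedup key.
    """
    seen = set()
    for record in records:
        key = hashlib.sha256(record["sequence"].encode("ascii")).hexdigest()
        if key not in seen:
            seen.add(key)
            yield record


# Toy records for illustration only.
toy_records = [
    {"sequence": "ACGT", "source": "from_tools_uncapped"},
    {"sequence": "ACGT", "source": "from_tools_tier1"},  # cross-source duplicate
    {"sequence": "TTGA", "source": "from_tools_tier1"},
]
print([r["source"] for r in dedup_by_sequence(toy_records)])
print(normalize_tf_counts({"Ar": 2, "AR": 3, "Sp1": 1}))
```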

## Usage

Load the merged file, the first step toward training a motif-count regressor:
```python
import json

# Read one JSON record per line; the file is about 1.4 GB, so this needs ample RAM.
rows = []
with open("merged_full/extracted.jsonl", encoding="utf-8") as f:
    for line in f:
        rows.append(json.loads(line))
print(f"{len(rows)} rows, vec dim = {len(rows[0]['motif_counts_vec'])}")
```

Targets are integer hit counts; we recommend training on
`torch.log1p(counts)` with an MSE loss, which compresses the long tail.
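
The arithmetic behind that recommendation is shown below in plain Python, so it runs without PyTorch; `torch.log1p`, `torch.expm1`, and `torch.nn.functional.mse_loss` are the tensor equivalents. The helper names and the example counts are illustrative, not taken from the dataset.

```python
import math


def to_log_targets(counts):
    """log1p maps 0 -> 0 and compresses large counts."""
    return [math.log1p(c) for c in counts]


def to_counts(log_values):
    """Invert log1p and round back to non-negative integer counts."""
    return [max(0, round(math.expm1(v))) for v in log_values]


def mse(pred, target):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


counts = [0, 12, 500]  # illustrative counts with one long-tail value
targets = to_log_targets(counts)
print(targets)             # the 500 lands only a few units above the 12
print(to_counts(targets))  # round trip recovers the original counts
print(mse([0.0] * len(targets), targets))  # loss of an all-zero prediction
```

In log space a single very large count no longer dominates the squared error, which is the point of the transform; predictions are mapped back to counts with `expm1` at inference time.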

Companion code: `regureasoner_loop/scripts/train_motif_classifier_production.py`
in the GitHub repo `explcre/biomodel_reasoning_calling_study2`. It
trains a `GroupedSeparateBank` (700 separate CNNs vectorized via grouped
Conv1d) on this data.
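
For readers unfamiliar with the grouped-convolution trick, the sketch below shows the general idea of running many independent small CNNs in one pass. It is not the `GroupedSeparateBank` from the companion repo: the class name, layer sizes, and mean pooling are assumptions, and it requires PyTorch.

```python
import torch
from torch import nn


class GroupedConvBankSketch(nn.Module):
    """n_banks independent two-layer CNNs evaluated as two Conv1d calls."""

    def __init__(self, n_banks=700, hidden=8, kernel_size=9):
        super().__init__()
        pad = kernel_size // 2
        # Every bank reads the same 4-channel one-hot DNA input, so a plain
        # conv with n_banks * hidden output channels gives each bank its
        # own private set of `hidden` filters.
        self.stem = nn.Conv1d(4, n_banks * hidden, kernel_size, padding=pad)
        # groups=n_banks: output channel b only sees bank b's hidden
        # channels, so no parameters are shared across banks.
        self.head = nn.Conv1d(
            n_banks * hidden, n_banks, kernel_size, padding=pad, groups=n_banks
        )

    def forward(self, x):
        # x: (batch, 4, length) one-hot encoded sequences
        h = torch.relu(self.stem(x))
        h = self.head(h)        # (batch, n_banks, length)
        return h.mean(dim=-1)   # (batch, n_banks), one prediction per TF


# Small toy configuration for illustration; random input stands in for one-hot DNA.
model = GroupedConvBankSketch(n_banks=5, hidden=2, kernel_size=3)
x = torch.randn(2, 4, 50)
out = model(x)
print(out.shape)
```

The output has one column per TF, so it can be regressed directly against the log-transformed `motif_counts_vec` targets.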