explcre/dnathinker-checkpoints

explcre committed on Apr 29

Commit 086e622 (verified) · 1 parent: 7a2d79b

Upload motif_targets/merged_full/README.md with huggingface_hub

motif_targets/merged_full/README.md +67 -0

	@@ -0,0 +1,67 @@

+# Motif extraction dataset for cell-type-specific enhancer generation
+Per-row TF motif counts extracted from the lab's existing FIMO scans
+(embedded in the tool-context blocks of `pair_prediction.jsonl`). No MOODS
+re-scan is needed: the counts are parsed from the existing scan text with regular expressions.
+**Total: 1,867,057 unique enhancer/promoter sequences × 700
+case-normalized TFs.**
+## Files
+* `merged_full/extracted.jsonl` (1.4 GB, **the main file**) — merged dataset.
+  Each row:
+  ```json
+  {
+    "id": "train:pair_pos:OPC:LINC02593:chr1_942043_942543:57::enh",
+    "cell_type": "OPC",
+    "region": "enhancer",  // or "promoter"
+    "source": "from_tools_uncapped",  // or "from_tools_tier1"
+    "sequence": "CGGCAATTAGCGGAGGCGGCGGGGGAGGGGCGCCGGGGCC...",
+    "motif_counts_vec": [0, 12, 0, 9, 5, 0, 0, ...]  // length 700, integer counts
+  }
+  ```
+* `merged_full/motif_vocab.json` — TF name → vec index (700 entries, sorted alphabetically by name)
+* `merged_full/summary.json` — row counts per cell type and per source
+* `from_tools_uncapped/` — main `pair_prediction.jsonl` extract: 358,694 rows
+* `from_tools_tier1/` — `tier1` extract: 1,508,363 rows (much less duplication)
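A minimal sketch of reading one row against the vocabulary, assuming `motif_vocab.json` is a flat JSON object mapping each TF name to its integer index, as the file list above suggests. The helper `top_motifs` and the toy vocabulary and row are hypothetical and only illustrate the lookup:

```python
import json

def top_motifs(row, vocab, k=5):
    """Return up to k (TF name, count) pairs with the highest nonzero counts."""
    # Invert the name -> index map so vector positions can be named.
    index_to_name = {idx: name for name, idx in vocab.items()}
    counts = row["motif_counts_vec"]
    # Sort positions by count, largest first (ties keep index order).
    ranked = sorted(range(len(counts)), key=lambda i: counts[i], reverse=True)
    return [(index_to_name[i], counts[i]) for i in ranked[:k] if counts[i] > 0]

# Toy stand-ins. With the real files one would instead load, for example:
#   with open("merged_full/motif_vocab.json") as f:
#       vocab = json.load(f)
toy_vocab = {"AR": 0, "CTCF": 1, "SOX10": 2, "SP1": 3}
toy_row = {"motif_counts_vec": [0, 12, 3, 9]}

print(top_motifs(toy_row, toy_vocab, k=2))
```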
+## Per-cell-type distribution
+| Cell type | Sequences |
+|---|---|
+| Ex   | 439,121 |
+| Mic  | 362,755 |
+| Oli  | 311,423 |
+| Ast  | 252,907 |
+| In   | 249,892 |
+| OPC  | 207,407 |
+| End  |  43,552 |
+## Provenance
+* Source files: `data/full_enriched_v2_with_enh_scan{,_tier1}/jsonl/train.pair_prediction.jsonl`
+* Each pair_prediction row contributes up to 2 records (enhancer + promoter)
+* Deduplicated by sequence hash (cross-source dedup)
+* TF names case-normalized to JASPAR uppercase (e.g. `Ar`/`AR` → `AR`)
+* Motif vocab matches ~80% of JASPAR2024 9606/CORE (the eval-MOODS scanner's database)
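The case normalization and cross-source deduplication listed above could look roughly like the following. This is a sketch, not the extraction script itself: the hash function (SHA-1), the keep-first policy for duplicates, and the helper names are assumptions, since the README does not specify them.

```python
import hashlib

def normalize_tf(name):
    """Map a TF name to uppercase, e.g. 'Ar' -> 'AR'."""
    return name.upper()

def dedup_by_sequence(records):
    """Keep the first record seen for each distinct sequence (assumed policy)."""
    seen = set()
    kept = []
    for rec in records:
        # Hash the sequence so the 'seen' set stores short digests, not full strings.
        digest = hashlib.sha1(rec["sequence"].encode("ascii")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(rec)
    return kept

# Toy records: the second repeats the first sequence from the other source.
records = [
    {"source": "from_tools_uncapped", "sequence": "ACGTACGT"},
    {"source": "from_tools_tier1", "sequence": "ACGTACGT"},
    {"source": "from_tools_tier1", "sequence": "TTGACCAA"},
]
unique = dedup_by_sequence(records)
names = [normalize_tf(n) for n in ("Ar", "AR", "Sox10")]
print(len(unique), names)
```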
+## Usage
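+Load the rows (the input for training a motif-count regressor):
+```python
+import json
+
+# One JSON record per line. The file is about 1.4 GB on disk, and parsing it
+# all into Python objects takes considerably more memory than that.
+with open("merged_full/extracted.jsonl") as f:
+    rows = [json.loads(line) for line in f]
+print(f"{len(rows)} rows, vec dim = {len(rows[0]['motif_counts_vec'])}")
+```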
+Targets are integer hit counts. For training we recommend regressing on
+`torch.log1p(counts)`, which compresses the long tail, with an MSE loss.
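As a sketch of that recommendation, assuming the counts have been stacked into a float tensor. The toy values and the placeholder `preds` are hypothetical; a real model would produce `preds` from the sequence.

```python
import torch
import torch.nn.functional as F

# Toy batch of two count vectors (real rows have 700 entries).
counts = torch.tensor([[0, 12, 0, 9, 5], [3, 0, 0, 1, 40]], dtype=torch.float32)

# Train in log space: log1p(0) == 0, and large counts are compressed.
targets = torch.log1p(counts)

# Placeholder predictions standing in for a model's output.
preds = torch.zeros_like(targets, requires_grad=True)
loss = F.mse_loss(preds, targets)
loss.backward()

# To report counts again, invert the transform with expm1 and round.
recovered = torch.expm1(targets).round()
```

Rounding after `expm1` is only needed when integer counts are wanted back; the loss itself is computed entirely in log space.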
+Companion code: `regureasoner_loop/scripts/train_motif_classifier_production.py`
+in the GitHub repo `explcre/biomodel_reasoning_calling_study2`. It
+trains a `GroupedSeparateBank` (700 separate CNNs, vectorized with a grouped
+`Conv1d`) on this data.
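The actual `GroupedSeparateBank` lives in the companion repo and is not reproduced here. As a rough illustration of the general idea of running many independent per-TF CNNs as one grouped `Conv1d`, here is a sketch assuming one-hot DNA input of shape `(batch, 4, length)`. The class name `TinyGroupedBank`, the layer sizes, and the max-pooling are all hypothetical choices.

```python
import torch
import torch.nn as nn

class TinyGroupedBank(nn.Module):
    """n_tfs small CNNs evaluated in parallel, one scalar output per TF."""

    def __init__(self, n_tfs=700, width=8, kernel=9):
        super().__init__()
        pad = kernel // 2
        # Stem: every TF gets its own `width` filters over the shared 4-channel input.
        self.stem = nn.Conv1d(4, n_tfs * width, kernel, padding=pad)
        # groups=n_tfs: each TF's channel block is convolved only with its own filters.
        self.mid = nn.Conv1d(n_tfs * width, n_tfs * width, kernel,
                             padding=pad, groups=n_tfs)
        # Per-TF linear head, written as a grouped 1x1 convolution.
        self.head = nn.Conv1d(n_tfs * width, n_tfs, 1, groups=n_tfs)

    def forward(self, x):
        h = torch.relu(self.stem(x))
        h = torch.relu(self.mid(h))
        h = h.amax(dim=-1, keepdim=True)   # global max-pool over sequence length
        return self.head(h).squeeze(-1)    # shape: (batch, n_tfs)

# Small demo with 6 "TFs" and random input in place of one-hot sequences.
model = TinyGroupedBank(n_tfs=6, width=4, kernel=5)
x = torch.randn(2, 4, 50)
out = model(x)
print(out.shape)
```

With `groups=n_tfs`, the per-TF networks share the input but no weights, so a single kernel launch does the work of many separate small models.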