Commit b7abdc5
1 Parent(s): c7d398c
feat: improve style of notebooks
notebooks/00_quickstart_inference.ipynb
CHANGED
|
@@ -5,7 +5,7 @@
|
|
| 5 |
"id": "024bb8a8",
|
| 6 |
"metadata": {},
|
| 7 |
"source": [
|
| 8 |
-
"# NTv3 Quickstart — Pre-trained and Post-trained models\n",
|
| 9 |
"\n",
|
| 10 |
"This notebook demonstrates how to run **quick inference** with both the pre- and post-trained NTv3 checkpoints:\n",
|
| 11 |
"\n",
|
|
@@ -18,7 +18,7 @@
|
|
| 18 |
"2. Run a forward pass on a DNA sequence window\n",
|
| 19 |
"3. Inspect key outputs\n",
|
| 20 |
"\n",
|
| 21 |
-
"> **Note for Google Colab users:** This notebook is compatible with Colab! For faster inference, make sure to enable GPU: Runtime → Change runtime type → GPU (T4 or better recommended)."
|
| 22 |
]
|
| 23 |
},
|
| 24 |
{
|
|
@@ -26,7 +26,7 @@
|
|
| 26 |
"id": "5d58bf1d",
|
| 27 |
"metadata": {},
|
| 28 |
"source": [
|
| 29 |
-
"## 0) Colab Setup (if running on Google Colab)\n",
|
| 30 |
"\n",
|
| 31 |
"This cell detects if you're running on Google Colab and sets up the environment accordingly."
|
| 32 |
]
|
|
@@ -46,7 +46,7 @@
|
|
| 46 |
"id": "5827af7e",
|
| 47 |
"metadata": {},
|
| 48 |
"source": [
|
| 49 |
-
"## 1) Imports + setup"
|
| 50 |
]
|
| 51 |
},
|
| 52 |
{
|
|
@@ -95,7 +95,7 @@
|
|
| 95 |
"id": "82146876",
|
| 96 |
"metadata": {},
|
| 97 |
"source": [
|
| 98 |
-
"## 2) Pre-trained checkpoint (MLM-focused)\n",
|
| 99 |
"\n",
|
| 100 |
"This shows the simplest usage: load model + tokenizer, then run a forward pass.\n",
|
| 101 |
"\n",
|
|
@@ -285,7 +285,7 @@
|
|
| 285 |
"id": "60a01798",
|
| 286 |
"metadata": {},
|
| 287 |
"source": [
|
| 288 |
-
"## 3) Post-trained checkpoint (task heads: BigWig + BED)\n",
|
| 289 |
"\n",
|
| 290 |
"Post-trained checkpoints add task-specific heads.\n",
|
| 291 |
"\n",
|
|
@@ -298,7 +298,7 @@
|
|
| 298 |
"- `bed_tracks_logits`\n",
|
| 299 |
"- `logits` (MLM)\n",
|
| 300 |
"\n",
|
| 301 |
-
"> If your post-trained checkpoint supports multiple assemblies, the config typically exposes a mapping like `cfg.bigwigs_per_file_assembly`."
|
| 302 |
]
|
| 303 |
},
|
| 304 |
{
|
|
|
|
| 5 |
"id": "024bb8a8",
|
| 6 |
"metadata": {},
|
| 7 |
"source": [
|
| 8 |
+
"# 🚀 NTv3 Quickstart — Pre-trained and Post-trained models\n",
|
| 9 |
"\n",
|
| 10 |
"This notebook demonstrates how to run **quick inference** with both the pre- and post-trained NTv3 checkpoints:\n",
|
| 11 |
"\n",
|
|
|
|
| 18 |
"2. Run a forward pass on a DNA sequence window\n",
|
| 19 |
"3. Inspect key outputs\n",
|
| 20 |
"\n",
|
| 21 |
+
"> 📝 **Note for Google Colab users:** This notebook is compatible with Colab! For faster inference, make sure to enable GPU: Runtime → Change runtime type → GPU (T4 or better recommended)."
|
| 22 |
]
|
| 23 |
},
|
| 24 |
{
|
|
|
|
| 26 |
"id": "5d58bf1d",
|
| 27 |
"metadata": {},
|
| 28 |
"source": [
|
| 29 |
+
"## 0) ⚙️ Colab Setup (if running on Google Colab)\n",
|
| 30 |
"\n",
|
| 31 |
"This cell detects if you're running on Google Colab and sets up the environment accordingly."
|
| 32 |
]
|
|
|
|
| 46 |
"id": "5827af7e",
|
| 47 |
"metadata": {},
|
| 48 |
"source": [
|
| 49 |
+
"## 1) 📦 Imports + setup"
|
| 50 |
]
|
| 51 |
},
|
| 52 |
{
|
|
|
|
| 95 |
"id": "82146876",
|
| 96 |
"metadata": {},
|
| 97 |
"source": [
|
| 98 |
+
"## 2) 🎯 Pre-trained checkpoint (MLM-focused)\n",
|
| 99 |
"\n",
|
| 100 |
"This shows the simplest usage: load model + tokenizer, then run a forward pass.\n",
|
| 101 |
"\n",
|
|
|
|
| 285 |
"id": "60a01798",
|
| 286 |
"metadata": {},
|
| 287 |
"source": [
|
| 288 |
+
"## 3) 🧠 Post-trained checkpoint (task heads: BigWig + BED)\n",
|
| 289 |
"\n",
|
| 290 |
"Post-trained checkpoints add task-specific heads.\n",
|
| 291 |
"\n",
|
|
|
|
| 298 |
"- `bed_tracks_logits`\n",
|
| 299 |
"- `logits` (MLM)\n",
|
| 300 |
"\n",
|
| 301 |
+
"> 💡 If your post-trained checkpoint supports multiple assemblies, the config typically exposes a mapping like `cfg.bigwigs_per_file_assembly`."
|
| 302 |
]
|
| 303 |
},
|
| 304 |
{
|
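The `## 0) Colab Setup` markdown cell touched above says the following cell detects whether the notebook runs on Google Colab. The diff does not show that cell's code, so this is only a minimal sketch of one common way to do such a check. The helper name `running_in_colab` is hypothetical and the notebook's actual cell may differ.

```python
import sys


def running_in_colab() -> bool:
    """Return True when the google.colab package is already loaded.

    Assumption: a Colab kernel has google.colab in sys.modules, which is
    the usual idiom. Checking sys.modules avoids importing anything.
    """
    return "google.colab" in sys.modules


if running_in_colab():
    print("Colab detected: enable a GPU via Runtime -> Change runtime type.")
else:
    print("Not running on Colab: using the local environment.")
```

A membership check on `sys.modules` was chosen here because it has no side effects and works the same whether or not the `google` namespace package is installed.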
notebooks/01_tracks_prediction.ipynb
CHANGED
|
@@ -5,13 +5,13 @@
|
|
| 5 |
"id": "7adaa9f8",
|
| 6 |
"metadata": {},
|
| 7 |
"source": [
|
| 8 |
-
"# NTv3 Post-Trained Inference on Human Genomic Windows\n",
|
| 9 |
"\n",
|
| 10 |
"This notebook demonstrates how to use the **NTv3 post-trained model** to predict functional genomics tracks and genomic element annotations from DNA sequences.\n",
|
| 11 |
"\n",
|
| 12 |
-
"> **Note for Google Colab users:** This notebook is compatible with Colab! For faster inference, make sure to enable GPU: Runtime → Change runtime type → GPU (T4 or better recommended).\n",
|
| 13 |
"\n",
|
| 14 |
-
"## Overview\n",
|
| 15 |
"\n",
|
| 16 |
"Given a genomic window from the **human genome (hg38)**, the model performs inference and generates:\n",
|
| 17 |
"\n",
|
|
@@ -19,7 +19,7 @@
|
|
| 19 |
"- **Genomic element annotations** (`bed_tracks_logits`): Classification predictions for genomic elements such as genes, exons, introns, splice sites, promoters, enhancers, and more\n",
|
| 20 |
"- **Masked Language Model logits** (`logits`): Standard transformer language model outputs\n",
|
| 21 |
"\n",
|
| 22 |
-
"## Notebook Structure\n",
|
| 23 |
"\n",
|
| 24 |
"1. **Setup**: Install dependencies and define the genomic window of interest\n",
|
| 25 |
"2. **Data Loading**: Download and fetch the chromosome sequence from UCSC\n",
|
|
@@ -27,7 +27,7 @@
|
|
| 27 |
"4. **Inference**: Run the model on the genomic window to generate predictions\n",
|
| 28 |
"5. **Visualization**: Plot selected functional tracks and genomic element predictions together in a unified view\n",
|
| 29 |
"\n",
|
| 30 |
-
"## Additional Features\n",
|
| 31 |
"\n",
|
| 32 |
"- Supports multiple NTv3 post-trained models\n",
|
| 33 |
"- Supports the 24 species that NTv3 was post-trained on"
|
|
@@ -89,7 +89,7 @@
|
|
| 89 |
"id": "19db4774",
|
| 90 |
"metadata": {},
|
| 91 |
"source": [
|
| 92 |
-
"## 1) Imports + configuration\n",
|
| 93 |
"\n",
|
| 94 |
"Set your NTv3 model and genomic window here"
|
| 95 |
]
|
|
@@ -162,7 +162,7 @@
|
|
| 162 |
"id": "94b54a99",
|
| 163 |
"metadata": {},
|
| 164 |
"source": [
|
| 165 |
-
"## 2) Fetch chromosome sequence for the chosen window"
|
| 166 |
]
|
| 167 |
},
|
| 168 |
{
|
|
@@ -229,7 +229,7 @@
|
|
| 229 |
"id": "9f82945c",
|
| 230 |
"metadata": {},
|
| 231 |
"source": [
|
| 232 |
-
"## 3) Load NTv3 model + tokenizers"
|
| 233 |
]
|
| 234 |
},
|
| 235 |
{
|
|
@@ -302,7 +302,7 @@
|
|
| 302 |
"id": "70413b72",
|
| 303 |
"metadata": {},
|
| 304 |
"source": [
|
| 305 |
-
"## 4) Tokenize the window and run inference\n",
|
| 306 |
"\n",
|
| 307 |
"We pass:\n",
|
| 308 |
"\n",
|
|
@@ -360,7 +360,7 @@
|
|
| 360 |
"id": "b8423e62",
|
| 361 |
"metadata": {},
|
| 362 |
"source": [
|
| 363 |
-
"## 5) Plot functional tracks and genome annotation predictions\n",
|
| 364 |
"\n",
|
| 365 |
"This plots track probabilities for selected functional tracks and genomic elements.\n",
|
| 366 |
"\n",
|
|
@@ -391,12 +391,12 @@
|
|
| 391 |
},
|
| 392 |
{
|
| 393 |
"cell_type": "code",
|
| 394 |
-
"execution_count":
|
| 395 |
"id": "717539e2",
|
| 396 |
"metadata": {},
|
| 397 |
"outputs": [],
|
| 398 |
"source": [
|
| 399 |
-
"### Select functional tracks to plot\n",
|
| 400 |
"tracks_to_plot = {\n",
|
| 401 |
" \"K562 RNA-seq\": \"ENCSR056HPM\",\n",
|
| 402 |
" \"K562 DNAse\": \"ENCSR921NMD\",\n",
|
|
@@ -416,7 +416,7 @@
|
|
| 416 |
" f\"Available tracks: {bigwig_names}\"\n",
|
| 417 |
" )\n",
|
| 418 |
" \n",
|
| 419 |
-
"### Select genomic elements to plot\n",
|
| 420 |
"elements_to_plot = [\n",
|
| 421 |
" \"protein_coding_gene\",\n",
|
| 422 |
" \"exon\",\n",
|
|
@@ -491,7 +491,7 @@
|
|
| 491 |
"id": "1ce34dc4",
|
| 492 |
"metadata": {},
|
| 493 |
"source": [
|
| 494 |
-
"# To improve\n",
|
| 495 |
"- Add gene annotation at top"
|
| 496 |
]
|
| 497 |
}
|
|
|
|
| 5 |
"id": "7adaa9f8",
|
| 6 |
"metadata": {},
|
| 7 |
"source": [
|
| 8 |
+
"# 🧬 NTv3 Post-Trained Inference on Human Genomic Windows\n",
|
| 9 |
"\n",
|
| 10 |
"This notebook demonstrates how to use the **NTv3 post-trained model** to predict functional genomics tracks and genomic element annotations from DNA sequences.\n",
|
| 11 |
"\n",
|
| 12 |
+
"> 📝 **Note for Google Colab users:** This notebook is compatible with Colab! For faster inference, make sure to enable GPU: Runtime → Change runtime type → GPU (T4 or better recommended).\n",
|
| 13 |
"\n",
|
| 14 |
+
"## 📋 Overview\n",
|
| 15 |
"\n",
|
| 16 |
"Given a genomic window from the **human genome (hg38)**, the model performs inference and generates:\n",
|
| 17 |
"\n",
|
|
|
|
| 19 |
"- **Genomic element annotations** (`bed_tracks_logits`): Classification predictions for genomic elements such as genes, exons, introns, splice sites, promoters, enhancers, and more\n",
|
| 20 |
"- **Masked Language Model logits** (`logits`): Standard transformer language model outputs\n",
|
| 21 |
"\n",
|
| 22 |
+
"## 📚 Notebook Structure\n",
|
| 23 |
"\n",
|
| 24 |
"1. **Setup**: Install dependencies and define the genomic window of interest\n",
|
| 25 |
"2. **Data Loading**: Download and fetch the chromosome sequence from UCSC\n",
|
|
|
|
| 27 |
"4. **Inference**: Run the model on the genomic window to generate predictions\n",
|
| 28 |
"5. **Visualization**: Plot selected functional tracks and genomic element predictions together in a unified view\n",
|
| 29 |
"\n",
|
| 30 |
+
"## ✨ Additional Features\n",
|
| 31 |
"\n",
|
| 32 |
"- Supports multiple NTv3 post-trained models\n",
|
| 33 |
"- Supports the 24 species that NTv3 was post-trained on"
|
|
|
|
| 89 |
"id": "19db4774",
|
| 90 |
"metadata": {},
|
| 91 |
"source": [
|
| 92 |
+
"## 1) 📦 Imports + configuration\n",
|
| 93 |
"\n",
|
| 94 |
"Set your NTv3 model and genomic window here"
|
| 95 |
]
|
|
|
|
| 162 |
"id": "94b54a99",
|
| 163 |
"metadata": {},
|
| 164 |
"source": [
|
| 165 |
+
"## 2) 📥 Fetch chromosome sequence for the chosen window"
|
| 166 |
]
|
| 167 |
},
|
| 168 |
{
|
|
|
|
| 229 |
"id": "9f82945c",
|
| 230 |
"metadata": {},
|
| 231 |
"source": [
|
| 232 |
+
"## 3) 🤖 Load NTv3 model + tokenizers"
|
| 233 |
]
|
| 234 |
},
|
| 235 |
{
|
|
|
|
| 302 |
"id": "70413b72",
|
| 303 |
"metadata": {},
|
| 304 |
"source": [
|
| 305 |
+
"## 4) ⚡ Tokenize the window and run inference\n",
|
| 306 |
"\n",
|
| 307 |
"We pass:\n",
|
| 308 |
"\n",
|
|
|
|
| 360 |
"id": "b8423e62",
|
| 361 |
"metadata": {},
|
| 362 |
"source": [
|
| 363 |
+
"## 5) 📊 Plot functional tracks and genome annotation predictions\n",
|
| 364 |
"\n",
|
| 365 |
"This plots track probabilities for selected functional tracks and genomic elements.\n",
|
| 366 |
"\n",
|
|
|
|
| 391 |
},
|
| 392 |
{
|
| 393 |
"cell_type": "code",
|
| 394 |
+
"execution_count": null,
|
| 395 |
"id": "717539e2",
|
| 396 |
"metadata": {},
|
| 397 |
"outputs": [],
|
| 398 |
"source": [
|
| 399 |
+
"### 🎯 Select functional tracks to plot\n",
|
| 400 |
"tracks_to_plot = {\n",
|
| 401 |
" \"K562 RNA-seq\": \"ENCSR056HPM\",\n",
|
| 402 |
" \"K562 DNAse\": \"ENCSR921NMD\",\n",
|
|
|
|
| 416 |
" f\"Available tracks: {bigwig_names}\"\n",
|
| 417 |
" )\n",
|
| 418 |
" \n",
|
| 419 |
+
"### 🧬 Select genomic elements to plot\n",
|
| 420 |
"elements_to_plot = [\n",
|
| 421 |
" \"protein_coding_gene\",\n",
|
| 422 |
" \"exon\",\n",
|
|
|
|
| 491 |
"id": "1ce34dc4",
|
| 492 |
"metadata": {},
|
| 493 |
"source": [
|
| 494 |
+
"# 💡 To improve\n",
|
| 495 |
"- Add gene annotation at top"
|
| 496 |
]
|
| 497 |
}
|
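The plotting cell in `01_tracks_prediction.ipynb` maps display names to track IDs in `tracks_to_plot` and, judging by the `Available tracks: {bigwig_names}` message, reports when a requested ID is not available. Only fragments of that cell appear in this diff, so the following is a hedged sketch of such a lookup rather than the notebook's code. The helper `resolve_track_indices` is hypothetical, and the contents and ordering of `bigwig_names` are made up for illustration; only the two track IDs come from the diff.

```python
def resolve_track_indices(tracks_to_plot, bigwig_names):
    """Map each display name to the position of its track ID in bigwig_names."""
    index_by_id = {track_id: i for i, track_id in enumerate(bigwig_names)}
    missing = [tid for tid in tracks_to_plot.values() if tid not in index_by_id]
    if missing:
        # Fail early and list what is available, mirroring the notebook's message.
        raise ValueError(
            f"Unknown track IDs: {missing}. Available tracks: {bigwig_names}"
        )
    return {name: index_by_id[tid] for name, tid in tracks_to_plot.items()}


# Illustrative values: the IDs appear in the diff, the list order is assumed.
bigwig_names = ["ENCSR921NMD", "ENCSR056HPM"]
tracks_to_plot = {
    "K562 RNA-seq": "ENCSR056HPM",
    "K562 DNAse": "ENCSR921NMD",
}
track_indices = resolve_track_indices(tracks_to_plot, bigwig_names)
print(track_indices)
```

Resolving all names up front means a typo in a track ID surfaces before any plotting starts, instead of partway through a figure.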