bernardo-de-almeida committed
Commit b7abdc5 · 1 Parent(s): c7d398c

feat: improve style of notebooks

notebooks/00_quickstart_inference.ipynb CHANGED
@@ -5,7 +5,7 @@
  "id": "024bb8a8",
  "metadata": {},
  "source": [
- "# NTv3 Quickstart — Pre-trained and Post-trained models\n",
+ "# 🚀 NTv3 Quickstart — Pre-trained and Post-trained models\n",
  "\n",
  "This notebook demonstrates how to run **quick inference** with both the pre- and post-trained NTv3 checkpoints:\n",
  "\n",
@@ -18,7 +18,7 @@
  "2. Run a forward pass on a DNA sequence window\n",
  "3. Inspect key outputs\n",
  "\n",
- "> **Note for Google Colab users:** This notebook is compatible with Colab! For faster inference, make sure to enable GPU: Runtime → Change runtime type → GPU (T4 or better recommended)."
+ "> 📝 **Note for Google Colab users:** This notebook is compatible with Colab! For faster inference, make sure to enable GPU: Runtime → Change runtime type → GPU (T4 or better recommended)."
  ]
  },
  {
@@ -26,7 +26,7 @@
  "id": "5d58bf1d",
  "metadata": {},
  "source": [
- "## 0) Colab Setup (if running on Google Colab)\n",
+ "## 0) ⚙️ Colab Setup (if running on Google Colab)\n",
  "\n",
  "This cell detects if you're running on Google Colab and sets up the environment accordingly."
  ]
@@ -46,7 +46,7 @@
  "id": "5827af7e",
  "metadata": {},
  "source": [
- "## 1) Imports + setup"
+ "## 1) 📦 Imports + setup"
  ]
  },
  {
@@ -95,7 +95,7 @@
  "id": "82146876",
  "metadata": {},
  "source": [
- "## 2) Pre-trained checkpoint (MLM-focused)\n",
+ "## 2) 🎯 Pre-trained checkpoint (MLM-focused)\n",
  "\n",
  "This shows the simplest usage: load model + tokenizer, then run a forward pass.\n",
  "\n",
@@ -285,7 +285,7 @@
  "id": "60a01798",
  "metadata": {},
  "source": [
- "## 3) Post-trained checkpoint (task heads: BigWig + BED)\n",
+ "## 3) 🧠 Post-trained checkpoint (task heads: BigWig + BED)\n",
  "\n",
  "Post-trained checkpoints add task-specific heads.\n",
  "\n",
@@ -298,7 +298,7 @@
  "- `bed_tracks_logits`\n",
  "- `logits` (MLM)\n",
  "\n",
- "> If your post-trained checkpoint supports multiple assemblies, the config typically exposes a mapping like `cfg.bigwigs_per_file_assembly`."
+ "> 💡 If your post-trained checkpoint supports multiple assemblies, the config typically exposes a mapping like `cfg.bigwigs_per_file_assembly`."
  ]
  },
  {
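
The `## 0) Colab Setup` cell in this notebook is described as detecting Google Colab, but its code falls outside the changed hunks. As a rough sketch of one common way to make that check, assuming detection is based on whether the `google.colab` package is importable (the helper name `running_in_colab` is hypothetical and the notebook's actual logic may differ):

```python
import importlib.util


def running_in_colab() -> bool:
    """Return True when the google.colab package is importable.

    Hypothetical helper: the notebook's real detection code is not
    shown in this diff and may work differently.
    """
    try:
        return importlib.util.find_spec("google.colab") is not None
    except (ImportError, ValueError):
        # find_spec raises ModuleNotFoundError (an ImportError) when the
        # parent package "google" is not installed at all.
        return False


if running_in_colab():
    print("Colab detected: enable a GPU runtime for faster inference.")
else:
    print("Not on Colab: using the local environment.")
```

Checking importability rather than importing the package keeps the cell free of side effects when it runs locally.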
notebooks/01_tracks_prediction.ipynb CHANGED
@@ -5,13 +5,13 @@
  "id": "7adaa9f8",
  "metadata": {},
  "source": [
- "# NTv3 Post-Trained Inference on Human Genomic Windows\n",
+ "# 🧬 NTv3 Post-Trained Inference on Human Genomic Windows\n",
  "\n",
  "This notebook demonstrates how to use the **NTv3 post-trained model** to predict functional genomics tracks and genomic element annotations from DNA sequences.\n",
  "\n",
- "> **Note for Google Colab users:** This notebook is compatible with Colab! For faster inference, make sure to enable GPU: Runtime → Change runtime type → GPU (T4 or better recommended).\n",
+ "> 📝 **Note for Google Colab users:** This notebook is compatible with Colab! For faster inference, make sure to enable GPU: Runtime → Change runtime type → GPU (T4 or better recommended).\n",
  "\n",
- "## Overview\n",
+ "## 📋 Overview\n",
  "\n",
  "Given a genomic window from the **human genome (hg38)**, the model performs inference and generates:\n",
  "\n",
@@ -19,7 +19,7 @@
  "- **Genomic element annotations** (`bed_tracks_logits`): Classification predictions for genomic elements such as genes, exons, introns, splice sites, promoters, enhancers, and more\n",
  "- **Masked Language Model logits** (`logits`): Standard transformer language model outputs\n",
  "\n",
- "## Notebook Structure\n",
+ "## 📚 Notebook Structure\n",
  "\n",
  "1. **Setup**: Install dependencies and define the genomic window of interest\n",
  "2. **Data Loading**: Download and fetch the chromosome sequence from UCSC\n",
@@ -27,7 +27,7 @@
  "4. **Inference**: Run the model on the genomic window to generate predictions\n",
  "5. **Visualization**: Plot selected functional tracks and genomic element predictions together in a unified view\n",
  "\n",
- "## Additional Features\n",
+ "## Additional Features\n",
  "\n",
  "- Supports multiple NTv3 post-trained models\n",
  "- Supports the 24 species that NTv3 was post-trained on"
@@ -89,7 +89,7 @@
  "id": "19db4774",
  "metadata": {},
  "source": [
- "## 1) Imports + configuration\n",
+ "## 1) 📦 Imports + configuration\n",
  "\n",
  "Set your NTv3 model and genomic window here"
  ]
@@ -162,7 +162,7 @@
  "id": "94b54a99",
  "metadata": {},
  "source": [
- "## 2) Fetch chromosome sequence for the chosen window"
+ "## 2) 📥 Fetch chromosome sequence for the chosen window"
  ]
  },
  {
@@ -229,7 +229,7 @@
  "id": "9f82945c",
  "metadata": {},
  "source": [
- "## 3) Load NTv3 model + tokenizers"
+ "## 3) 🤖 Load NTv3 model + tokenizers"
  ]
  },
  {
@@ -302,7 +302,7 @@
  "id": "70413b72",
  "metadata": {},
  "source": [
- "## 4) Tokenize the window and run inference\n",
+ "## 4) Tokenize the window and run inference\n",
  "\n",
  "We pass:\n",
  "\n",
@@ -360,7 +360,7 @@
  "id": "b8423e62",
  "metadata": {},
  "source": [
- "## 5) Plot functional tracks and genome annotation predictions\n",
+ "## 5) 📊 Plot functional tracks and genome annotation predictions\n",
  "\n",
  "This plots track probabilities for selected functional tracks and genomic elements.\n",
  "\n",
@@ -391,12 +391,12 @@
  },
  {
  "cell_type": "code",
- "execution_count": 15,
+ "execution_count": null,
  "id": "717539e2",
  "metadata": {},
  "outputs": [],
  "source": [
- "### Select functional tracks to plot\n",
+ "### 🎯 Select functional tracks to plot\n",
  "tracks_to_plot = {\n",
  " \"K562 RNA-seq\": \"ENCSR056HPM\",\n",
  " \"K562 DNAse\": \"ENCSR921NMD\",\n",
@@ -416,7 +416,7 @@
  " f\"Available tracks: {bigwig_names}\"\n",
  " )\n",
  " \n",
- "### Select genomic elements to plot\n",
+ "### 🧬 Select genomic elements to plot\n",
  "elements_to_plot = [\n",
  " \"protein_coding_gene\",\n",
  " \"exon\",\n",
@@ -491,7 +491,7 @@
  "id": "1ce34dc4",
  "metadata": {},
  "source": [
- "# To improve\n",
+ "# 💡 To improve\n",
  "- Add gene annotation at top"
  ]
  }
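
The hunk at `@@ -391,12 +391,12 @@` resets `"execution_count": 15` to `null`, which keeps run-order noise out of notebook diffs. A minimal sketch of applying the same reset to every code cell using only the standard library, assuming the nbformat 4 JSON layout (a top-level `cells` list whose entries carry a `cell_type` field); the function name `clear_execution_counts` is illustrative and not part of this repository:

```python
import json


def clear_execution_counts(notebook: dict) -> dict:
    """Set execution_count to None (JSON null) on every code cell, in place."""
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["execution_count"] = None
            # execute_result outputs carry their own execution_count field.
            for output in cell.get("outputs", []):
                if "execution_count" in output:
                    output["execution_count"] = None
    return notebook


# Tiny in-memory notebook standing in for a real .ipynb file.
demo = {
    "cells": [
        {"cell_type": "markdown", "source": ["# title"]},
        {"cell_type": "code", "execution_count": 15, "outputs": [], "source": []},
    ]
}
cleaned = clear_execution_counts(demo)
print(json.dumps(cleaned["cells"][1]["execution_count"]))  # prints: null
```

For a file on disk, the same function can be applied between `json.load` and `json.dump`; markdown cells are left untouched because they have no `execution_count` field.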