tpierrot committed on
Commit 2101d19 · verified · 1 Parent(s): 85f3120

Upload 02_genome_annotation.ipynb

Files changed (1)
  1. notebooks/02_genome_annotation.ipynb +239 -0
notebooks/02_genome_annotation.ipynb ADDED
@@ -0,0 +1,239 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "id": "1ee06421",
6
+ "metadata": {},
7
+ "source": [
8
+ "# 🧬 NTv3 Post-Trained Genome Annotation\n",
9
+ "\n",
10
+ "This notebook demonstrates how to use the NTv3 post-trained model to perform genome annotation directly from a DNA sequence. It relies on a pipeline that applies a Hidden Markov Model (HMM) to the per-base probabilities returned by NTv3, converting them into a coherent gene model that respects biological constraints and valid transitions between genomic elements.\n",
11
+ "\n",
12
+ "The pipeline abstracts away all the underlying steps: running inference with the model, retrieving and processing the predicted probabilities, and applying the HMM to generate a consistent annotation. It returns a ready-to-use GFF file that can be visualized in any genome browser for the sequence of interest.\n",
13
+ "\n",
14
+ "If you’re interested in exploring the intermediate probabilities, please refer to the track-prediction notebooks. These probabilities can be useful for assessing model confidence and identifying potentially interesting biological regions. This notebook focuses on the higher-level task of producing gene annotations directly from raw DNA.\n",
15
+ "\n",
16
+ "> 📝 **Note for Google Colab users:** This notebook is compatible with Colab! For faster inference, make sure to enable GPU: Runtime → Change runtime type → GPU (T4 or better recommended)."
17
+ ]
18
+ },
19
+ {
20
+ "cell_type": "markdown",
21
+ "id": "71fac239",
22
+ "metadata": {},
23
+ "source": [
24
+ "## 0) Colab Setup (if running on Google Colab)\n",
25
+ "\n",
26
+ "The next cell installs the Python dependencies the notebook needs; run it when working on Google Colab or in any other fresh environment."
27
+ ]
28
+ },
29
+ {
30
+ "cell_type": "code",
31
+ "execution_count": null,
32
+ "id": "2e2f5963",
33
+ "metadata": {},
34
+ "outputs": [],
35
+ "source": [
36
+ "# Install dependencies\n",
37
+ "!pip -q install \"transformers>=4.55\" \"huggingface_hub>=0.23\" safetensors torch pyfaidx requests seaborn matplotlib igv_notebook"
38
+ ]
39
+ },
40
+ {
41
+ "cell_type": "markdown",
42
+ "id": "36d32e97",
43
+ "metadata": {},
44
+ "source": [
45
+ "## 1) 📦 Imports + configuration\n",
46
+ "\n",
47
+ "Set the NTv3 model and the genomic window to annotate here."
48
+ ]
49
+ },
50
+ {
51
+ "cell_type": "code",
52
+ "execution_count": null,
53
+ "id": "3f0a8e73",
54
+ "metadata": {},
55
+ "outputs": [],
56
+ "source": [
57
+ "import re\n",
58
+ "import time\n",
59
+ "import torch\n",
60
+ "import requests\n",
61
+ "from transformers import pipeline"
62
+ ]
63
+ },
64
+ {
65
+ "cell_type": "code",
66
+ "execution_count": null,
67
+ "id": "423af70a",
68
+ "metadata": {},
69
+ "outputs": [],
70
+ "source": [
71
+ "# Define the model and genomic window\n",
72
+ "model_name = \"InstaDeepAI/NTv3_650M\"\n",
73
+ "assembly = \"hg38\"\n",
74
+ "chrom = \"chr19\"\n",
75
+ "start = 6_700_000\n",
76
+ "end = 6_831_072"
77
+ ]
78
+ },
79
+ {
80
+ "cell_type": "markdown",
81
+ "id": "aee9541c",
82
+ "metadata": {},
83
+ "source": [
84
+ "## 2) 📥 Fetch chromosome sequence for the chosen window"
85
+ ]
86
+ },
87
+ {
88
+ "cell_type": "code",
89
+ "execution_count": null,
90
+ "id": "b34378f1",
91
+ "metadata": {},
92
+ "outputs": [],
93
+ "source": [
94
+ "# Get the sequence from the UCSC API\n",
95
+ "url = f\"https://api.genome.ucsc.edu/getData/sequence?genome={assembly};chrom={chrom};start={start};end={end}\"\n",
96
+ "seq = requests.get(url, timeout=60).json()[\"dna\"].upper()\n",
97
+ "print(f\"Original sequence length: {len(seq)}\")\n",
98
+ "\n",
99
+ "# Crop to multiple of 128 (the pipeline will crop again, but this is a no-op once divisible)\n",
100
+ "seq = seq[: len(seq) // 128 * 128]\n",
101
+ "print(f\"Cropped sequence length: {len(seq)}, {len(seq) // 128} tokens\")"
102
+ ]
103
+ },
104
+ {
105
+ "cell_type": "markdown",
106
+ "id": "442c4b03",
107
+ "metadata": {},
108
+ "source": [
109
+ "## 3) ⚡ Genome annotation pipeline (pre-processing, inference, post-processing)"
110
+ ]
111
+ },
112
+ {
113
+ "cell_type": "code",
114
+ "execution_count": null,
115
+ "id": "4857d15c",
116
+ "metadata": {},
117
+ "outputs": [],
118
+ "source": [
119
+ "# Build NTv3 GFF pipeline\n",
120
+ "ntv3_gff = pipeline(\n",
121
+ "    \"ntv3-gff\",\n",
122
+ "    model=model_name,\n",
123
+ "    trust_remote_code=True,\n",
124
+ "    device=0 if torch.cuda.is_available() else -1,\n",
125
+ ")\n",
126
+ "\n",
127
+ "# Pipeline inputs: the DNA sequence and its genomic coordinates\n",
128
+ "inputs = {\n",
129
+ "    \"sequence\": seq,\n",
130
+ "    \"chrom\": chrom,\n",
131
+ "    \"start\": start,\n",
132
+ "    \"end\": end,\n",
133
+ "    \"assembly\": assembly,\n",
134
+ "}\n",
135
+ "\n",
136
+ "# Run the pipeline: DNA -> NTv3 -> HMM -> GFF3\n",
137
+ "start_time = time.time()\n",
138
+ "gff_text = ntv3_gff(inputs)\n",
139
+ "end_time = time.time()\n",
140
+ "print(f\"Inference + decoding time: {end_time - start_time:.2f} seconds\")"
141
+ ]
142
+ },
143
+ {
144
+ "cell_type": "markdown",
145
+ "id": "190ff65e",
146
+ "metadata": {},
147
+ "source": [
148
+ "## 4) 📁 Save a GFF file"
149
+ ]
150
+ },
151
+ {
152
+ "cell_type": "code",
153
+ "execution_count": null,
154
+ "id": "959cf79f",
155
+ "metadata": {},
156
+ "outputs": [],
157
+ "source": [
158
+ "# Save GFF3 file\n",
159
+ "short_model_name_match = re.search(r\"[^/]+$\", model_name)\n",
160
+ "short_model_name = short_model_name_match.group() if short_model_name_match else model_name\n",
161
+ "\n",
162
+ "output_filename = f\"{short_model_name}_{assembly}_{chrom}_{start}_{end}.gff3\"\n",
163
+ "with open(output_filename, \"w\") as output_file:\n",
164
+ "    output_file.write(gff_text)\n",
165
+ "\n",
166
+ "print(f\"Saved GFF file to {output_filename}\")"
167
+ ]
168
+ },
169
+ {
170
+ "cell_type": "markdown",
171
+ "id": "291e0710",
172
+ "metadata": {},
173
+ "source": [
174
+ "## 5) 🌐 Create an IGV Browser"
175
+ ]
176
+ },
177
+ {
178
+ "cell_type": "code",
179
+ "execution_count": null,
180
+ "id": "84f013f6",
181
+ "metadata": {},
182
+ "outputs": [],
183
+ "source": [
184
+ "import igv_notebook\n",
185
+ "\n",
186
+ "igv_notebook.init()"
187
+ ]
188
+ },
189
+ {
190
+ "cell_type": "code",
191
+ "execution_count": null,
192
+ "id": "0904a5cb",
193
+ "metadata": {},
194
+ "outputs": [],
195
+ "source": [
196
+ "config = {\n",
197
+ "    \"genome\": \"hg38\",  # built-in hg38\n",
198
+ "    \"locus\": f\"{chrom}:{start}-{end}\",\n",
199
+ "}\n",
200
+ "\n",
201
+ "gff_track = {\n",
202
+ "    \"name\": \"NTv3 annotations\",\n",
203
+ "    \"format\": \"gff3\",\n",
204
+ "    \"type\": \"annotation\",\n",
205
+ "    \"url\": output_filename,  # just the filename\n",
206
+ "    # \"height\": 200,\n",
207
+ "}\n",
208
+ "\n",
209
+ "browser = igv_notebook.Browser(config)\n",
210
+ "browser.load_track(gff_track)\n",
211
+ "\n",
212
+ "# Re-center on the region, just to be sure\n",
213
+ "browser.search(f\"{chrom}:{start}-{end}\")\n",
214
+ "browser # <- just return the object, no .show()"
215
+ ]
216
+ }
217
+ ],
218
+ "metadata": {
219
+ "kernelspec": {
220
+ "display_name": "Python 3 (ipykernel)",
221
+ "language": "python",
222
+ "name": "python3"
223
+ },
224
+ "language_info": {
225
+ "codemirror_mode": {
226
+ "name": "ipython",
227
+ "version": 3
228
+ },
229
+ "file_extension": ".py",
230
+ "mimetype": "text/x-python",
231
+ "name": "python",
232
+ "nbconvert_exporter": "python",
233
+ "pygments_lexer": "ipython3",
234
+ "version": "3.12.2"
235
+ }
236
+ },
237
+ "nbformat": 4,
238
+ "nbformat_minor": 5
239
+ }
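
Two steps in the notebook above are easy to sanity-check on their own: cropping the sequence to a multiple of 128 bases, and inspecting the GFF3 text before loading it into IGV. The sketch below is not part of the commit. `crop_to_multiple` and `count_gff3_features` are hypothetical helper names, and it assumes that the pipeline returns standard GFF3 as a single string (the notebook writes `gff_text` straight to a `.gff3` file) and that the window is 0-based and half-open, so the end coordinate equals `start + len(seq)`. The records in the usage lines are toy data, not NTv3 output.

```python
from collections import Counter

BLOCK = 128  # crop granularity used in the notebook's fetch cell


def crop_to_multiple(seq: str, start: int, block: int = BLOCK) -> tuple[str, int]:
    """Trim seq to a whole number of blocks and return it with the matching end coordinate.

    Assumes a 0-based, half-open window, so end == start + len(cropped sequence).
    """
    usable = len(seq) // block * block
    return seq[:usable], start + usable


def count_gff3_features(gff_text: str) -> Counter:
    """Count GFF3 records by feature type (column 3), skipping comments and blank lines."""
    counts = Counter()
    for line in gff_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) == 9:  # a well-formed GFF3 record has nine tab-separated columns
            counts[fields[2]] += 1
    return counts


# Toy usage: 400 bases are cropped to 384 (3 blocks of 128).
toy_seq, toy_end = crop_to_multiple("ACGT" * 100, start=6_700_000)
toy_gff = (
    "##gff-version 3\n"
    "chr19\ttoy\tgene\t1\t100\t.\t+\t.\tID=g1\n"
    "chr19\ttoy\texon\t1\t50\t.\t+\t.\tParent=g1\n"
)
print(len(toy_seq), toy_end)
print(count_gff3_features(toy_gff))
```

With the window configured in the notebook (131,072 bases, exactly 1,024 blocks of 128) the crop changes nothing. For other windows, whether the pipeline expects `end` to match the cropped length is not stated in the notebook, so it is worth checking before passing the original `end` through.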