Spaces: InstaDeepAI/ntv3

tpierrot committed on Dec 11, 2025

Commit

2101d19

1 Parent(s): 85f3120

Upload 02_genome_annotation.ipynb

Files changed (1)

notebooks/02_genome_annotation.ipynb +239 -0

notebooks/02_genome_annotation.ipynb ADDED

	@@ -0,0 +1,239 @@

+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "1ee06421",
+   "metadata": {},
+   "source": [
+    "# 🧬 NTv3 Post-Trained Genome Annotation\n",
+    "\n",
+    "This notebook demonstrates how to use the NTv3 post-trained model to perform genome annotation directly from a DNA sequence. It relies on a pipeline that applies a Hidden Markov Model (HMM) to the per-base probabilities returned by NTv3, converting them into a coherent gene model that respects biological constraints and valid transitions between genomic elements.\n",
+    "\n",
+    "The pipeline abstracts away all the underlying steps: running inference with the model, retrieving and processing the predicted probabilities, and applying the HMM to generate a consistent annotation. It returns the annotation as GFF3 text that, once saved to a file, can be visualized in any genome browser that supports GFF3.\n",
+    "\n",
+    "If you’re interested in exploring the intermediate probabilities, please refer to the track-prediction notebooks. These probabilities can be useful for assessing model confidence and identifying potentially interesting biological regions. This notebook focuses on the higher-level task of producing gene annotations directly from raw DNA.\n",
+    "\n",
+    "> 📝 **Note for Google Colab users:** This notebook is compatible with Colab! For faster inference, make sure to enable GPU: Runtime → Change runtime type → GPU (T4 or better recommended)."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "71fac239",
+   "metadata": {},
+   "source": [
+    "## 0) Colab Setup (if running on Google Colab)\n",
+    "\n",
+    "This cell installs the required dependencies. It is needed on Google Colab and in any environment where these packages are not already installed."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "2e2f5963",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Install dependencies\n",
+    "!pip -q install \"transformers>=4.55\" \"huggingface_hub>=0.23\" safetensors torch pyfaidx requests seaborn matplotlib igv_notebook"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "36d32e97",
+   "metadata": {},
+   "source": [
+    "## 1) 📦 Imports + configuration\n",
+    "\n",
+    "Set your NTv3 model and genomic window here."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "3f0a8e73",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import re\n",
+    "import time\n",
+    "import torch\n",
+    "import requests\n",
+    "from transformers import pipeline"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "423af70a",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Define the model and genomic window\n",
+    "model_name = \"InstaDeepAI/NTv3_650M\"\n",
+    "assembly = \"hg38\"\n",
+    "chrom = \"chr19\"\n",
+    "start = 6_700_000\n",
+    "end = 6_831_072"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "aee9541c",
+   "metadata": {},
+   "source": [
+    "## 2) 📥 Fetch chromosome sequence for the chosen window"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "b34378f1",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Get the sequence from the UCSC API\n",
+    "url = f\"https://api.genome.ucsc.edu/getData/sequence?genome={assembly};chrom={chrom};start={start};end={end}\"\n",
+    "seq = requests.get(url, timeout=60).json()[\"dna\"].upper()\n",
+    "print(f\"Original sequence length: {len(seq)}\")\n",
+    "\n",
+    "# Crop to multiple of 128 (the pipeline will crop again, but this is a no-op once divisible)\n",
+    "seq = seq[: len(seq) // 128 * 128]\n",
+    "print(f\"Cropped sequence length: {len(seq)}, {len(seq) // 128} tokens\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "442c4b03",
+   "metadata": {},
+   "source": [
+    "## 3) ⚡ Genome annotation pipeline (pre-processing, inference, post-processing)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "4857d15c",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Build NTv3 GFF pipeline\n",
+    "ntv3_gff = pipeline(\n",
+    "    \"ntv3-gff\",\n",
+    "    model=model_name,\n",
+    "    trust_remote_code=True,\n",
+    "    device=0 if torch.cuda.is_available() else -1,\n",
+    ")\n",
+    "\n",
+    "# Pipeline inputs (DNA -> NTv3 -> HMM -> GFF3)\n",
+    "inputs = {\n",
+    "    \"sequence\": seq,\n",
+    "    \"chrom\": chrom,\n",
+    "    \"start\": start,\n",
+    "    \"end\": end,\n",
+    "    \"assembly\": assembly,\n",
+    "}\n",
+    "\n",
+    "# Run the pipeline\n",
+    "start_time = time.time()\n",
+    "gff_text = ntv3_gff(inputs)\n",
+    "end_time = time.time()\n",
+    "print(f\"Inference + decoding time: {end_time - start_time:.2f} seconds\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "190ff65e",
+   "metadata": {},
+   "source": [
+    "## 4) 📁 Save a GFF file"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "959cf79f",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Save GFF3 file\n",
+    "short_model_name_match = re.search(r\"[^/]+$\", model_name)\n",
+    "short_model_name = short_model_name_match.group() if short_model_name_match else model_name\n",
+    "\n",
+    "output_filename = f\"{short_model_name}_{assembly}_{chrom}_{start}_{end}.gff3\"\n",
+    "with open(output_filename, \"w\", encoding=\"utf-8\") as output_file:\n",
+    "    output_file.write(gff_text)\n",
+    "\n",
+    "print(f\"Saved GFF file to {output_filename}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "291e0710",
+   "metadata": {},
+   "source": [
+    "## 5) 🌐 Create an IGV Browser"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "84f013f6",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import igv_notebook\n",
+    "\n",
+    "igv_notebook.init()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "0904a5cb",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "config = {\n",
+    "    \"genome\": assembly,  # built-in IGV reference genome (hg38 here)\n",
+    "    \"locus\": f\"{chrom}:{start}-{end}\",\n",
+    "}\n",
+    "\n",
+    "gff_track = {\n",
+    "    \"name\": \"NTv3 annotations\",\n",
+    "    \"format\": \"gff3\",\n",
+    "    \"type\": \"annotation\",\n",
+    "    \"url\": output_filename,  # just the filename\n",
+    "    # \"height\": 200,\n",
+    "}\n",
+    "\n",
+    "browser = igv_notebook.Browser(config)\n",
+    "browser.load_track(gff_track)\n",
+    "\n",
+    "# Re-center on the region, just to be sure\n",
+    "browser.search(f\"{chrom}:{start}-{end}\")\n",
+    "browser  # <- just return the object, no .show()"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.12.2"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
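
Before loading the saved file into IGV, it can help to check what the annotation contains. The sketch below counts features per type in GFF3 text such as the `gff_text` string written out in step 4. It assumes the pipeline output follows the standard nine-column, tab-separated GFF3 layout. The helper name `summarize_gff3`, the `NTv3` source label, and the example records are hypothetical and only illustrate the idea; the feature types the pipeline actually emits may differ.

```python
from collections import Counter


def summarize_gff3(gff_text: str) -> Counter:
    """Count features per type (column 3) in GFF3-formatted text.

    Hypothetical helper: assumes standard nine-column, tab-separated GFF3.
    """
    counts = Counter()
    for line in gff_text.splitlines():
        # Skip blank lines, directives ("##...") and comments ("#...")
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        # Ignore anything that is not a well-formed feature record
        if len(fields) != 9:
            continue
        counts[fields[2]] += 1
    return counts


# Made-up records for illustration only (not real NTv3 output)
example_gff = "\n".join(
    [
        "##gff-version 3",
        "chr19\tNTv3\tgene\t6700101\t6705000\t.\t+\t.\tID=gene1",
        "chr19\tNTv3\texon\t6700101\t6700500\t.\t+\t.\tParent=gene1",
        "chr19\tNTv3\texon\t6704000\t6705000\t.\t+\t.\tParent=gene1",
    ]
)

for feature_type, n in sorted(summarize_gff3(example_gff).items()):
    print(f"{feature_type}: {n}")
```

In the notebook, calling `summarize_gff3(gff_text)` after step 3 would give a quick sanity check that the pipeline produced a non-empty annotation for the chosen window.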