bernardo-de-almeida committed on
Commit 76bb74c · 1 Parent(s): fb89b3c

uniformize notebooks

notebooks/00_quickstart_inference.ipynb CHANGED
@@ -23,12 +23,10 @@
  },
  {
  "cell_type": "markdown",
- "id": "5d58bf1d",
+ "id": "5827af7e",
  "metadata": {},
  "source": [
- "## 0) ⚙️ Colab Setup (if running on Google Colab)\n",
- "\n",
- "This cell detects if you're running on Google Colab and sets up the environment accordingly."
+ "## 0) 📦 Imports + setup"
  ]
  },
  {
@@ -41,14 +39,6 @@
  "!pip -q install \"transformers>=4.40\" \"huggingface_hub>=0.23\" safetensors torch numpy"
  ]
  },
- {
- "cell_type": "markdown",
- "id": "5827af7e",
- "metadata": {},
- "source": [
- "## 1) 📦 Imports + setup"
- ]
- },
  {
  "cell_type": "code",
  "execution_count": 3,
@@ -95,7 +85,7 @@
  "id": "82146876",
  "metadata": {},
  "source": [
- "## 2) 🎯 Pre-trained checkpoint (MLM-focused)\n",
+ "## 1) 🎯 Pre-trained checkpoint (MLM-focused)\n",
  "\n",
  "This shows the simplest usage: load model + tokenizer, then run a forward pass.\n",
  "\n",
@@ -285,7 +275,7 @@
  "id": "60a01798",
  "metadata": {},
  "source": [
- "## 3) 🧠 Post-trained checkpoint (task heads: BigWig + BED)\n",
+ "## 2) 🧠 Post-trained checkpoint (task heads: BigWig + BED)\n",
  "\n",
  "Post-trained checkpoints add task-specific heads for functional track prediction and genome annotation.\n",
  "\n",
notebooks/01_tracks_prediction.ipynb CHANGED
@@ -35,21 +35,20 @@
  "- Supports the 24 species that NTv3 was post-trained on"
  ]
  },
+ {
+ "cell_type": "markdown",
+ "id": "77046e68",
+ "metadata": {},
+ "source": [
+ "## 0) 📦 Imports + setup"
+ ]
+ },
  {
  "cell_type": "code",
- "execution_count": null,
+ "execution_count": 1,
  "id": "0ff509fd",
  "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "\u001b[33mWARNING: 401 Error, Credentials not correct for https://gitlab.com/api/v4/projects/36813343/packages/pypi/simple/seaborn/\u001b[0m\u001b[33m\n",
- "\u001b[0m"
- ]
- }
- ],
+ "outputs": [],
  "source": [
  "# Install dependencies\n",
  "!pip -q install \"transformers>=4.55\" \"huggingface_hub>=0.23\" safetensors torch pyfaidx requests seaborn matplotlib"
@@ -57,7 +56,7 @@
  },
  {
  "cell_type": "code",
- "execution_count": 7,
+ "execution_count": 2,
  "id": "608d67e1",
  "metadata": {},
  "outputs": [],
@@ -76,35 +75,48 @@
  "import seaborn as sns"
  ]
  },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "id": "2354e2aa",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "device: cpu dtype: torch.float16\n"
+ ]
+ }
+ ],
+ "source": [
+ "# Device\n",
+ "device = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n",
+ "dtype = torch.bfloat16 if (device == \"cuda\" and torch.cuda.get_device_capability(0)[0] >= 8) else torch.float16\n",
+ "print(\"device:\", device, \"dtype:\", dtype)"
+ ]
+ },
  {
  "cell_type": "markdown",
  "id": "19db4774",
  "metadata": {},
  "source": [
- "## 1) 📦 Imports + configuration\n",
+ "## 1) 📦 Configuration\n",
  "\n",
  "Set your NTv3 model and genomic window here"
  ]
  },
  {
  "cell_type": "code",
- "execution_count": 8,
+ "execution_count": 4,
  "id": "795a576f",
  "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "window length: 131072\n"
- ]
- }
- ],
+ "outputs": [],
  "source": [
  "# -----------------------------\n",
  "# User inputs\n",
  "# -----------------------------\n",
- "model_name = \"InstaDeepAI/NTv3_100M_pos\" # options: \"InstaDeepAI/ntv3_106M_7downsample_post_trained_1mb\" or \"InstaDeepAI/ntv3_650M_7downsample_post_trained_1mb_v2\"\n",
+ "model_name = \"InstaDeepAI/NTv3_100M_pos\" # options: \"InstaDeepAI/NTv3_100M_pos\" or \"InstaDeepAI/NTv3_650M_pos\"\n",
  "\n",
  "# Example window from a given species (edit these) - needs to be multiple of 128 due to the model downsampling\n",
  "species = \"human\" # will use for condition the model on species\n",
@@ -114,40 +126,7 @@
  "end = 6_831_072\n",
  "\n",
  "# Optional\n",
- "HF_TOKEN = os.getenv(\"HF_TOKEN\", None)\n",
- "\n",
- "assert end > start, \"end must be > start\"\n",
- "window_len = end - start\n",
- "assert window_len % 128 == 0, f\"window length ({window_len}) must be a multiple of 128\"\n",
- "print(\"window length:\", window_len)\n",
- "\n",
- "# Simple DNA sanitization\n",
- "DNA_RE = re.compile(r\"^[ACGTNacgtn]+$\")\n",
- "def sanitize_dna(seq: str) -> str:\n",
- " seq = seq.upper()\n",
- " seq = re.sub(r\"[^ACGTN]\", \"N\", seq)\n",
- " return seq"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 4,
- "id": "2354e2aa",
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "device: cpu dtype: torch.float16\n"
- ]
- }
- ],
- "source": [
- "# Device\n",
- "device = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n",
- "dtype = torch.bfloat16 if (device == \"cuda\" and torch.cuda.get_device_capability(0)[0] >= 8) else torch.float16\n",
- "print(\"device:\", device, \"dtype:\", dtype)"
+ "HF_TOKEN = os.getenv(\"HF_TOKEN\", None)"
  ]
  },
  {
@@ -160,62 +139,28 @@
  },
  {
  "cell_type": "code",
- "execution_count": null,
- "id": "8c20066a",
+ "execution_count": 5,
+ "id": "2e0026e4",
  "metadata": {},
  "outputs": [
  {
  "name": "stdout",
  "output_type": "stream",
  "text": [
- "Downloading: https://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/chr19.fa.gz\n",
- "Using downloaded chromosome FASTA: ./hg38/chr19.fa\n",
- "Sequence preview: GTCAACAATAACAAATGACATATTAGTAGTAAATTATAATTATACATTACAACAAAATTA...\n",
- "Valid DNA: True\n"
+ "Original sequence length: 131072\n",
+ "Cropped sequence length: 131072, 1024.0 transformer tokens\n"
  ]
  }
  ],
  "source": [
- "def download_ucsc_chrom_fasta(chrom: str, assembly: str, out_dir: str = f\"./{assembly}\") -> str:\n",
- " \"\"\"Download a single chromosome FASTA from UCSC and return local path.\"\"\"\n",
- " os.makedirs(out_dir, exist_ok=True)\n",
- " gz_path = os.path.join(out_dir, f\"{chrom}.fa.gz\")\n",
- " fa_path = os.path.join(out_dir, f\"{chrom}.fa\")\n",
- "\n",
- " if os.path.exists(fa_path):\n",
- " return fa_path\n",
- "\n",
- " # UCSC chrom fasta (chromFa/)\n",
- " url = f\"https://hgdownload.soe.ucsc.edu/goldenPath/{assembly}/chromosomes/{chrom}.fa.gz\"\n",
- " print(\"Downloading:\", url)\n",
- " r = requests.get(url, stream=True)\n",
- " r.raise_for_status()\n",
- " with open(gz_path, \"wb\") as f:\n",
- " for chunk in r.iter_content(chunk_size=1024 * 1024):\n",
- " if chunk:\n",
- " f.write(chunk)\n",
- "\n",
- " # Decompress\n",
- " import gzip\n",
- " with gzip.open(gz_path, \"rb\") as fin, open(fa_path, \"wb\") as fout:\n",
- " fout.write(fin.read())\n",
- "\n",
- " return fa_path\n",
- "\n",
- "def fetch_window_sequence(chrom: str, start: int, end: int, fasta_path: str = \"\") -> str:\n",
- " \"\"\"Fetch [start,end) sequence from fasta. If fasta_path is a whole genome file, chrom must match record name.\"\"\"\n",
- " fasta = Fasta(fasta_path, rebuild=True)\n",
- " seq = fasta[chrom][start:end].seq\n",
- " return sanitize_dna(seq)\n",
- "\n",
- "# Download chromosome\n",
- "fasta_path = download_ucsc_chrom_fasta(chrom, assembly)\n",
- "print(\"Using downloaded chromosome FASTA:\", fasta_path)\n",
- "\n",
- "seq = fetch_window_sequence(chrom, start, end, fasta_path=fasta_path)\n",
- "print(\"Sequence preview:\", seq[:60] + (\"...\" if len(seq) > 60 else \"\"))\n",
- "print(\"Valid DNA:\", bool(DNA_RE.match(seq)))\n",
- "assert len(seq) == (end - start), \"Fetched sequence length mismatch\""
+ "# Get the sequence from the UCSC API\n",
+ "url = f\"https://api.genome.ucsc.edu/getData/sequence?genome={assembly};chrom={chrom};start={start};end={end}\"\n",
+ "seq = requests.get(url).json()[\"dna\"].upper()\n",
+ "print(f\"Original sequence length: {len(seq)}\")\n",
+ "\n",
+ "# Crop to multiple of 128 (the pipeline will crop again, but this is a no-op once divisible)\n",
+ "seq = seq[:int(len(seq) // 128) * 128]\n",
+ "print(f\"Cropped sequence length: {len(seq)}, {len(seq) / 128} transformer tokens\")"
  ]
  },
  {
@@ -228,7 +173,7 @@
  },
  {
  "cell_type": "code",
- "execution_count": 11,
+ "execution_count": 6,
  "id": "e09f0469",
  "metadata": {},
  "outputs": [
@@ -395,7 +340,7 @@
  ")"
  ]
  },
- "execution_count": 11,
+ "execution_count": 6,
  "metadata": {},
  "output_type": "execute_result"
  }
@@ -419,7 +364,7 @@
  },
  {
  "cell_type": "code",
- "execution_count": 12,
+ "execution_count": 7,
  "id": "43154959",
  "metadata": {},
  "outputs": [
@@ -463,7 +408,7 @@
  },
  {
  "cell_type": "code",
- "execution_count": 13,
+ "execution_count": 8,
  "id": "6765a9b9",
  "metadata": {},
  "outputs": [
@@ -520,7 +465,7 @@
  },
  {
  "cell_type": "code",
- "execution_count": 14,
+ "execution_count": 9,
  "id": "a26e9dcc",
  "metadata": {},
  "outputs": [],
@@ -537,7 +482,7 @@
  },
  {
  "cell_type": "code",
- "execution_count": 15,
+ "execution_count": 10,
  "id": "717539e2",
  "metadata": {},
  "outputs": [],
@@ -582,7 +527,7 @@
  },
  {
  "cell_type": "code",
- "execution_count": 16,
+ "execution_count": 12,
  "id": "7ba9a397",
  "metadata": {},
  "outputs": [
@@ -620,6 +565,7 @@
  "\n",
  "# Model predicts for middle 37.5% of input sequence\n",
  "# So predictions start at: start + (window_len - window_len * 0.375) / 2 = start + window_len * 0.3125\n",
+ "window_len = end - start\n",
  "prediction_start = start + int(window_len * 0.3125)\n",
  "prediction_end = prediction_start + int(window_len * 0.375)\n",
  "x = np.arange(prediction_start, prediction_end)\n",
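The plotting cell in this notebook derives the prediction interval from the fact that the model predicts only the middle 37.5% of the input window, so the margin on each side is (1 − 0.375) / 2 = 0.3125 of the window length. A standalone sketch of that arithmetic, using the example coordinates from the notebook (this snippet is illustrative only, not part of the diff):

```python
# Sanity-check the prediction-window arithmetic from the plotting cell.
# The model predicts for the middle 37.5% of the input window, so the
# margin on each side is (1 - 0.375) / 2 = 0.3125 of the window length.
start, end = 6_700_000, 6_831_072  # example window from the notebook
window_len = end - start
assert window_len % 128 == 0, "window length must be a multiple of 128"

prediction_start = start + int(window_len * 0.3125)
prediction_end = prediction_start + int(window_len * 0.375)

print(window_len, prediction_start, prediction_end)  # → 131072 6740960 6790112
# The predicted interval is centered: equal margins on both sides.
print(prediction_start - start == end - prediction_end)  # → True
```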
notebooks/02_genome_annotation.ipynb CHANGED
@@ -11,19 +11,17 @@
  "\n",
  "The pipeline abstracts away all the underlying steps: running inference with the model, retrieving and processing the predicted probabilities, and applying the HMM to generate a consistent annotation. It returns a ready-to-use GFF file that can be visualized in any genome browser for the sequence of interest.\n",
  "\n",
- "If you’re interested in exploring the intermediate probabilities, please refer to the track-prediction notebooks. These probabilities can be useful for assessing model confidence and identifying potentially interesting biological regions. This notebook focuses on the higher-level task of producing gene annotations directly from raw DNA.\n",
+ "If you’re interested in exploring the intermediate probabilities, please refer to the [track-prediction notebook](https://huggingface.co/spaces/InstaDeepAI/ntv3/blob/main/notebooks/01_tracks_prediction.ipynb). These probabilities can be useful for assessing model confidence and identifying potentially interesting biological regions. This notebook focuses on the higher-level task of producing gene annotations directly from raw DNA.\n",
  "\n",
  "> 📝 **Note for Google Colab users:** This notebook is compatible with Colab! For faster inference, make sure to enable GPU: Runtime → Change runtime type → GPU (T4 or better recommended)."
  ]
  },
  {
  "cell_type": "markdown",
- "id": "71fac239",
+ "id": "94c46695",
  "metadata": {},
  "source": [
- "## 0) Colab Setup (if running on Google Colab)\n",
- "\n",
- "This cell detects if you're running on Google Colab and sets up the environment accordingly."
+ "## 0) 📦 Imports + setup"
  ]
  },
  {
@@ -37,16 +35,6 @@
  "!pip -q install \"transformers>=4.55\" \"huggingface_hub>=0.23\" safetensors torch pyfaidx requests seaborn matplotlib igv_notebook"
  ]
  },
- {
- "cell_type": "markdown",
- "id": "36d32e97",
- "metadata": {},
- "source": [
- "## 1) 📦 Imports + configuration\n",
- "\n",
- "Set your NTv3 model and genomic window here"
- ]
- },
  {
  "cell_type": "code",
  "execution_count": null,
@@ -61,6 +49,16 @@
  "from transformers import pipeline"
  ]
  },
+ {
+ "cell_type": "markdown",
+ "id": "9d29bb77",
+ "metadata": {},
+ "source": [
+ "## 1) 📦 Configuration\n",
+ "\n",
+ "Set your NTv3 model and genomic window here"
+ ]
+ },
  {
  "cell_type": "code",
  "execution_count": null,
@@ -69,7 +67,7 @@
  "outputs": [],
  "source": [
  "# Define the model and genomic window\n",
- "model_name = \"InstaDeepAI/NTv3_650M\"\n",
+ "model_name = \"InstaDeepAI/NTv3_650M_pos\"\n",
  "assembly = \"hg38\"\n",
  "chrom = \"chr19\"\n",
  "start = 6_700_000\n",
@@ -98,7 +96,7 @@
  "\n",
  "# Crop to multiple of 128 (the pipeline will crop again, but this is a no-op once divisible)\n",
  "seq = seq[:int(len(seq) // 128) * 128]\n",
- "print(f\"Cropped sequence length: {len(seq)}, {len(seq) / 128} tokens\")"
+ "print(f\"Cropped sequence length: {len(seq)}, {len(seq) / 128} transformer tokens\")"
  ]
  },
  {
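Both the tracks-prediction and genome-annotation notebooks crop the fetched sequence with `seq[:int(len(seq) // 128) * 128]` before inference, since the model downsamples 128 bp into one transformer token. A minimal sketch of that crop in isolation (the helper name is hypothetical, not part of the notebooks):

```python
def crop_to_multiple_of_128(seq: str) -> str:
    # Hypothetical helper wrapping the notebooks' crop expression:
    # drop trailing bases so the length is divisible by 128
    # (the model turns each 128 bp into one transformer token).
    return seq[: (len(seq) // 128) * 128]

print(len(crop_to_multiple_of_128("ACGT" * 33)))  # 132 bp → 128
print(len(crop_to_multiple_of_128("ACGT" * 32)))  # already divisible → 128
```

Once the length is divisible by 128 the crop is a no-op, which is why the pipeline can safely apply it again.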