Spaces: InstaDeepAI/ntv3

bernardo-de-almeida committed on Dec 9, 2025

Commit 3367165

1 Parent(s): a10b560

feat: add inference and track prediction notebooks

Files changed (2)

notebooks/00_quickstart_inference.ipynb +203 -11
notebooks/01_tracks_prediction.ipynb +0 -0

notebooks/00_quickstart_inference.ipynb CHANGED

@@ -2,51 +2,243 @@
   "cells": [
     {
       "cell_type": "markdown",
       "metadata": {},
       "source": [
-        "# NTv3 Quickstart Inference\n",
         "\n",
-        "This notebook demonstrates how to load and run inference with NTv3 models.\n"
       ]
     },
     {
       "cell_type": "markdown",
       "metadata": {},
       "source": [
-        "## Install Dependencies"
       ]
     },
     {
       "cell_type": "markdown",
       "metadata": {},
       "source": [
-        "## Load Model and Tokenizer"
       ]
     },
     {
       "cell_type": "markdown",
       "metadata": {},
       "source": [
-        "## Run Inference"
       ]
     },
     {
       "cell_type": "markdown",
       "metadata": {},
       "source": [
-        "## Next Steps\n",
         "\n",
-        "- Try different sequences and models\n",
-        "- Explore model outputs\n",
-        "- Check out other notebooks for tracks prediction, annotation, and more\n"
       ]
     }
   ],
   "metadata": {
     "language_info": {
-      "name": "python"
     }
   },
   "nbformat": 4,
-  "nbformat_minor": 2
 }

   "cells": [
     {
       "cell_type": "markdown",
+      "id": "024bb8a8",
       "metadata": {},
       "source": [
+        "# NTv3 Quickstart — Pre-trained and Post-trained models\n",
         "\n",
+        "This notebook demonstrates how to run **quick inference** with both pre- and post-trained NTv3 checkpoints:\n",
+        "\n",
+        "- **Pre-trained (MLM-focused):** `InstaDeepAI/ntv3_8M_7downsample_pretrained_le_1mb`, `InstaDeepAI/ntv3_106M_7downsample_pretrained_le_1mb`, `InstaDeepAI/ntv3_650M_7downsample_pretrained_le_1mb`\n",
+        "- **Post-trained (task heads):** `InstaDeepAI/ntv3_106M_7downsample_post_trained_1mb`, `InstaDeepAI/ntv3_650M_7downsample_post_trained_1mb`\n",
+        "\n",
+        "We show how to:\n",
+        "\n",
+        "1. Load tokenizers + models\n",
+        "2. Run a forward pass on a DNA sequence window\n",
+        "3. Inspect key outputs"
       ]
     },
     {
       "cell_type": "markdown",
+      "id": "5d58bf1d",
       "metadata": {},
       "source": [
+        "## 0) Install dependencies\n",
+        "\n",
+        "Skip if already installed."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "38cc32a9",
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "!pip -q install \"transformers>=4.40\" \"huggingface_hub>=0.23\" safetensors torch numpy"
       ]
     },
     {
       "cell_type": "markdown",
+      "id": "5827af7e",
+      "metadata": {},
+      "source": [
+        "## 1) Imports + setup"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 7,
+      "id": "d56c105b",
       "metadata": {},
+      "outputs": [
+        {
+          "name": "stdout",
+          "output_type": "stream",
+          "text": [
+            "device: cpu\n",
+            "torch_dtype: torch.float32\n"
+          ]
+        }
+      ],
       "source": [
+        "import os\n",
+        "import torch\n",
+        "import numpy as np\n",
+        "\n",
+        "from transformers import AutoConfig, AutoModel, AutoTokenizer, AutoModelForMaskedLM\n",
+        "\n",
+        "# Optional: if the model is gated/private, set HF_TOKEN to a PERSONAL token (hf_...)\n",
+        "HF_TOKEN = os.getenv(\"HF_TOKEN\", None)\n",
+        "\n",
+        "# -----------------------------\n",
+        "# Device\n",
+        "# -----------------------------\n",
+        "device = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n",
+        "print(\"device:\", device)\n",
+        "\n",
+        "# Choose dtype (bf16 if supported; else fp16 on GPU; else fp32)\n",
+        "if device == \"cuda\":\n",
+        "    major, minor = torch.cuda.get_device_capability(0)\n",
+        "    torch_dtype = torch.bfloat16 if major >= 8 else torch.float16\n",
+        "else:\n",
+        "    torch_dtype = torch.float32\n",
+        "\n",
+        "print(\"torch_dtype:\", torch_dtype)"
       ]
     },
     {
       "cell_type": "markdown",
+      "id": "82146876",
+      "metadata": {},
+      "source": [
+        "## 2) Pre-trained checkpoint (MLM-focused)\n",
+        "\n",
+        "This shows the simplest usage: load model + tokenizer, then run a forward pass.\n",
+        "\n",
+        "Expected output:\n",
+        "- `logits`: masked language modeling logits"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "336bb40c",
       "metadata": {},
+      "outputs": [
+        {
+          "name": "stdout",
+          "output_type": "stream",
+          "text": [
+            "torch.Size([2, 128, 11])\n",
+            "16\n",
+            "2\n",
+            "MLM logits shape: (2, 128, 11)\n"
+          ]
+        },
+        {
+          "name": "stderr",
+          "output_type": "stream",
+          "text": [
+            "/opt/anaconda3/envs/hf-finetune/lib/python3.10/site-packages/torch/amp/autocast_mode.py:283: UserWarning: In CPU autocast, but the target dtype is not supported. Disabling autocast.\n",
+            "CPU Autocast only supports dtype of torch.bfloat16, torch.float16 currently.\n",
+            "  warnings.warn(error_message)\n"
+          ]
+        }
+      ],
       "source": [
+        "pretrained_model_name = \"InstaDeepAI/ntv3_8M_7downsample_pretrained_le_1mb\"\n",
+        "\n",
+        "# Load tokenizer/model\n",
+        "tok_pre = AutoTokenizer.from_pretrained(pretrained_model_name, trust_remote_code=True)\n",
+        "model_pre = AutoModelForMaskedLM.from_pretrained(pretrained_model_name, trust_remote_code=True)\n",
+        "\n",
+        "# Example: two short toy DNA sequences, padded to a multiple of 128 tokens\n",
+        "seqs = [\"ATCGNATCG\", \"ACGT\"]\n",
+        "batch = tok_pre(seqs, add_special_tokens=False, padding=True, pad_to_multiple_of=128, return_tensors=\"pt\")\n",
+        "out = model_pre(**batch, output_hidden_states=True, output_attentions=True)\n",
+        "\n",
+        "print(out.logits.shape)       # (B, L, V = 11)\n",
+        "print(len(out.hidden_states)) # convs + transformers + deconvs\n",
+        "print(len(out.attentions))\n",
+        "\n",
+        "# Access MLM logits\n",
+        "mlm_logits = out[\"logits\"]\n",
+        "print(\"MLM logits shape:\", tuple(mlm_logits.shape))"
       ]
     },
     {
       "cell_type": "markdown",
+      "id": "60a01798",
+      "metadata": {},
+      "source": [
+        "## 3) Post-trained checkpoint (task heads: BigWig + BED)\n",
+        "\n",
+        "Post-trained checkpoints add task-specific heads.\n",
+        "\n",
+        "In particular:\n",
+        "- `condition_tokenizer` is used to tokenize a species condition like `\"human\"`\n",
+        "- `file_assembly_idx` selects the assembly (e.g., `hg38`)\n",
+        "\n",
+        "Expected outputs:\n",
+        "- `bigwig_tracks_logits`\n",
+        "- `bed_tracks_logits`\n",
+        "- `logits` (MLM)\n",
+        "\n",
+        "> If your post-trained checkpoint supports multiple assemblies, the config typically exposes a mapping like `cfg.bigwigs_per_file_assembly`."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "6cc5f2df",
       "metadata": {},
+      "outputs": [
+        {
+          "name": "stdout",
+          "output_type": "stream",
+          "text": [
+            "torch.Size([1, 768, 7362])\n",
+            "torch.Size([1, 768, 21, 2])\n",
+            "torch.Size([1, 2048, 11])\n"
+          ]
+        }
+      ],
       "source": [
+        "posttrained_model_name = \"InstaDeepAI/ntv3_106M_7downsample_post_trained_1mb\"\n",
         "\n",
+        "# Load config/tokenizers/model\n",
+        "cfg_pos = AutoConfig.from_pretrained(posttrained_model_name, trust_remote_code=True)\n",
+        "tok_pos = AutoTokenizer.from_pretrained(posttrained_model_name, trust_remote_code=True)\n",
+        "model_pos = AutoModel.from_pretrained(posttrained_model_name, trust_remote_code=True)\n",
+        "condition_tokenizer = AutoTokenizer.from_pretrained(\n",
+        "    posttrained_model_name, subfolder=\"condition_tokenizer\", trust_remote_code=True\n",
+        ")\n",
+        "\n",
+        "# Example: human sequence (length must be a multiple of 128 = 2**7 because the model downsamples 7 times)\n",
+        "seq = \"ATCG\" * 512\n",
+        "batch = tok_pos([seq], add_special_tokens=False, return_tensors=\"pt\")\n",
+        "condition = condition_tokenizer([\"human\"], return_tensors=\"pt\")\n",
+        "\n",
+        "# Get assembly index for human (hg38)\n",
+        "assemblies = list(cfg_pos.bigwigs_per_file_assembly.keys())\n",
+        "assembly_idx = torch.tensor([assemblies.index(\"hg38\")])\n",
+        "\n",
+        "out = model_pos(\n",
+        "    input_ids=batch[\"input_ids\"],\n",
+        "    condition_ids=[condition[\"input_ids\"][0]],\n",
+        "    file_assembly_idx=assembly_idx,\n",
+        "    output_hidden_states=True,\n",
+        "    output_attentions=True,\n",
+        ")\n",
+        "\n",
+        "# Access model outputs\n",
+        "print(out[\"bigwig_tracks_logits\"].shape)  # per-assembly functional track predictions\n",
+        "print(out[\"bed_tracks_logits\"].shape)     # genomic element classifications\n",
+        "print(out[\"logits\"].shape)                # masked LM logits"
       ]
     }
   ],
   "metadata": {
+    "kernelspec": {
+      "display_name": "hf-finetune",
+      "language": "python",
+      "name": "python3"
+    },
     "language_info": {
+      "codemirror_mode": {
+        "name": "ipython",
+        "version": 3
+      },
+      "file_extension": ".py",
+      "mimetype": "text/x-python",
+      "name": "python",
+      "nbconvert_exporter": "python",
+      "pygments_lexer": "ipython3",
+      "version": "3.10.18"
     }
   },
   "nbformat": 4,
+  "nbformat_minor": 5
 }
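
Note on input length (a minimal sketch, not part of this commit): the notebook pads tokenized input with `pad_to_multiple_of=128` and states that the sequence length must be a multiple of 128 because of the 7 downsampling steps. Assuming each step halves the length (so 2**7 = 128), the required length can be checked before calling the model. The helper `pad_to_multiple` and the use of `N` as a padding character are hypothetical illustrations; the source does not confirm that string-level `N` padding is treated the same way as tokenizer padding.

```python
# Hypothetical helper: right-pad a DNA string so its length is a multiple
# of the model's downsampling factor. Not part of the NTv3 code base.

DOWNSAMPLES = 7                        # "7downsample" in the checkpoint names
REQUIRED_MULTIPLE = 2 ** DOWNSAMPLES   # 128, assuming each step halves the length


def pad_to_multiple(seq: str, multiple: int = REQUIRED_MULTIPLE, pad_char: str = "N") -> str:
    """Return seq right-padded with pad_char to the next multiple of `multiple`."""
    if multiple <= 0:
        raise ValueError("multiple must be positive")
    remainder = len(seq) % multiple
    if remainder == 0:
        return seq  # already aligned, nothing to add
    return seq + pad_char * (multiple - remainder)


short = pad_to_multiple("ATCGNATCG")
print(len(short))                # 128

aligned = "ATCG" * 512           # 2048 bases, already a multiple of 128
print(pad_to_multiple(aligned) == aligned)  # True
```

In the notebook itself the tokenizer's `pad_to_multiple_of=128` argument does this job for the pre-trained example; a string-level check like the one above is mainly useful for validating inputs such as `"ATCG" * 512` before they reach the post-trained model.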

notebooks/01_tracks_prediction.ipynb ADDED

The diff for this file is too large to render. See raw diff