{
  "cells": [
    {
      "cell_type": "markdown",
      "id": "024bb8a8",
      "metadata": {},
      "source": [
        "# 🚀 NTv3 Quickstart — Pre-trained and Post-trained models\n",
        "\n",
        "This notebook demonstrates how to run **quick inference** with both the pre- and post-trained NTv3 checkpoints:\n",
        "\n",
        "- **Pre-trained (MLM-focused):** `InstaDeepAI/NTv3_8M_pre`, `InstaDeepAI/NTv3_100M_pre`, `InstaDeepAI/NTv3_650M_pre`\n",
        "- **Post-trained (functional tracks and genome annotation):** `InstaDeepAI/NTv3_100M_pos`, `InstaDeepAI/NTv3_650M_pos`\n",
        "\n",
        "We show how to:\n",
        "\n",
        "1. Load tokenizers + models\n",
        "2. Run a forward pass on a DNA sequence window\n",
        "3. Inspect key outputs\n",
        "\n",
        "> 📝 **Note for Google Colab users:** This notebook is compatible with Colab! For faster inference, make sure to enable GPU: Runtime → Change runtime type → GPU (T4 or better recommended)."
      ]
    },
    {
      "cell_type": "markdown",
      "id": "5827af7e",
      "metadata": {},
      "source": [
        "## 0) 📦 Imports + setup"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "38cc32a9",
      "metadata": {},
      "outputs": [],
      "source": [
        "!pip -q install \"transformers>=4.40\" \"huggingface_hub>=0.23\" safetensors torch numpy"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 3,
      "id": "d56c105b",
      "metadata": {},
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "device: cpu\n",
            "torch_dtype: torch.float32\n"
          ]
        }
      ],
      "source": [
        "import os\n",
        "import torch\n",
        "import numpy as np\n",
        "\n",
        "from transformers import AutoConfig, AutoModel, AutoTokenizer, AutoModelForMaskedLM\n",
        "\n",
        "# Optional: if the model is gated/private, set HF_TOKEN to a PERSONAL token (hf_...)\n",
        "HF_TOKEN = os.getenv(\"HF_TOKEN\", None)\n",
        "\n",
        "# -----------------------------\n",
        "# Device\n",
        "# -----------------------------\n",
        "device = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n",
        "print(\"device:\", device)\n",
        "\n",
        "# Choose dtype (bf16 if supported; else fp16 on GPU; else fp32)\n",
        "if device == \"cuda\":\n",
        "    major, minor = torch.cuda.get_device_capability(0)\n",
        "    torch_dtype = torch.bfloat16 if major >= 8 else torch.float16\n",
        "else:\n",
        "    torch_dtype = torch.float32\n",
        "\n",
        "print(\"torch_dtype:\", torch_dtype)"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "82146876",
      "metadata": {},
      "source": [
        "## 1) 🎯 Pre-trained checkpoint (MLM-focused)\n",
        "\n",
        "This shows the simplest usage: load model + tokenizer, then run a forward pass.\n",
        "\n",
        "Expected output:\n",
        "- `logits`: masked language modeling logits"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "336bb40c",
      "metadata": {},
      "outputs": [
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "411ee47e94ae467f9685c35b65e3e52d",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "tokenizer_config.json:   0%|          | 0.00/1.48k [00:00<?, ?B/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "30447edb44b849bd936290f3a6b1b863",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "tokenization_ntv3.py:   0%|          | 0.00/12.0k [00:00<?, ?B/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "A new version of the following files was downloaded from https://huggingface.co/InstaDeepAI/ntv3_base_model:\n",
            "- tokenization_ntv3.py\n",
            ". Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.\n"
          ]
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "766f183dcc84421588e5cf0241d3efe7",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "vocab.json:   0%|          | 0.00/138 [00:00<?, ?B/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "b0db83f7cb824d3288a30bebf7891a63",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "special_tokens_map.json:   0%|          | 0.00/149 [00:00<?, ?B/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "33cf5391dcc549f088e4e927651d1cdb",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "config.json:   0%|          | 0.00/1.70k [00:00<?, ?B/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "85772d5369234ca286cfa518e1725b12",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "configuration_ntv3.py:   0%|          | 0.00/5.90k [00:00<?, ?B/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "A new version of the following files was downloaded from https://huggingface.co/InstaDeepAI/ntv3_base_model:\n",
            "- configuration_ntv3.py\n",
            ". Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.\n"
          ]
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "ec1153d073e444c5b255ee5adea6ba68",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "modeling_ntv3_base.py:   0%|          | 0.00/33.9k [00:00<?, ?B/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "A new version of the following files was downloaded from https://huggingface.co/InstaDeepAI/ntv3_base_model:\n",
            "- modeling_ntv3_base.py\n",
            ". Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.\n"
          ]
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "94b9bb7fe0da4f4994adb9127d9af7e6",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "model.safetensors:   0%|          | 0.00/30.8M [00:00<?, ?B/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "torch.Size([2, 128, 11])\n",
            "16\n",
            "2\n",
            "MLM logits shape: (2, 128, 11)\n"
          ]
        }
      ],
      "source": [
        "pretrained_model_name = \"InstaDeepAI/NTv3_8M_pre\"\n",
        "\n",
        "# Load tokenizer/model\n",
        "tok_pre = AutoTokenizer.from_pretrained(pretrained_model_name, trust_remote_code=True)\n",
        "model_pre = AutoModelForMaskedLM.from_pretrained(pretrained_model_name, trust_remote_code=True)\n",
        "\n",
        "# Example: human sequence\n",
        "seqs = [\"ATCGNATCG\", \"ACGT\"]\n",
        "batch = tok_pre(seqs, add_special_tokens=False, padding=True, pad_to_multiple_of=128, return_tensors=\"pt\")\n",
        "out = model_pre(**batch, output_hidden_states=True, output_attentions=True)\n",
        "\n",
        "print(out.logits.shape)       # (B, L, V = 11)\n",
        "print(len(out.hidden_states)) # convs + transformers + deconvs\n",
        "print(len(out.attentions))\n",
        "\n",
        "# Access MLM logits\n",
        "mlm_logits = out[\"logits\"]\n",
        "print(\"MLM logits shape:\", tuple(mlm_logits.shape))"
      ]
    },
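    {
      "cell_type": "markdown",
      "id": "7c2e4a10",
      "metadata": {},
      "source": [
        "The logits above have a sequence dimension of 128 although the longest input has only 9 nucleotides, because `pad_to_multiple_of=128` rounds the batch length up to the next multiple of 128. Below is a minimal sketch of that arithmetic. It assumes one token per nucleotide, which is consistent with the printed shapes but not confirmed here, and `padded_length` is a hypothetical helper, not part of the NTv3 or `transformers` API.\n",
        "\n",
        "```python\n",
        "def padded_length(num_tokens: int, multiple: int = 128) -> int:\n",
        "    '''Round num_tokens up to the nearest multiple (hypothetical helper).'''\n",
        "    if num_tokens <= 0 or multiple <= 0:\n",
        "        raise ValueError('num_tokens and multiple must be positive')\n",
        "    # Ceiling division, then scale back up to a token count\n",
        "    return -(-num_tokens // multiple) * multiple\n",
        "\n",
        "\n",
        "seqs = ['ATCGNATCG', 'ACGT']\n",
        "longest = max(len(s) for s in seqs)  # 9, assuming one token per nucleotide\n",
        "print(padded_length(longest))        # 128\n",
        "print(padded_length(129))            # 256\n",
        "```\n",
        "\n",
        "Padding positions still receive logits, so downstream code would normally ignore them, for example by using the `attention_mask` returned by the tokenizer if one is present."
      ]
    },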
    {
      "cell_type": "markdown",
      "id": "60a01798",
      "metadata": {},
      "source": [
        "## 2) 🧠 Post-trained checkpoint (task heads: BigWig + BED)\n",
        "\n",
        "Post-trained checkpoints add task-specific heads for functional track prediction and genome annotation.\n",
        "\n",
        "In particular:\n",
        "- `species_tokenizer` is used to tokenize a species condition like `\"human\"`\n",
        "- `species_ids` passes the species tokens to the model\n",
        "\n",
        "Expected outputs:\n",
        "- `bigwig_tracks_logits`: functional track predictions\n",
        "- `bed_tracks_logits`: genome annotation predictions\n",
        "- `logits`: masked language modeling logits"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 9,
      "id": "bdb8c4d1",
      "metadata": {},
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Model supported species: TO BE DONE\n"
          ]
        }
      ],
      "source": [
        "# Inspect config and supported species\n",
        "post_trained_model_name = \"InstaDeepAI/NTv3_100M_pos\"\n",
        "\n",
        "cfg_post = AutoConfig.from_pretrained(post_trained_model_name, trust_remote_code=True)\n",
        "\n",
        "species = \"TO BE DONE\"\n",
        "print(\"Model supported species:\", species)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "6cc5f2df",
      "metadata": {},
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "torch.Size([1, 768, 7362])\n",
            "torch.Size([1, 768, 21, 2])\n",
            "torch.Size([1, 2048, 11])\n"
          ]
        }
      ],
      "source": [
        "tok_post = AutoTokenizer.from_pretrained(post_trained_model_name, trust_remote_code=True)\n",
        "cond_tok_post = AutoTokenizer.from_pretrained(post_trained_model_name, subfolder='species_tokenizer', trust_remote_code=True)\n",
        "model_post = AutoModel.from_pretrained(post_trained_model_name, trust_remote_code=True)\n",
        "\n",
        "# Prepare inputs\n",
        "batch = tok_post([\"ATCGNATCG\", \"ACGT\"], add_special_tokens=False, padding=True, pad_to_multiple_of=128, return_tensors=\"pt\")\n",
        "\n",
        "# Condition tokens (e.g., species)\n",
        "species = 'human'\n",
        "species_ids = cond_tok_post([species] * len(batch['input_ids']), add_special_tokens=False, return_tensors='pt')\n",
        "\n",
        "# Forward pass\n",
        "out = model_post(\n",
        "    input_ids=batch[\"input_ids\"],\n",
        "    species_ids=species_ids['input_ids'],\n",
        "    return_dict=True\n",
        ")\n",
        "\n",
        "# 7k human tracks over 37.5 % center region of the input sequence\n",
        "print(\"bigwig_tracks_logits:\", tuple(out[\"bigwig_tracks_logits\"].shape))\n",
        "# Location of 21 genomic elements over 37.5 % center region of the input sequence\n",
        "print(\"bed_tracks_logits:\", tuple(out[\"bed_tracks_logits\"].shape))\n",
        "# Language model logits for whole sequence over vocabulary\n",
        "print(\"language model logits:\", tuple(out[\"logits\"].shape))\n"
      ]
    }
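    ,
    {
      "cell_type": "markdown",
      "id": "e94b1d37",
      "metadata": {},
      "source": [
        "The track heads predict only over the central 37.5% of the input, which is why the stored output above shows 768 track positions for a 2048-token input. Below is a minimal sketch of that arithmetic. `center_window` is a hypothetical helper, and the assumption that the window is exactly centred with equal flanks on both sides is not confirmed by the model code.\n",
        "\n",
        "```python\n",
        "from fractions import Fraction\n",
        "\n",
        "\n",
        "def center_window(seq_len: int, keep: Fraction = Fraction(3, 8)) -> tuple[int, int]:\n",
        "    '''Return (start, end) of the central keep-fraction of a sequence (hypothetical helper).'''\n",
        "    if seq_len <= 0:\n",
        "        raise ValueError('seq_len must be positive')\n",
        "    window = int(seq_len * keep)     # 37.5% = 3/8 of the input length\n",
        "    start = (seq_len - window) // 2  # assumes equal flanks on both sides\n",
        "    return start, start + window\n",
        "\n",
        "\n",
        "start, end = center_window(2048)\n",
        "print(start, end, end - start)  # 640 1408 768\n",
        "```\n",
        "\n",
        "Under that assumption, track position `i` would correspond to input position `start + i`, which is useful when aligning predictions with genomic coordinates."
      ]
    }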
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "hf-finetune",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.10.18"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 5
}