Spaces: InstaDeepAI/ntv3

ybornachot committed on Dec 22, 2025

Commit 11ccfa8 (1 parent: 8ade038)

refactor: cleaning

Files changed (1)

notebooks_tutorials/02_fine_tuning.ipynb +149 -244

@@ -8,31 +8,25 @@
     "\n",
     "This notebook demonstrates a **simplified fine-tuning setup** that enables training of a pre-trained Nucleotide Transformer v3 (NTv3) model to predict BigWig signal tracks directly from DNA sequences. The streamlined approach leverages a pre-trained NTv3 backbone as a feature extractor and adds a custom prediction head that outputs single-nucleotide resolution signal values for various genomic tracks (e.g., ChIP-seq, ATAC-seq, RNA-seq).\n",
     "\n",
-    "We provide access to the NTv3 benchmark data released as the Hugging Face dataset `InstaDeepAI/NTv3_benchmark_dataset`. The repository contains ready-to-use genome FASTA files, BigWig tracks, and metadata, as well as the splits used for the benchmark.\n",
     "\n",
     "**🔧 Main Simplifications**: Compared to the full supervised tracks pipeline, this notebook simplifies several aspects to enable faster iteration:\n",
     "- **Random sequence sampling**: The dataset randomly samples sequences from chromosomes/regions on-the-fly, rather than using pre-computed sliding windows\n",
     "- **Constant learning rate**: Uses a fixed learning rate throughout training without learning rate scheduling\n",
     "- **No gradient accumulation**: Implements simple step-based training without gradient accumulation, making the training loop more straightforward\n",
     "\n",
-    "**⚡ Key Advantage**: This simplified pipeline achieves performance close to that of more complex training approaches while enabling fast fine-tuning: on an H100 GPU with 16 data-loading workers, it takes ~15 min to reach acceptable performance on a 32 kb functional-track prediction task with the **NTv3_8M_pre** model. Training speed benefits from the efficient NTv3 architecture but depends on your hardware (GPU acceleration and multi-worker data loading significantly reduce training time).\n",
-    "\n",
-    "**⚠️ Important Note on Hardware Requirements**: While this pipeline is designed to run on limited resources (e.g., Google Colab with a T4 GPU and 2 CPUs), the training time quoted above and the performance shown in the **Test evaluation** section were obtained on a more powerful setup. Reaching similar performance requires **significant hardware resources** (high-end GPUs with substantial memory and multiple data-loading workers). Training times will vary significantly with your hardware configuration.\n",
-    "\n",
-    "The pipeline walks through the complete fine-tuning workflow:\n",
-    "- Loading genome FASTA sequences and their corresponding BigWig signal tracks from the Hugging Face dataset\n",
-    "- Setting up a PyTorch dataset with proper train/validation/test splits\n",
-    "- Configuring the model architecture with a custom linear head\n",
-    "- Implementing a training loop with appropriate loss functions and evaluation metrics\n",
-    "- Evaluation of the fine-tuned model on the test set\n",
-    "\n",
-    "This provides a clean interface for fine-tuning and evaluation.\n",
-    "\n",
-    "The model architecture consists of a pre-trained NTv3 backbone that processes DNA sequences and a custom linear head that predicts BigWig signal values at single-nucleotide resolution. Predictions are center-cropped to focus on the central portion of the input sequence (configurable via `keep_target_center_fraction`), which helps reduce edge effects from sequence context windows. The training uses a Poisson-Multinomial loss function that captures both the scale and shape of the signal distributions, and evaluation is performed using Pearson correlation metrics on both scaled and raw predictions.\n",
     "\n",
-    "If you're interested in using pre-trained models for inference without fine-tuning, or exploring different model architectures, please refer to other notebooks in this collection. This notebook focuses specifically on the simplified fine-tuning process, which is useful when you want to quickly adapt a pre-trained model to genomic tracks or improve performance on particular cell types or experimental conditions.\n",
     "\n",
-    "📝 Note for Google Colab users: This notebook is compatible with Colab and designed to work with limited resources! For faster training, make sure to enable GPU: Runtime → Change runtime type → GPU (T4 or better recommended). However, keep in mind that the timing benchmarks mentioned above were obtained on much more powerful hardware (H100 GPU), so your training times on Colab may be significantly longer."
    ]
   },
   {
@@ -58,7 +52,6 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "# Standard library imports\n",
     "import functools\n",
     "import fnmatch\n",
     "import os\n",
@@ -66,7 +59,6 @@
     "from pathlib import Path\n",
     "from typing import Callable, Dict, List\n",
     "\n",
-    "# Third-party imports\n",
     "from huggingface_hub import HfApi, snapshot_download\n",
     "import matplotlib.pyplot as plt\n",
     "import numpy as np\n",
@@ -88,15 +80,10 @@
    "metadata": {},
    "source": [
     "# 1. ⚙️ Configuration\n",
-    " \n",
-    "💡 **Tip:** The parameters below are pre-configured for minimal requirements and are suitable for running on a Colab GPU, but this may come at the cost of reduced model performance or slower training.  \n",
-    " \n",
-    "Feel free to experiment with these parameters according to your available resources:\n",
-    "- If you have a more powerful GPU, **increase** `batch_size`, `learning_rate`, and `num_steps_training` for better performance and more robust training results.\n",
-    "- To speed up training (especially during data loading), consider increasing the `num_workers` value if memory and CPU resources allow.\n",
-    "\n",
-    "The current configuration reaches decent performance and completes training in ~1h30 on a Colab environment with one T4 GPU and 2 CPUs. \n",
     "\n",
     "\n",
     "## Configuration Parameters\n",
     "\n",
@@ -204,7 +191,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 3,
    "metadata": {},
    "outputs": [],
    "source": [
@@ -279,8 +266,6 @@
     "    # FASTA file\n",
     "    fasta_path_repo = f\"{species}/genome.fasta\"\n",
     "    fasta_path = str(local_dir / fasta_path_repo)\n",
-    "    if not Path(fasta_path).is_file():\n",
-    "        raise ValueError(f\"FASTA file not found at '{fasta_path}'\")\n",
     "    \n",
     "    # BigWig files - use downloaded files directly\n",
     "    bigwig_dir = local_dir / species / \"functional_tracks\"\n",
@@ -296,8 +281,7 @@
     "    # Splits file\n",
     "    splits_path_repo = f\"{species}/splits.bed\"\n",
     "    splits_path = local_dir / splits_path_repo\n",
-    "    if not splits_path.is_file():\n",
-    "        raise ValueError(f\"Splits file not found at '{splits_path}'\")\n",
     "    splits_df = pd.read_csv(\n",
     "        splits_path, \n",
     "        sep=\"\\t\", \n",
@@ -311,7 +295,7 @@
     "    metadata_df = pd.read_csv(metadata_path, sep=\"\\t\")\n",
     "\n",
     "    # Filter metadata according to species\n",
-    "    metadata_df = metadata_df[metadata_df[\"species\"] == species].reset_index(drop=True)\n",
     "\n",
     "    # Order metadata according to bigwig file ids\n",
     "    metadata_df = (\n",
@@ -367,23 +351,24 @@
    "source": [
     "# 3. 🧠 Model and tokenizer setup\n",
     " \n",
-    "In this section, we set up the model and tokenizer. \n",
-    " \n",
-    "Our approach uses any suitable pretrained backbone from HuggingFace Transformers (for example, `InstaDeepAI/ntv3_650M_pre`),\n",
-    "which is then extended with an additional linear head. \n",
-    " \n",
-    "This linear head is trained for regression on a set of genomic tracks, \n",
-    "allowing the model to make predictions for each track at single nucleotide resolution.\n",
-    " \n",
-    "The following code wraps the HuggingFace model together with this regression head for the end-to-end task.\n"
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 20,
    "metadata": {},
    "outputs": [],
    "source": [
     "class LinearHead(nn.Module):\n",
     "    \"\"\"A linear head that predicts one scalar value per track.\"\"\"\n",
     "    def __init__(self, embed_dim: int, num_labels: int):\n",
@@ -419,11 +404,7 @@
     "        self.backbone = torch.compile(backbone)\n",
     "        \n",
     "        self.keep_target_center_fraction = keep_target_center_fraction\n",
-    "\n",
-    "        if hasattr(self.config, \"embed_dim\"):\n",
-    "            embed_dim = self.config.embed_dim\n",
-    "        else:\n",
-    "            raise ValueError(f\"Could not determine embed_dim for {model_name}\")\n",
     "        \n",
     "        # Bigwig head (NTv3 outputs at single-nucleotide resolution)\n",
     "        self.bigwig_head = LinearHead(embed_dim, len(bigwig_track_names))\n",
@@ -436,10 +417,7 @@
     "        \n",
     "        # Crop to center fraction\n",
     "        if self.keep_target_center_fraction < 1.0:\n",
-    "            seq_len = embedding.shape[1]\n",
-    "            target_offset = int(seq_len * (1 - self.keep_target_center_fraction) // 2)\n",
-    "            target_length = seq_len - 2 * target_offset\n",
-    "            embedding = embedding[:, target_offset:target_offset + target_length, :]\n",
     "        \n",
     "        # Predict bigwig tracks\n",
     "        bigwig_logits = self.bigwig_head(embedding)\n",
@@ -449,7 +427,7 @@
   },
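The center-crop arithmetic touched by the hunk above can be checked in isolation. The following is a minimal sketch that reproduces the same offset computation without tensors; `center_crop_bounds` is a hypothetical helper name, and the 32,768 bp length is only an illustrative value, not one taken from the notebook configuration.

```python
def center_crop_bounds(seq_len: int, keep_fraction: float) -> tuple[int, int]:
    """Return (start, end) indices of the central window that is kept."""
    # Same arithmetic as in the diff: floor-divide the discarded length by two.
    offset = int(seq_len * (1 - keep_fraction) // 2)
    length = seq_len - 2 * offset
    return offset, offset + length


start, end = center_crop_bounds(seq_len=32768, keep_fraction=0.375)
print(start, end, end - start)  # 10240 22528 12288
```

Because the offset is floored, the kept window can be slightly longer than `keep_fraction * seq_len` when the discarded length is odd; the model output and the targets stay aligned as long as both use the same computation.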
   {
    "cell_type": "code",
-   "execution_count": 21,
    "metadata": {},
    "outputs": [
     {
@@ -473,7 +451,6 @@
     "    keep_target_center_fraction=config[\"keep_target_center_fraction\"],\n",
     ")\n",
     "model = model.to(device)\n",
-    "model.train()\n",
     "\n",
     "print(f\"Model loaded: {config['model_name']}\")\n",
     "print(f\"Number of bigwig tracks: {len(bigwig_ids)}\")\n",
@@ -498,7 +475,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 22,
    "metadata": {},
    "outputs": [],
    "source": [
@@ -539,8 +516,7 @@
     "            _bigwig_cache[cache_key] = pyBigWig.open(abs_path)\n",
     "        except Exception as e:\n",
     "            raise RuntimeError(\n",
-    "                f\"Failed to open BigWig file: {abs_path}\\n\"\n",
-    "                f\"Error: {str(e)}\\n\"\n",
     "                f\"File exists: {Path(abs_path).exists()}\\n\"\n",
     "                f\"File size: {Path(abs_path).stat().st_size if Path(abs_path).exists() else 'N/A'} bytes\"\n",
     "            ) from e\n",
@@ -550,38 +526,10 @@
     "\n",
     "class GenomeBigWigDataset(Dataset):\n",
     "    \"\"\"\n",
-    "    Random genomic windows from a reference genome + bigWig signal.\n",
-    "\n",
-    "    Each sample:\n",
-    "        - picks a random region from the specified split,\n",
-    "        - picks a random window of length `sequence_length` within that region,\n",
-    "        - returns (sequence, signal, chrom, start, end).\n",
-    "\n",
-    "    This dataset is compatible with multi-worker DataLoaders. BigWig files\n",
-    "    are opened lazily using a process-local cache, ensuring each worker process\n",
-    "    has its own file handles and avoiding concurrent access issues.\n",
-    "\n",
-    "    Args\n",
-    "    ----\n",
-    "    fasta_path : str\n",
-    "        Path to the reference genome FASTA (e.g. hg38.fna).\n",
-    "    bigwig_path_list : list[str]\n",
-    "        List of paths to bigWig files.\n",
-    "    chrom_regions : pd.DataFrame\n",
-    "        DataFrame with columns: chr_name, start, end, split.\n",
-    "        Contains all genomic regions with their split assignments.\n",
-    "    split : str\n",
-    "        Split name to filter regions (e.g., \"train\", \"val\", \"test\").\n",
-    "    sequence_length : int\n",
-    "        Length of each random window (in bp).\n",
-    "    num_samples : int\n",
-    "        Number of samples the dataset will provide (len(dataset)).\n",
-    "    tokenizer : AutoTokenizer\n",
-    "        Tokenizer to use for tokenization.\n",
-    "    transform_fn : Callable\n",
-    "        Function used to transform/scale bigwig targets.\n",
-    "    keep_target_center_fraction : float\n",
-    "        Fraction of center sequence to keep for target prediction (crops edges to focus on center).\n",
     "    \"\"\"\n",
     "\n",
     "    def __init__(\n",
@@ -622,9 +570,6 @@
     "            # Store valid region\n",
     "            self.valid_regions.append((row.chr_name, row.start, row.end))\n",
     "\n",
-    "        if not self.valid_regions:\n",
-    "            raise ValueError(f\"No valid regions found for split '{split}'\")\n",
-    "\n",
     "    def __len__(self):\n",
     "        return self.num_samples\n",
     "\n",
@@ -664,10 +609,7 @@
     "        \n",
     "        # Crop targets to center fraction\n",
     "        if self.keep_target_center_fraction < 1.0:\n",
-    "            seq_len = bigwig_targets.shape[0]  # First dimension is sequence length\n",
-    "            target_offset = int(seq_len * (1 - self.keep_target_center_fraction) // 2)\n",
-    "            target_length = seq_len - 2 * target_offset\n",
-    "            bigwig_targets = bigwig_targets[target_offset:target_offset + target_length, :]\n",
     "\n",
     "        # Apply scaling to targets\n",
     "        bigwig_targets = self.transform_fn(bigwig_targets)\n",
@@ -691,7 +633,7 @@
   },
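The sampling strategy described in the dataset docstring (pick a random region from the split, then a random window of `sequence_length` inside it) can be sketched with the standard library alone. This is an illustration under stated assumptions, not the notebook's implementation: `sample_window` is a hypothetical name, regions are chosen uniformly, and every region is assumed to be at least `sequence_length` long.

```python
import random
from typing import List, Tuple

Region = Tuple[str, int, int]  # (chrom, start, end)


def sample_window(
    regions: List[Region], sequence_length: int, rng: random.Random
) -> Region:
    """Pick a random region, then a random fixed-length window inside it."""
    chrom, start, end = rng.choice(regions)
    # Latest admissible start so that the window stays inside the region.
    max_start = end - sequence_length
    if max_start < start:
        raise ValueError("region shorter than sequence_length")
    window_start = rng.randint(start, max_start)  # inclusive on both ends
    return chrom, window_start, window_start + sequence_length


rng = random.Random(0)
regions = [("chr1", 0, 100_000), ("chr2", 50_000, 120_000)]
print(sample_window(regions, 32_768, rng))
```

Passing an explicit `random.Random` instance keeps the sketch reproducible; in a multi-worker setting each worker would need its own seed to avoid drawing identical windows.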
   {
    "cell_type": "code",
-   "execution_count": 8,
    "metadata": {},
    "outputs": [],
    "source": [
@@ -699,13 +641,7 @@
     "    metadata_df: pd.DataFrame\n",
     ") -> Callable[[torch.Tensor], torch.Tensor]:\n",
     "    \"\"\"\n",
-    "    Build a scaling function based on track means contained in the metadata.\n",
-    "\n",
-    "    Args:\n",
-    "        metadata_df: pandas.DataFrame with track means\n",
-    "\n",
-    "    Returns:\n",
-    "        Transform function that scales input tensors\n",
     "    \"\"\"\n",
     "    # Open bigwig files and compute track statistics\n",
     "    track_means = metadata_df[\"mean\"].to_numpy()\n",
@@ -716,9 +652,6 @@
     "    track_means_tensor = torch.tensor(track_means, dtype=torch.float32)\n",
     "\n",
     "    def transform_fn(x: torch.Tensor) -> torch.Tensor:\n",
-    "        \"\"\"\n",
-    "        x: torch.Tensor, shape (seq_len, num_tracks) or (batch, seq_len, num_tracks)\n",
-    "        \"\"\"\n",
     "        # Move constants to correct device then normalize\n",
     "        means = track_means_tensor.to(x.device)\n",
     "        scaled = x / means\n",
@@ -879,22 +812,30 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 25,
    "metadata": {},
    "outputs": [],
    "source": [
     "class TracksMetrics:\n",
-    "    \"\"\"Simple metrics tracker for tracks prediction.\"\"\"\n",
     "    \n",
-    "    def __init__(self, track_names: List[str]):\n",
     "        self.track_names = track_names\n",
     "        self.num_tracks = len(track_names)\n",
-    "        self.pearson_metric = PearsonCorrCoef(num_outputs=self.num_tracks).to(device)\n",
-    "        self.pearson_metric.set_dtype(torch.float64)\n",
     "        self.losses = []\n",
     "    \n",
     "    def reset(self):\n",
-    "        self.pearson_metric.reset()\n",
     "        self.losses = []\n",
     "    \n",
     "    def update(\n",
@@ -904,51 +845,70 @@
     "        loss: float\n",
     "    ):\n",
     "        \"\"\"\n",
-    "        Update metrics.\n",
-    "        Args:\n",
-    "            predictions: (batch, seq_len, num_tracks)\n",
-    "            targets: (batch, seq_len, num_tracks)\n",
-    "            loss: scalar loss value\n",
     "        \"\"\"\n",
     "        # Flatten batch and sequence dimensions\n",
-    "        pred_flat = predictions.detach().reshape(-1, self.num_tracks)  # (N, num_tracks)\n",
-    "        target_flat = targets.detach().reshape(-1, self.num_tracks)  # (N, num_tracks)\n",
-    "        \n",
-    "        # Convert to float64 for improved numerical stability in Pearson correlation\n",
-    "        pred_flat = pred_flat.to(torch.float64)\n",
-    "        target_flat = target_flat.to(torch.float64)\n",
-    "        self.pearson_metric.update(pred_flat, target_flat)\n",
     "        \n",
     "        self.losses.append(loss)\n",
     "    \n",
     "    def compute(self) -> Dict[str, float]:\n",
-    "        \"\"\"Compute and return all metrics.\"\"\"\n",
-    "        metrics_dict = {}\n",
-    "        \n",
-    "        # Compute Pearson correlation per track\n",
-    "        # Move to CPU before converting to numpy\n",
-    "        correlations = self.pearson_metric.compute().cpu().numpy()\n",
-    "        for i, track_name in enumerate(self.track_names):\n",
-    "            metrics_dict[f\"{track_name}/pearson\"] = correlations[i]\n",
-    "        \n",
-    "        # Mean Pearson correlation\n",
-    "        metrics_dict[\"mean/pearson\"] = np.nanmean(correlations)\n",
     "        \n",
     "        # Mean loss\n",
-    "        metrics_dict[\"loss\"] = np.mean(self.losses) if self.losses else 0.0\n",
     "        \n",
-    "        return metrics_dict"
    ]
   },
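The metric computed by the removed `TracksMetrics` code (one Pearson correlation per track over the flattened batch and sequence positions, followed by a NaN-aware mean) can be sketched in pure Python. The notebook relies on `PearsonCorrCoef` and `np.nanmean`; the functions below are hypothetical stand-ins written only to make the computation explicit.

```python
import math
from typing import List


def pearson_per_track(
    pred: List[List[float]], target: List[List[float]]
) -> List[float]:
    """pred/target: N rows (flattened positions) by num_tracks columns."""
    num_tracks = len(pred[0])
    scores = []
    for t in range(num_tracks):
        x = [row[t] for row in pred]
        y = [row[t] for row in target]
        mx, my = sum(x) / len(x), sum(y) / len(y)
        cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
        var_x = sum((a - mx) ** 2 for a in x)
        var_y = sum((b - my) ** 2 for b in y)
        denom = math.sqrt(var_x * var_y)
        # A constant track has zero variance: report NaN so the mean skips it.
        scores.append(cov / denom if denom > 0 else float("nan"))
    return scores


def nanmean(values: List[float]) -> float:
    valid = [v for v in values if not math.isnan(v)]
    return sum(valid) / len(valid) if valid else float("nan")


# Track 0 is perfectly correlated, track 1 anti-correlated, track 2 constant.
pred = [[1.0, 4.0, 0.0], [2.0, 3.0, 0.0], [3.0, 2.0, 0.0], [4.0, 1.0, 0.0]]
target = [[2.0, 1.0, 5.0], [4.0, 2.0, 6.0], [6.0, 3.0, 7.0], [8.0, 4.0, 8.0]]
scores = pearson_per_track(pred, target)
print(scores, nanmean(scores))
```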
   {
    "cell_type": "code",
-   "execution_count": 26,
    "metadata": {},
    "outputs": [],
    "source": [
-    "train_metrics = TracksMetrics(bigwig_ids)\n",
-    "val_metrics = TracksMetrics(bigwig_ids)\n",
-    "test_metrics = TracksMetrics(bigwig_ids)"
    ]
   },
   {
@@ -962,7 +922,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 27,
    "metadata": {},
    "outputs": [],
    "source": [
@@ -983,16 +943,8 @@
     "    epsilon: float = 1e-7,\n",
     ") -> tuple[torch.Tensor, torch.Tensor, torch.Tensor]:\n",
     "    \"\"\"\n",
-    "    Regression loss for bigwig tracks (Poisson-Multinomial).\n",
-    "    \n",
-    "    Args:\n",
-    "        logits: (batch, seq_length, num_tracks) - predicted counts\n",
-    "        targets: (batch, seq_length, num_tracks) - target counts\n",
-    "        shape_loss_coefficient: coefficient to weight scale loss\n",
-    "        epsilon: epsilon for numerical stability\n",
-    "    \n",
-    "    Returns:\n",
-    "        loss, scale_loss, shape_loss\n",
     "    \"\"\"\n",
     "    batch_size, seq_length, num_tracks = logits.shape\n",
     "    \n",
@@ -1044,14 +996,16 @@
   },
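The body of the Poisson-Multinomial loss is not visible in this diff, so the following is only a sketch of one common formulation for a single track: a Poisson term on the total signal (scale) plus a multinomial term on the normalized profile (shape). The function name, the `scale_weight` parameter, and the normalization by sequence length are assumptions, not the notebook's code.

```python
import math
from typing import List, Tuple


def poisson_multinomial_loss(
    pred: List[float],
    target: List[float],
    scale_weight: float = 1.0,  # hypothetical weighting of the scale term
    eps: float = 1e-7,
) -> Tuple[float, float, float]:
    """Return (loss, scale_loss, shape_loss) for one track of one sequence."""
    seq_len = len(pred)
    total_pred = sum(pred) + eps
    total_target = sum(target) + eps
    # Scale: Poisson negative log-likelihood on the totals
    # (the term that depends only on the target is dropped).
    scale_loss = (total_pred - total_target * math.log(total_pred)) / seq_len
    # Shape: multinomial negative log-likelihood of the target counts
    # under the predicted per-position probabilities.
    shape_loss = -sum(
        t * math.log(p / total_pred + eps) for p, t in zip(pred, target)
    ) / seq_len
    return shape_loss + scale_weight * scale_loss, scale_loss, shape_loss


target = [1.0, 2.0, 3.0, 2.0]
print(poisson_multinomial_loss([1.0, 2.0, 3.0, 2.0], target))
print(poisson_multinomial_loss([3.0, 2.0, 1.0, 2.0], target))
```

Both predictions have the same total, so their scale terms match; only the shape term penalizes the second prediction for placing the signal at the wrong positions.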
   {
    "cell_type": "code",
-   "execution_count": 28,
    "metadata": {},
    "outputs": [],
    "source": [
     "def train_step(\n",
     "    model: nn.Module,\n",
     "    batch: Dict[str, torch.Tensor],\n",
-    ") -> float:\n",
     "    \"\"\"Single training step.\"\"\"\n",
     "    tokens = batch[\"tokens\"].to(device)\n",
     "    bigwig_targets = batch[\"bigwig_targets\"].to(device)\n",
@@ -1065,19 +1019,27 @@
     "        logits=bigwig_logits,\n",
     "        targets=bigwig_targets,\n",
     "    )\n",
-    "    \n",
     "    # Backward pass\n",
     "    loss.backward()\n",
-    "    return loss.item()\n",
     "\n",
     "def validation_step(\n",
     "    model: nn.Module,\n",
     "    batch: Dict[str, torch.Tensor],\n",
     "    metrics: TracksMetrics,\n",
-    ") -> float:\n",
     "    \"\"\"Single validation step.\"\"\"\n",
-    "    model.eval()\n",
-    "    \n",
     "    tokens = batch[\"tokens\"].to(device)\n",
     "    bigwig_targets = batch[\"bigwig_targets\"].to(device)\n",
     "    \n",
@@ -1097,14 +1059,12 @@
     "            predictions=bigwig_logits,\n",
     "            targets=bigwig_targets,\n",
     "            loss=loss.item()\n",
-    "        )\n",
-    "    \n",
-    "    return loss.item()"
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 29,
    "metadata": {},
    "outputs": [
     {
@@ -2327,23 +2287,11 @@
    ],
    "source": [
     "# Training loop\n",
-    "print(\"Starting training...\")\n",
-    "print(f\"Training for {config['num_steps_training']} steps\\n\")\n",
-    "\n",
-    "model.train()\n",
-    "train_metrics.reset()\n",
-    "optimizer.zero_grad()  # Initialize gradients\n",
-    "\n",
-    "# Track metrics for plotting\n",
-    "train_steps = []\n",
-    "train_losses = []\n",
-    "train_pearson_scores = []\n",
-    "val_steps = []\n",
-    "val_losses = []\n",
-    "val_pearson_scores = []\n",
     "\n",
     "# Create iterator for training data (will cycle if needed)\n",
     "train_iter = iter(train_loader)\n",
     "\n",
     "# Main training loop\n",
     "for step_idx in range(config[\"num_steps_training\"]):\n",
@@ -2354,78 +2302,37 @@
     "        train_iter = iter(train_loader)\n",
     "        batch = next(train_iter)\n",
     "    \n",
-    "    # Forward pass and backward pass\n",
-    "    loss = train_step(model, batch)\n",
-    "    \n",
-    "    # Update optimizer\n",
-    "    optimizer.step()\n",
-    "    optimizer.zero_grad()\n",
-    "    \n",
-    "    # Update metrics\n",
-    "    tokens = batch[\"tokens\"].to(device)\n",
-    "    bigwig_targets = batch[\"bigwig_targets\"].to(device)\n",
-    "    with torch.no_grad():\n",
-    "        outputs = model(tokens=tokens)\n",
-    "        bigwig_logits = outputs[\"bigwig_tracks_logits\"]\n",
-    "        \n",
-    "        train_metrics.update(\n",
-    "            predictions=bigwig_logits,\n",
-    "            targets=bigwig_targets,\n",
-    "            loss=loss\n",
-    "        )\n",
-    "    \n",
     "    # Logging\n",
     "    if (step_idx + 1) % config[\"log_every_n_steps\"] == 0:\n",
-    "        train_metrics_dict = train_metrics.compute()\n",
-    "        \n",
-    "        # Get accumulated mean loss across all batches since last reset\n",
-    "        mean_loss = train_metrics_dict['loss']\n",
-    "        \n",
-    "        # Track metrics for plotting\n",
-    "        train_steps.append(step_idx + 1)\n",
-    "        train_losses.append(mean_loss)\n",
-    "        train_pearson_scores.append(train_metrics_dict['mean/pearson'])\n",
-    "        \n",
-    "        \n",
-    "        print(\n",
-    "            f\"Step {step_idx + 1}/{config['num_steps_training']} | \"\n",
-    "            f\"Loss: {mean_loss:.4f} | \"\n",
-    "            f\"Mean Pearson: {train_metrics_dict['mean/pearson']:.4f}\"\n",
-    "        )\n",
     "        train_metrics.reset()\n",
     "    \n",
     "    # Validation\n",
     "    if (step_idx + 1) % config[\"validate_every_n_steps\"] == 0:\n",
     "        print(f\"\\nRunning validation at step {step_idx + 1}...\")\n",
-    "        val_metrics.reset()\n",
     "        model.eval()\n",
     "        \n",
     "        for val_batch in val_loader:\n",
-    "            val_loss = validation_step(model, val_batch, val_metrics)\n",
-    "        \n",
-    "        # Print validation metrics\n",
-    "        val_metrics_dict = val_metrics.compute()\n",
-    "        val_pearson_mean = val_metrics_dict['mean/pearson']\n",
-    "        \n",
-    "        # Track validation metrics\n",
-    "        val_steps.append(step_idx + 1)\n",
-    "        val_losses.append(val_metrics_dict['loss'])\n",
-    "        val_pearson_scores.append(val_pearson_mean)\n",
-    "        \n",
     "        \n",
-    "        print(f\"  Validation Loss: {val_metrics_dict['loss']:.4f}\")\n",
-    "        print(f\"  Validation Mean Pearson: {val_pearson_mean:.4f}\")\n",
-    "        for track_name in bigwig_ids:\n",
-    "            print(f\"    {track_name}/pearson: {val_metrics_dict[f'{track_name}/pearson']:.4f}\")\n",
-    "        \n",
-    "        model.train()  # Back to training mode\n",
     "\n",
     "print(f\"\\nTraining completed after {config['num_steps_training']} steps.\")\n"
    ]
   },
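The step-based loop above restarts the data iterator whenever it is exhausted, so training length is set by a step count rather than by epochs. A minimal sketch of that pattern as a reusable generator follows; `cycle_batches` is a hypothetical name, and the loader is assumed to be non-empty.

```python
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def cycle_batches(loader: Iterable[T], num_steps: int) -> Iterator[T]:
    """Yield exactly num_steps batches, restarting the loader when exhausted."""
    it = iter(loader)
    for _ in range(num_steps):
        try:
            batch = next(it)
        except StopIteration:
            # Loader exhausted: build a fresh iterator and continue.
            it = iter(loader)
            batch = next(it)
        yield batch


steps = list(cycle_batches(["b0", "b1", "b2"], num_steps=7))
print(steps)  # ['b0', 'b1', 'b2', 'b0', 'b1', 'b2', 'b0']
```

With a PyTorch `DataLoader` that has shuffling enabled, each fresh iterator draws a new order, so restarting does not replay the same batch sequence.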
   {
    "cell_type": "code",
-   "execution_count": 30,
    "metadata": {},
    "outputs": [
     {
@@ -2441,12 +2348,14 @@
    ],
    "source": [
     "# Plot training results\n",
-    "fig, axes = plt.subplots(1, 2, figsize=(16, 6))\n",
     "\n",
     "# Plot Loss\n",
-    "axes[0].plot(train_steps, train_losses, 'b-o', label='Train Loss', markersize=4, linewidth=1.5)\n",
-    "if val_steps:\n",
-    "    axes[0].plot(val_steps, val_losses, 'r-s', label='Val Loss', markersize=4, linewidth=1.5)\n",
     "axes[0].set_xlabel('Step')\n",
     "axes[0].set_ylabel('Loss')\n",
     "axes[0].set_title('Loss')\n",
@@ -2454,17 +2363,13 @@
     "axes[0].grid(True, alpha=0.3)\n",
     "\n",
     "# Plot Pearson Correlation\n",
-    "axes[1].plot(train_steps, train_pearson_scores, 'g-o', label='Train Pearson', markersize=4, linewidth=1.5)\n",
-    "if val_steps:\n",
-    "    axes[1].plot(val_steps, val_pearson_scores, 'orange', marker='s', label='Val Pearson', markersize=4, linewidth=1.5)\n",
     "axes[1].set_xlabel('Step')\n",
     "axes[1].set_ylabel('Pearson Correlation')\n",
     "axes[1].set_title('Mean Pearson Correlation')\n",
     "axes[1].legend()\n",
-    "axes[1].grid(True, alpha=0.3)\n",
-    "\n",
-    "plt.tight_layout()\n",
-    "plt.show()\n"
    ]
   },
   {

     "\n",
     "This notebook demonstrates a **simplified fine-tuning setup** that enables training of a pre-trained Nucleotide Transformer v3 (NTv3) model to predict BigWig signal tracks directly from DNA sequences. The streamlined approach leverages a pre-trained NTv3 backbone as a feature extractor and adds a custom prediction head that outputs single-nucleotide resolution signal values for various genomic tracks (e.g., ChIP-seq, ATAC-seq, RNA-seq).\n",
     "\n",
+    "📊 We provide access to the NTv3 benchmark data released as the Hugging Face dataset `InstaDeepAI/NTv3_benchmark_dataset`. The repository contains ready-to-use genome FASTA files, BigWig tracks, and metadata, as well as the splits used for the benchmark.\n",
     "\n",
     "**🔧 Main Simplifications**: Compared to the full supervised tracks pipeline, this notebook simplifies several aspects to enable faster iteration:\n",
     "- **Random sequence sampling**: The dataset randomly samples sequences from chromosomes/regions on-the-fly, rather than using pre-computed sliding windows\n",
     "- **Constant learning rate**: Uses a fixed learning rate throughout training without learning rate scheduling\n",
     "- **No gradient accumulation**: Implements simple step-based training without gradient accumulation, making the training loop more straightforward\n",
     "\n",
+    "**⚡ Key Advantage**: This simplified pipeline achieves performance close to that of more complex training approaches while enabling fast fine-tuning: on an H100 GPU with 16 data-loading workers, it takes ~15 min to reach acceptable performance on a 32 kb functional-track prediction task with the **NTv3_8M_pre** model. Training speed benefits from the efficient NTv3 architecture but depends on your hardware (GPU acceleration and multi-worker data loading significantly reduce training time).\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 💻 A note on hardware\n",
     "\n",
+    "While this pipeline is designed to run on limited resources (e.g., Google Colab with a T4 GPU and 2 CPUs), the training time quoted above and the performance shown in the **Test evaluation** section were obtained on a more powerful setup. Reaching similar performance requires **significant hardware resources** (high-end GPUs with substantial memory and multiple data-loading workers). Training times will vary significantly with your hardware configuration.\n",
     "\n",
+    "📝 Note for Google Colab users: This notebook is compatible with Colab and designed to work with limited resources! For faster training, make sure to enable GPU: Runtime → Change runtime type → GPU (T4 or better recommended)."
    ]
   },
   {
    "metadata": {},
    "outputs": [],
    "source": [
     "import functools\n",
     "import fnmatch\n",
     "import os\n",
     "from pathlib import Path\n",
     "from typing import Callable, Dict, List\n",
     "\n",
     "from huggingface_hub import HfApi, snapshot_download\n",
     "import matplotlib.pyplot as plt\n",
     "import numpy as np\n",
    "metadata": {},
    "source": [
     "# 1. ⚙️ Configuration\n",
     "\n",
+    "⏳ The parameters below are pre-configured to enable training on a T4 GPU (free on Colab). For faster training, use a more powerful GPU and increase the `batch_size`, `learning_rate`, and `num_steps_training` parameters. To speed up dataloading, consider increasing the `num_workers` value if memory and CPU resources allow.\n",
+    "  \n",
+    "🕰️ The current configuration reaches decent performance and completes training in ~1h30 on a Colab environment with one T4 GPU and 2 CPUs. \n",
     "\n",
     "## Configuration Parameters\n",
     "\n",
   },
   {
    "cell_type": "code",
+   "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
     "    # FASTA file\n",
     "    fasta_path_repo = f\"{species}/genome.fasta\"\n",
     "    fasta_path = str(local_dir / fasta_path_repo)\n",
     "    \n",
     "    # BigWig files - use downloaded files directly\n",
     "    bigwig_dir = local_dir / species / \"functional_tracks\"\n",
     "    # Splits file\n",
     "    splits_path_repo = f\"{species}/splits.bed\"\n",
     "    splits_path = local_dir / splits_path_repo\n",
+    "\n",
     "    splits_df = pd.read_csv(\n",
     "        splits_path, \n",
     "        sep=\"\\t\", \n",
     "    metadata_df = pd.read_csv(metadata_path, sep=\"\\t\")\n",
     "\n",
     "    # Filter metadata according to species\n",
+    "    metadata_df = metadata_df[metadata_df[\"species_common_name\"] == species].reset_index(drop=True)\n",
     "\n",
     "    # Order metadata according to bigwig file ids\n",
     "    metadata_df = (\n",
    "source": [
     "# 3. 🧠 Model and tokenizer setup\n",
     " \n",
+    "This section sets up the model by extending any pretrained backbone from HuggingFace Transformers (for example, `InstaDeepAI/ntv3_650M_pre`) with a custom linear head.\n",
+    "This linear head is trained for regression on a set of genomic tracks, allowing the model to make predictions for each track at single nucleotide resolution.\n",
+    "Predictions are center-cropped to focus on the central portion of the input sequence (configurable via `keep_target_center_fraction`), which helps reduce edge effects from sequence context windows.\n"
    ]
   },
   {
    "cell_type": "code",
+   "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
+    "def crop_center(x: np.ndarray, keep_target_center_fraction: float = 0.375) -> np.ndarray:\n",
+    "    \"\"\"Crop the central sequence-length fraction for arrays of size (..., seq_len, num_tracks)\"\"\"\n",
+    "    seq_len = x.shape[-2]\n",
+    "    target_offset = int(seq_len * (1 - keep_target_center_fraction) // 2)\n",
+    "    target_length = seq_len - 2 * target_offset\n",
+    "    return x[..., target_offset:target_offset + target_length, :]\n",
+    "\n",
     "class LinearHead(nn.Module):\n",
     "    \"\"\"A linear head that predicts one scalar value per track.\"\"\"\n",
     "    def __init__(self, embed_dim: int, num_labels: int):\n",
     "        self.backbone = torch.compile(backbone)\n",
     "        \n",
     "        self.keep_target_center_fraction = keep_target_center_fraction\n",
+    "        embed_dim = self.config.embed_dim\n",
     "        \n",
     "        # Bigwig head (NTv3 outputs at single-nucleotide resolution)\n",
     "        self.bigwig_head = LinearHead(embed_dim, len(bigwig_track_names))\n",
     "        \n",
     "        # Crop to center fraction\n",
     "        if self.keep_target_center_fraction < 1.0:\n",
+    "            embedding = crop_center(embedding, self.keep_target_center_fraction)\n",
     "        \n",
     "        # Predict bigwig tracks\n",
     "        bigwig_logits = self.bigwig_head(embedding)\n",
   },
   {
    "cell_type": "code",
+   "execution_count": null,
    "metadata": {},
    "outputs": [
     {
     "    keep_target_center_fraction=config[\"keep_target_center_fraction\"],\n",
     ")\n",
     "model = model.to(device)\n",
     "\n",
     "print(f\"Model loaded: {config['model_name']}\")\n",
     "print(f\"Number of bigwig tracks: {len(bigwig_ids)}\")\n",
   },
   {
    "cell_type": "code",
+   "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
     "            _bigwig_cache[cache_key] = pyBigWig.open(abs_path)\n",
     "        except Exception as e:\n",
     "            raise RuntimeError(\n",
+    "                f\"Failed to open BigWig file: {abs_path} with error: {str(e)}\\n\"\n",
     "                f\"File exists: {Path(abs_path).exists()}\\n\"\n",
     "                f\"File size: {Path(abs_path).stat().st_size if Path(abs_path).exists() else 'N/A'} bytes\"\n",
     "            ) from e\n",
     "\n",
     "class GenomeBigWigDataset(Dataset):\n",
     "    \"\"\"\n",
+    "    A PyTorch dataset to access a reference genome and bigwig tracks. The dataset is \n",
+    "    compatible with multi-worker DataLoaders (using process-local file handles and lazy \n",
+    "    loading). For each sample, a random genomic region is picked from the specified split,\n",
+    "    and a random window of length `sequence_length` within that region is returned.\n",
     "    \"\"\"\n",
     "\n",
     "    def __init__(\n",
     "            # Store valid region\n",
     "            self.valid_regions.append((row.chr_name, row.start, row.end))\n",
     "\n",
     "    def __len__(self):\n",
     "        return self.num_samples\n",
     "\n",
     "        \n",
     "        # Crop targets to center fraction\n",
     "        if self.keep_target_center_fraction < 1.0:\n",
+    "            bigwig_targets = crop_center(bigwig_targets, self.keep_target_center_fraction)\n",
     "\n",
     "        # Apply scaling to targets\n",
     "        bigwig_targets = self.transform_fn(bigwig_targets)\n",
   },
   {
    "cell_type": "code",
+   "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
     "    metadata_df: pd.DataFrame\n",
     ") -> Callable[[torch.Tensor], torch.Tensor]:\n",
     "    \"\"\"\n",
+    "    Build a scaling function that uses the track means to normalise and soft-clip the targets.\n",
     "    \"\"\"\n",
     "    # Read the precomputed per-track means from the metadata\n",
     "    track_means = metadata_df[\"mean\"].to_numpy()\n",
     "    track_means_tensor = torch.tensor(track_means, dtype=torch.float32)\n",
     "\n",
     "    def transform_fn(x: torch.Tensor) -> torch.Tensor:\n",
     "        # Move constants to correct device then normalize\n",
     "        means = track_means_tensor.to(x.device)\n",
     "        scaled = x / means\n",
   },
   {
    "cell_type": "code",
+   "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
     "class TracksMetrics:\n",
+    "    \"\"\"Accumulates per-track Pearson correlations and losses between logging steps.\"\"\"\n",
     "    \n",
+    "    def __init__(self, track_names: List[str], split: str):\n",
     "        self.track_names = track_names\n",
     "        self.num_tracks = len(track_names)\n",
+    "        self.split = split\n",
+    "\n",
+    "        # Initialise metrics \n",
+    "        self.pearson = PearsonCorrCoef(num_outputs=self.num_tracks).to(device)\n",
+    "        self.pearson.set_dtype(torch.float64) # Use float64 for improved numerical stability\n",
     "        self.losses = []\n",
+    "\n",
+    "        # Record mean metrics per logging interval\n",
+    "        self.step_idxs = []\n",
+    "        self.mean_pearsons = []\n",
+    "        self.mean_losses = []\n",
     "    \n",
     "    def reset(self):\n",
+    "        self.pearson.reset()\n",
     "        self.losses = []\n",
     "    \n",
     "    def update(\n",
     "        loss: float\n",
     "    ):\n",
     "        \"\"\"\n",
+    "        Update the metrics with predictions and targets of shape (..., num_tracks) and a scalar loss.\n",
     "        \"\"\"\n",
     "        # Flatten batch and sequence dimensions\n",
+    "        pred_flat = predictions.detach().reshape(-1, self.num_tracks).to(torch.float64)  # (N, num_tracks)\n",
+    "        target_flat = targets.detach().reshape(-1, self.num_tracks).to(torch.float64)  # (N, num_tracks)\n",
     "        \n",
+    "        # Update metrics\n",
+    "        self.pearson.update(pred_flat, target_flat)\n",
     "        self.losses.append(loss)\n",
     "    \n",
     "    def compute(self) -> Dict[str, float]:\n",
+    "        \"\"\"Compute the per-track Pearson correlations and the mean loss, and return them as a dictionary of metrics.\"\"\"\n",
+    "        # Per-track Pearson correlations\n",
+    "        correlations = self.pearson.compute().cpu().numpy()\n",
+    "        metrics_dict = {\n",
+    "            f\"{track_name}/pearson\": correlations[i] for i, track_name in enumerate(self.track_names)\n",
+    "        }\n",
+    "        metrics_dict[\"mean/pearson\"] = correlations.mean()\n",
     "        \n",
     "        # Mean loss\n",
+    "        metrics_dict[\"loss\"] = np.mean(self.losses)\n",
+    "        \n",
+    "        return metrics_dict\n",
+    "\n",
+    "    def update_mean_metrics(self, step_idx: int):\n",
+    "        \"\"\"Update the mean metrics over the logging interval and save to a csv file.\"\"\"\n",
+    "        # Update mean metrics with the mean pearson & average loss\n",
+    "        metrics_dict = self.compute()\n",
+    "        self.step_idxs.append(step_idx)\n",
+    "        self.mean_pearsons.append(metrics_dict[\"mean/pearson\"])\n",
+    "        self.mean_losses.append(metrics_dict[\"loss\"])\n",
+    "\n",
+    "        # Save metrics to a csv for plotting\n",
+    "        data = {\n",
+    "            \"step\": self.step_idxs,\n",
+    "            \"mean_loss\": self.mean_losses,\n",
+    "            \"mean_pearson\": self.mean_pearsons,\n",
+    "        }\n",
+    "        df = pd.DataFrame(data)\n",
+    "        df.to_csv(f\"metrics_{self.split}.csv\", index=False)\n",
     "        \n",
+    "    def print_metrics(self, print_per_track: bool = False):\n",
+    "        \"\"\"Print a summary of the metrics.\"\"\"\n",
+    "        print(\n",
+    "            f\"Step {self.step_idxs[-1]}/{config['num_steps_training']} | \"\n",
+    "            f\"Loss: {self.mean_losses[-1]:.4f} | \"\n",
+    "            f\"Mean Pearson: {self.mean_pearsons[-1]:.4f}\"\n",
+    "        )\n",
+    "        if print_per_track:\n",
+    "            # Per-track values are only computed when they are printed\n",
+    "            metrics_dict = self.compute()\n",
+    "            for metric_key, metric_value in metrics_dict.items():\n",
+    "                print(f\"    {metric_key}: {metric_value:.4f}\")\n",
+    "    "
    ]
   },
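For intuition about what the per-track correlation reported by `TracksMetrics` measures, the following pure-Python sketch computes Pearson's r independently for each track column after the batch and sequence axes have been flattened. It is not the `PearsonCorrCoef` metric used in the notebook; `pearson_per_track` and the toy values are assumptions for illustration only.

```python
import math
from typing import List


def pearson_per_track(preds: List[List[float]], targets: List[List[float]]) -> List[float]:
    """Pearson r per track; rows are flattened positions, columns are tracks."""
    n = len(preds)
    num_tracks = len(preds[0])
    correlations = []
    for t in range(num_tracks):
        p = [row[t] for row in preds]
        y = [row[t] for row in targets]
        mean_p, mean_y = sum(p) / n, sum(y) / n
        cov = sum((a - mean_p) * (b - mean_y) for a, b in zip(p, y))
        var_p = sum((a - mean_p) ** 2 for a in p)
        var_y = sum((b - mean_y) ** 2 for b in y)
        # Note: a constant track has zero variance and would divide by zero here
        correlations.append(cov / math.sqrt(var_p * var_y))
    return correlations


# Toy data: 3 positions, 2 tracks
toy_preds = [[1.0, 3.0], [2.0, 2.0], [3.0, 1.0]]
toy_targets = [[2.0, 1.0], [4.0, 2.0], [6.0, 3.0]]
per_track = pearson_per_track(toy_preds, toy_targets)
mean_pearson = sum(per_track) / len(per_track)
print(per_track, mean_pearson)
```

In this toy case the first track is perfectly correlated and the second is perfectly anti-correlated, so the mean across tracks is zero even though each track is individually predictable. This is one reason to inspect per-track values alongside the mean.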
   {
    "cell_type": "code",
+   "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
+    "train_metrics = TracksMetrics(bigwig_ids, \"train\")\n",
+    "val_metrics = TracksMetrics(bigwig_ids, \"val\")\n",
+    "test_metrics = TracksMetrics(bigwig_ids, \"test\")"
    ]
   },
   {
   },
   {
    "cell_type": "code",
+   "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
     "    epsilon: float = 1e-7,\n",
     ") -> tuple[torch.Tensor, torch.Tensor, torch.Tensor]:\n",
     "    \"\"\"\n",
+    "    Regression loss for bigwig tracks (Poisson-Multinomial). The logits and targets are\n",
+    "    expected to be of shape (batch, seq_length, num_tracks).\n",
     "    \"\"\"\n",
     "    batch_size, seq_length, num_tracks = logits.shape\n",
     "    \n",
   },
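The loss body is not shown in this excerpt. As a rough illustration of the idea behind a Poisson-multinomial objective, the sketch below scores a single track by combining a Poisson term on the total signal with a multinomial term on how that signal is distributed along the sequence. The function name, the `shape_weight` parameter, the per-position averaging, and the three returned values are assumptions made for this example; the notebook's actual implementation may differ.

```python
import math
from typing import List, Tuple


def poisson_multinomial_1d(
    pred: List[float],
    target: List[float],
    shape_weight: float = 1.0,  # hypothetical weighting, not taken from the notebook
    epsilon: float = 1e-7,
) -> Tuple[float, float, float]:
    """Return (loss, scale_term, shape_term) for one track, averaged per position."""
    n = len(pred)
    total_pred = sum(pred) + epsilon
    total_target = sum(target)
    # Poisson negative log-likelihood on the totals (constant term dropped)
    scale_term = total_pred - total_target * math.log(total_pred)
    # Multinomial negative log-likelihood on the positional probabilities
    shape_term = -sum(
        t * math.log(p / total_pred + epsilon) for p, t in zip(pred, target)
    )
    loss = (scale_term + shape_weight * shape_term) / n
    return loss, scale_term / n, shape_term / n


# Same total signal, but the second prediction puts the peak in the wrong place
matched = poisson_multinomial_1d([1.0, 2.0, 1.0], [1.0, 2.0, 1.0])
shifted = poisson_multinomial_1d([2.0, 1.0, 1.0], [1.0, 2.0, 1.0])
print(matched, shifted)
```

With equal totals the Poisson term is identical for both predictions, so the difference in loss comes entirely from the multinomial term, which penalises the misplaced peak.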
   {
    "cell_type": "code",
+   "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
     "def train_step(\n",
     "    model: nn.Module,\n",
+    "    optimizer: torch.optim.Optimizer,\n",
     "    batch: Dict[str, torch.Tensor],\n",
+    "    train_metrics: TracksMetrics,\n",
+    ") -> None:\n",
     "    \"\"\"Single training step.\"\"\"\n",
     "    tokens = batch[\"tokens\"].to(device)\n",
     "    bigwig_targets = batch[\"bigwig_targets\"].to(device)\n",
     "        logits=bigwig_logits,\n",
     "        targets=bigwig_targets,\n",
     "    )\n",
+    "\n",
     "    # Backward pass\n",
+    "    optimizer.zero_grad()\n",
     "    loss.backward()\n",
+    "    optimizer.step()\n",
+    "\n",
+    "    # Update metrics\n",
+    "    train_metrics.update(\n",
+    "        predictions=bigwig_logits,\n",
+    "        targets=bigwig_targets,\n",
+    "        loss=loss.item()\n",
+    "    )\n",
+    "\n",
     "\n",
     "def validation_step(\n",
     "    model: nn.Module,\n",
     "    batch: Dict[str, torch.Tensor],\n",
     "    metrics: TracksMetrics,\n",
+    ") -> None:\n",
     "    \"\"\"Single validation step.\"\"\"\n",
     "    tokens = batch[\"tokens\"].to(device)\n",
     "    bigwig_targets = batch[\"bigwig_targets\"].to(device)\n",
     "    \n",
     "            predictions=bigwig_logits,\n",
     "            targets=bigwig_targets,\n",
     "            loss=loss.item()\n",
+    "        )"
    ]
   },
   {
    "cell_type": "code",
+   "execution_count": null,
    "metadata": {},
    "outputs": [
     {
    ],
    "source": [
     "# Training loop\n",
+    "print(f\"Starting training for {config['num_steps_training']} steps\\n\")\n",
     "\n",
     "# Create iterator for training data (will cycle if needed)\n",
     "train_iter = iter(train_loader)\n",
+    "model.train()\n",
     "\n",
     "# Main training loop\n",
     "for step_idx in range(config[\"num_steps_training\"]):\n",
     "        train_iter = iter(train_loader)\n",
     "        batch = next(train_iter)\n",
     "    \n",
+    "    # Take a training step\n",
+    "    train_step(model, optimizer, batch, train_metrics)\n",
+    "\n",
     "    # Logging\n",
     "    if (step_idx + 1) % config[\"log_every_n_steps\"] == 0:\n",
+    "        train_metrics.update_mean_metrics(step_idx + 1)\n",
+    "        train_metrics.print_metrics()\n",
     "        train_metrics.reset()\n",
     "    \n",
     "    # Validation\n",
     "    if (step_idx + 1) % config[\"validate_every_n_steps\"] == 0:\n",
     "        print(f\"\\nRunning validation at step {step_idx + 1}...\")\n",
     "        model.eval()\n",
     "        \n",
     "        for val_batch in val_loader:\n",
+    "            validation_step(model, val_batch, val_metrics)\n",
     "        \n",
+    "        val_metrics.update_mean_metrics(step_idx + 1)\n",
+    "        val_metrics.print_metrics(print_per_track=True)\n",
+    "        val_metrics.reset()\n",
+    "\n",
+    "        # Back to training mode\n",
+    "        print(\"\\n\" + \"-\"*100 + \"\\nTraining metrics:\")\n",
+    "        model.train()\n",
     "\n",
     "print(f\"\\nTraining completed after {config['num_steps_training']} steps.\")\n"
    ]
   },
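The training loop appears to restart its iterator whenever the training loader is exhausted, so the number of steps is not tied to the dataset length. One way to express the same pattern is a small generator, sketched here on a toy list standing in for a `DataLoader`; `cycle_batches` is a hypothetical helper, not a function from the notebook.

```python
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def cycle_batches(loader: Iterable[T]) -> Iterator[T]:
    """Yield batches forever, re-iterating the loader each time it runs out."""
    while True:
        empty = True
        for batch in loader:
            empty = False
            yield batch
        if empty:
            # Guard against spinning forever on a loader with no batches
            raise ValueError("loader yielded no batches")


# Toy stand-in for a DataLoader with three batches
toy_loader = ["a", "b", "c"]
batches = cycle_batches(toy_loader)
seen = [next(batches) for _ in range(7)]
print(seen)  # ['a', 'b', 'c', 'a', 'b', 'c', 'a']
```

Unlike `itertools.cycle`, which caches the items from the first pass, this generator re-iterates the loader, so a shuffling loader would produce a fresh order on each pass.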
   {
    "cell_type": "code",
+   "execution_count": null,
    "metadata": {},
    "outputs": [
     {
    ],
    "source": [
     "# Plot training results\n",
+    "fig, axes = plt.subplots(1, 2, figsize=(12, 5))\n",
+    "\n",
+    "df_train = pd.read_csv(\"metrics_train.csv\")\n",
+    "df_val = pd.read_csv(\"metrics_val.csv\")\n",
     "\n",
     "# Plot Loss\n",
+    "axes[0].plot(df_train[\"step\"], df_train[\"mean_loss\"], 'b-o', label='Train Loss', markersize=4, linewidth=1.5)\n",
+    "axes[0].plot(df_val[\"step\"], df_val[\"mean_loss\"], 'r-s', label='Val Loss', markersize=4, linewidth=1.5)\n",
     "axes[0].set_xlabel('Step')\n",
     "axes[0].set_ylabel('Loss')\n",
     "axes[0].set_title('Loss')\n",
     "axes[0].grid(True, alpha=0.3)\n",
     "\n",
     "# Plot Pearson Correlation\n",
+    "axes[1].plot(df_train[\"step\"], df_train[\"mean_pearson\"], 'g-o', label='Train Pearson', markersize=4, linewidth=1.5)\n",
+    "axes[1].plot(df_val[\"step\"], df_val[\"mean_pearson\"], 'orange', marker='s', label='Val Pearson', markersize=4, linewidth=1.5)\n",
     "axes[1].set_xlabel('Step')\n",
     "axes[1].set_ylabel('Pearson Correlation')\n",
     "axes[1].set_title('Mean Pearson Correlation')\n",
     "axes[1].legend()\n",
+    "axes[1].grid(True, alpha=0.3)"
    ]
   },
   {