ybornachot committed on
Commit eadf098 · 1 Parent(s): b0d8db2

fix: come back to older version

Files changed (1)
  1. notebooks/03_fine_tuning.ipynb +499 -170
notebooks/03_fine_tuning.ipynb CHANGED
@@ -4,15 +4,13 @@
  "cell_type": "markdown",
  "metadata": {},
  "source": [
- "# 🧬 Fine-Tuning a Model on BigWig Tracks Prediction\n",
  "\n",
  "This notebook demonstrates a **simplified fine-tuning setup** that enables training of a pre-trained Nucleotide Transformer v3 (NTv3) model to predict BigWig signal tracks directly from DNA sequences. The streamlined approach leverages a pre-trained NTv3 backbone as a feature extractor and adds a custom prediction head that outputs single-nucleotide resolution signal values for various genomic tracks (e.g., ChIP-seq, ATAC-seq, RNA-seq).\n",
  "\n",
- "**⚡ Key Advantage**: This simplified pipeline achieves **close performance to more complex training approaches** while enabling **fast fine-tuning**. The training speed benefits from the efficient NTv3 model architecture and depends on your hardware capabilities (GPU acceleration and multi-worker data loading significantly reduce training time). With NTv3 models, meaningful Pearson correlations can typically be reached within ~10 minutes of training on a 32 kb functional tracks prediction task.\n",
  "\n",
- "While this notebook currently focuses on NTv3 models, the pipeline structure can be extended to work with other foundation models. The setup is designed for rapid experimentation and iteration, making it ideal for adapting pre-trained models to your specific genomic tracks or experimental conditions without the overhead of complex distributed training infrastructure.\n",
- "\n",
- "**🔧 Main Simplifications**: Compared to the full supervised tracks prediction pipeline, this notebook simplifies several aspects to enable faster iteration:\n",
  "\n",
  "- **Data splits**: Uses simple chromosome-based train/val/test splits (e.g., assigning entire chromosomes to each split) instead of more complex region-based splits\n",
  "- **Random sequence sampling**: The dataset randomly samples sequences from chromosomes/regions on-the-fly, rather than using pre-computed sliding windows\n",
@@ -32,14 +30,7 @@
  "\n",
  "If you're interested in using pre-trained models for inference without fine-tuning, or exploring different model architectures, please refer to other notebooks in this collection. This notebook focuses specifically on the simplified fine-tuning process, which is useful when you want to quickly adapt a pre-trained model to your specific genomic tracks or improve performance on particular cell types or experimental conditions.\n",
  "\n",
- "📝 Note for Google Colab users: This notebook is compatible with Colab! For faster training, make sure to enable GPU: Runtime Change runtime type GPU (T4 or better recommended).\n"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# 0. 📦 Imports"
  ]
  },
  {
@@ -49,38 +40,14 @@
  "outputs": [],
  "source": [
  "# Install dependencies\n",
- "!pip install datasets transformers torchmetrics plotly "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 7,
- "metadata": {},
- "outputs": [],
- "source": [
- "# Imports\n",
- "from typing import List, Dict\n",
- "import os\n",
- "\n",
- "import torch\n",
- "import torch.nn as nn\n",
- "import torch.nn.functional as F\n",
- "from torch.utils.data import DataLoader\n",
- "from torch.optim import AdamW\n",
- "from transformers import AutoConfig, AutoModelForMaskedLM, AutoTokenizer\n",
- "from datasets import load_dataset\n",
- "import numpy as np\n",
- "from torchmetrics import PearsonCorrCoef\n",
- "import plotly.graph_objects as go\n",
- "from IPython.display import display\n",
- "from tqdm import tqdm"
  ]
  },
  {
  "cell_type": "markdown",
  "metadata": {},
  "source": [
- "# 1. ⚙️ Configuration\n",
  "\n",
  "## Configuration Parameters\n",
  "\n",
@@ -116,25 +83,45 @@
  "cell_type": "code",
  "execution_count": null,
  "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Using device: cpu\n"
- ]
- }
- ],
  "source": [
  "config = {\n",
  "    # Model\n",
  "    \"model_name\": \"InstaDeepAI/NTv3_8M_pre\",\n",
  "    \n",
- "    # Data - Hugging Face Dataset Configuration\n",
- "    \"dataset_name\": \"InstaDeepAI/bigwig_tracks\",  # Hugging Face dataset name or path to script\n",
  "    \"data_cache_dir\": \"./data\",\n",
  "    \"fasta_url\": \"https://hgdownload.gi.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz\",\n",
  "    \"bigwig_url_list\": [\n",
  "        \"https://www.encodeproject.org/files/ENCFF055QKS/@@download/ENCFF055QKS.bigWig\",\n",
  "        \"https://www.encodeproject.org/files/ENCFF214GOQ/@@download/ENCFF214GOQ.bigWig\",\n",
  "        \"https://www.encodeproject.org/files/ENCFF592NIB/@@download/ENCFF592NIB.bigWig\",\n",
@@ -165,8 +152,25 @@
  "\n",
  "os.makedirs(config[\"data_cache_dir\"], exist_ok=True)\n",
  "\n",
  "# Create bigwig_file_ids from filenames (without extension)\n",
  "config[\"bigwig_file_ids\"] = [\n",
  "    \"ENCSR325NFE\",\n",
  "    \"ENCSR962OTG\",\n",
  "    \"ENCSR619DQO_P\",\n",
@@ -186,7 +190,67 @@
  "cell_type": "markdown",
  "metadata": {},
  "source": [
- "# 2. 🧠 Model and tokenizer setup\n",
  " \n",
  "In this section, we set up the model and tokenizer. \n",
  " \n",
@@ -196,12 +260,12 @@
  "This linear head is trained for regression on a set of genomic tracks, \n",
  "allowing the model to make predictions for each track at single nucleotide resolution.\n",
  " \n",
- "The following code wraps the HuggingFace model together with this regression head for the end-to-end task."
  ]
  },
  {
  "cell_type": "code",
- "execution_count": 9,
  "metadata": {},
  "outputs": [],
  "source": [
@@ -269,19 +333,9 @@
  },
  {
  "cell_type": "code",
- "execution_count": 10,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Model loaded: InstaDeepAI/NTv3_8M_pre\n",
- "Number of bigwig tracks: 4\n",
- "Model parameters: 7,694,015\n"
- ]
- }
- ],
  "source": [
  "# Load tokenizer\n",
  "tokenizer = AutoTokenizer.from_pretrained(config[\"model_name\"], trust_remote_code=True)\n",
@@ -297,16 +351,168 @@
  "\n",
  "print(f\"Model loaded: {config['model_name']}\")\n",
  "print(f\"Number of bigwig tracks: {len(config['bigwig_file_ids'])}\")\n",
- "print(f\"Model parameters: {sum(p.numel() for p in model.parameters()):,}\")\n"
  ]
  },
  {
  "cell_type": "markdown",
  "metadata": {},
  "source": [
- "# 3. 📥 Dataset setup\n",
  "\n",
- "Load the Hugging Face dataset and set up the data pipeline. The dataset automatically handles downloading FASTA and BigWig files, normalizing tracks, and sampling random genomic windows."
  ]
  },
  {
@@ -315,30 +521,161 @@
  "metadata": {},
  "outputs": [],
  "source": [
- "# Chromosomes split definition\n",
- "chrom_splits = {\n",
- "    \"train\": [f\"chr{i}\" for i in range(1, 21)] + ['chrX', 'chrY'],\n",
- "    \"val\": ['chr22'],\n",
- "    \"test\": ['chr21']\n",
- "}\n",
  "\n",
- "# Number of desired samples per split\n",
- "num_samples = {\n",
- "    \"train\": config[\"num_steps_training\"] * config[\"batch_size\"],\n",
- "    \"val\": config[\"num_validation_samples\"],\n",
- "    \"test\": config[\"num_test_samples\"],\n",
- "}\n",
  "\n",
- "print(f\"Loading dataset from {config['dataset_name']}...\")\n",
- "dataset = load_dataset(\n",
- "    config[\"dataset_name\"],\n",
- "    data_files=chrom_splits,\n",
- "    num_samples=num_samples,\n",
- "    fasta_url=config[\"fasta_url\"],\n",
- "    bigwig_urls=config[\"bigwig_url_list\"],\n",
- "    sequence_length=config[\"sequence_length\"],\n",
- "    data_dir=config[\"data_cache_dir\"],\n",
- ")"
  ]
  },
  {
@@ -347,92 +684,84 @@
  "metadata": {},
  "outputs": [],
  "source": [
- "# Tokenization function\n",
- "def tokenize_examples(examples):\n",
- "    \"\"\"Tokenize sequences and prepare targets.\"\"\"\n",
- "    sequences = examples[\"sequence\"]\n",
- "    \n",
- "    # Tokenize sequences\n",
- "    tokenized = tokenizer(\n",
- "        sequences,\n",
- "        max_length=config[\"sequence_length\"],\n",
- "        padding=\"max_length\",\n",
- "        truncation=True,\n",
- "        return_tensors=None,\n",
- "    )\n",
- "    \n",
- "    # Crop targets to center fraction if needed\n",
- "    if config[\"keep_target_center_fraction\"] < 1.0:\n",
- "        seq_len = examples[\"bigwig_targets\"].shape[0]\n",
- "        target_offset = int(seq_len * (1 - config[\"keep_target_center_fraction\"]) // 2)\n",
- "        target_length = seq_len - 2 * target_offset\n",
- "        examples[\"bigwig_targets\"] = examples[\"bigwig_targets\"][target_offset:target_offset + target_length, :]\n",
- "    \n",
- "    return {\n",
- "        \"tokens\": tokenized[\"input_ids\"],\n",
- "        \"bigwig_targets\": examples[\"bigwig_targets\"],\n",
- "    }\n",
  "\n",
- "# Apply tokenization\n",
- "print(\"Tokenizing sequences...\")\n",
- "dataset = dataset.map(\n",
- "    tokenize_examples,\n",
- "    batched=True,\n",
- "    remove_columns=[\"sequence\"],  # Remove original sequence after tokenization\n",
  ")\n",
  "\n",
- "# Format for PyTorch\n",
- "dataset = dataset.with_format(\"torch\")\n",
  "\n",
- "dataloaders = {}\n",
- "for split_name in chrom_splits.keys():\n",
- "    dataloaders[split_name] = DataLoader(\n",
- "        dataset[split_name],\n",
- "        batch_size=config[\"batch_size\"],\n",
- "        shuffle=(split_name == \"train\"),\n",
- "        num_workers=config[\"num_workers\"],\n",
- "    )\n",
  "\n",
- "# Extract DataLoaders\n",
- "train_loader = dataloaders[\"train\"]\n",
- "val_loader = dataloaders[\"val\"]\n",
- "test_loader = dataloaders[\"test\"]\n",
  "\n",
- "print(f\"\\nData pipeline created successfully!\")\n",
- "print(f\"Train batches: {len(train_loader)}\")\n",
- "print(f\"Val batches: {len(val_loader)}\")\n",
- "print(f\"Test batches: {len(test_loader)}\")"
  ]
  },
  {
  "cell_type": "markdown",
  "metadata": {},
  "source": [
- "# 4. ⚙️ Optimizer setup\n",
  "\n",
- "Configure the AdamW optimizer with learning rate and weight decay hyperparameters. This optimizer will update the model parameters during training to minimize the loss function."
  ]
  },
  {
  "cell_type": "code",
- "execution_count": 15,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Training configuration:\n",
- "  Batch size: 32\n",
- "  Total training steps: 19932\n",
- "  Log metrics every: 40 steps\n",
- "  Validate every: 400 steps\n",
- "\n",
- "Optimizer setup:\n",
- "  Learning rate: 1e-05\n"
- ]
- }
- ],
  "source": [
  "# Training setup\n",
  "print(f\"Training configuration:\")\n",
@@ -456,14 +785,14 @@
  "cell_type": "markdown",
  "metadata": {},
  "source": [
- "# 5. 📊 Metrics setup\n",
  "\n",
  "Set up evaluation metrics to track model performance during training and validation. We use Pearson correlation coefficients to measure how well the predicted BigWig signals match the ground truth signals."
  ]
  },
  {
  "cell_type": "code",
- "execution_count": 16,
  "metadata": {},
  "outputs": [],
  "source": [
@@ -536,7 +865,7 @@
  },
  {
  "cell_type": "code",
- "execution_count": 17,
  "metadata": {},
  "outputs": [],
  "source": [
@@ -549,14 +878,14 @@
  "cell_type": "markdown",
  "metadata": {},
  "source": [
- "# 6. 📉 Loss function\n",
  "\n",
  "Define the Poisson-Multinomial loss function that captures both the scale (total signal) and shape (distribution) of BigWig tracks. This loss is specifically designed for count-based genomic signal data."
  ]
  },
  {
  "cell_type": "code",
- "execution_count": 18,
  "metadata": {},
  "outputs": [],
  "source": [
@@ -631,14 +960,14 @@
  "cell_type": "markdown",
  "metadata": {},
  "source": [
- "# 7. 🏃 Training loop\n",
  "\n",
  "Run the main training loop that iterates through batches, computes gradients, and updates model parameters. The loop includes periodic validation checks and real-time metric visualization to monitor training progress."
  ]
  },
  {
  "cell_type": "code",
- "execution_count": 19,
  "metadata": {},
  "outputs": [],
  "source": [
@@ -706,7 +1035,7 @@
  },
  {
  "cell_type": "code",
- "execution_count": 18,
  "metadata": {},
  "outputs": [],
  "source": [
@@ -863,7 +1192,7 @@
  "cell_type": "markdown",
  "metadata": {},
  "source": [
- "# 9. 🧪 Test evaluation\n",
  "\n",
  "Evaluate the fine-tuned model on the held-out test set to assess final performance. This provides an unbiased estimate of how well the model generalizes to unseen genomic regions."
  ]
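
The evaluation cell is unchanged and elided; it mirrors the validation pass (a sketch, reusing the metric names from the sketches above):

    model.eval()
    pearson.reset()
    with torch.no_grad():
        for batch in test_loader:
            preds = model(batch["tokens"].to(device)).cpu()
            pearson.update(preds.reshape(-1, num_tracks),
                           batch["bigwig_targets"].reshape(-1, num_tracks))
    print(f"Test Pearson per track: {pearson.compute()}")
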
@@ -941,4 +1270,4 @@
  },
  "nbformat": 4,
  "nbformat_minor": 2
- }
 
  "cell_type": "markdown",
  "metadata": {},
  "source": [
+ "# 🧬 Fine-Tuning a Model on BigWig Tracks Prediction\n",
  "\n",
  "This notebook demonstrates a **simplified fine-tuning setup** that enables training of a pre-trained Nucleotide Transformer v3 (NTv3) model to predict BigWig signal tracks directly from DNA sequences. The streamlined approach leverages a pre-trained NTv3 backbone as a feature extractor and adds a custom prediction head that outputs single-nucleotide resolution signal values for various genomic tracks (e.g., ChIP-seq, ATAC-seq, RNA-seq).\n",
  "\n",
+ "**⚡ Key Advantage**: This simplified pipeline achieves **close performance to more complex training approaches** while enabling **relatively fast fine-tuning in approximately one hour**. The setup is designed for rapid experimentation and iteration, making it ideal for adapting pre-trained models to your specific genomic tracks or experimental conditions without the overhead of complex distributed training infrastructure.\n",
  "\n",
+ "**🔧 Main Simplifications**: Compared to the full supervised tracks pipeline, this notebook simplifies several aspects to enable faster iteration:\n",
  "\n",
  "- **Data splits**: Uses simple chromosome-based train/val/test splits (e.g., assigning entire chromosomes to each split) instead of more complex region-based splits\n",
  "- **Random sequence sampling**: The dataset randomly samples sequences from chromosomes/regions on-the-fly, rather than using pre-computed sliding windows\n",

  "\n",
  "If you're interested in using pre-trained models for inference without fine-tuning, or exploring different model architectures, please refer to other notebooks in this collection. This notebook focuses specifically on the simplified fine-tuning process, which is useful when you want to quickly adapt a pre-trained model to your specific genomic tracks or improve performance on particular cell types or experimental conditions.\n",
  "\n",
+ "📝 Note for Google Colab users: This notebook is compatible with Colab! For faster training, make sure to enable GPU: Runtime → Change runtime type → GPU (T4 or better recommended).\n"
  ]
  },
  {
 
  "outputs": [],
  "source": [
  "# Install dependencies\n",
+ "!pip install pyfaidx pyBigWig torchmetrics transformers plotly"
  ]
  },
  {
  "cell_type": "markdown",
  "metadata": {},
  "source": [
+ "# 1. 📦 Imports + Configuration\n",
  "\n",
  "## Configuration Parameters\n",
  "\n",
 
  "cell_type": "code",
  "execution_count": null,
  "metadata": {},
+ "outputs": [],
+ "source": [
+ "# 0. Imports\n",
+ "import random\n",
+ "import functools\n",
+ "from typing import List, Dict, Callable\n",
+ "import os\n",
+ "import subprocess\n",
+ "\n",
+ "import torch\n",
+ "import torch.nn as nn\n",
+ "import torch.nn.functional as F\n",
+ "from torch.utils.data import Dataset, DataLoader\n",
+ "from torch.optim import AdamW\n",
+ "from transformers import AutoConfig, AutoModelForMaskedLM, AutoTokenizer\n",
+ "import numpy as np\n",
+ "import pyBigWig\n",
+ "from pyfaidx import Fasta\n",
+ "from torchmetrics import PearsonCorrCoef\n",
+ "import plotly.graph_objects as go\n",
+ "from IPython.display import display\n",
+ "from tqdm import tqdm"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
  "source": [
  "config = {\n",
  "    # Model\n",
  "    \"model_name\": \"InstaDeepAI/NTv3_8M_pre\",\n",
  "    \n",
+ "    # Data\n",
  "    \"data_cache_dir\": \"./data\",\n",
  "    \"fasta_url\": \"https://hgdownload.gi.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz\",\n",
  "    \"bigwig_url_list\": [\n",
+ "        # \"https://www.encodeproject.org/files/ENCFF884LDL/@@download/ENCFF884LDL.bigWig\",\n",
  "        \"https://www.encodeproject.org/files/ENCFF055QKS/@@download/ENCFF055QKS.bigWig\",\n",
  "        \"https://www.encodeproject.org/files/ENCFF214GOQ/@@download/ENCFF214GOQ.bigWig\",\n",
  "        \"https://www.encodeproject.org/files/ENCFF592NIB/@@download/ENCFF592NIB.bigWig\",\n",
 
  "\n",
  "os.makedirs(config[\"data_cache_dir\"], exist_ok=True)\n",
  "\n",
+ "# Extract filenames from URLs\n",
+ "def extract_filename_from_url(url: str) -> str:\n",
+ "    \"\"\"Extract filename from URL, handling query parameters.\"\"\"\n",
+ "    # Remove query parameters if present\n",
+ "    url_clean = url.split('?')[0]\n",
+ "    # Get the last part of the URL path\n",
+ "    return url_clean.split('/')[-1]\n",
+ "\n",
+ "# Create paths for downloaded files\n",
+ "fasta_path = os.path.join(config[\"data_cache_dir\"], extract_filename_from_url(config[\"fasta_url\"]).replace('.gz', ''))\n",
+ "bigwig_path_list = [\n",
+ "    os.path.join(config[\"data_cache_dir\"], extract_filename_from_url(url))\n",
+ "    for url in config[\"bigwig_url_list\"]\n",
+ "]\n",
+ "\n",
  "# Create bigwig_file_ids from filenames (without extension)\n",
  "config[\"bigwig_file_ids\"] = [\n",
+ "    # os.path.splitext(extract_filename_from_url(url))[0]\n",
+ "    # for url in config[\"bigwig_url_list\"]\n",
  "    \"ENCSR325NFE\",\n",
  "    \"ENCSR962OTG\",\n",
  "    \"ENCSR619DQO_P\",\n",
190
  "cell_type": "markdown",
191
  "metadata": {},
192
  "source": [
193
+ "# 2. \ud83d\udce5 Genome & Tracks Data Download\n",
194
+ "\n",
195
+ "Download the reference genome FASTA file and BigWig signal tracks from public repositories. These files contain the genomic sequences and experimental signal data (e.g., ChIP-seq, ATAC-seq) that we'll use for training."
196
+ ]
197
+ },
198
+ {
199
+ "cell_type": "code",
200
+ "execution_count": null,
201
+ "metadata": {},
202
+ "outputs": [],
203
+ "source": [
204
+ "# Download fasta file\n",
205
+ "fasta_filename = extract_filename_from_url(config[\"fasta_url\"])\n",
206
+ "fasta_gz_path = os.path.join(config[\"data_cache_dir\"], fasta_filename)\n",
207
+ "\n",
208
+ "print(f\"Downloading {fasta_filename}...\")\n",
209
+ "subprocess.run([\"wget\", \"-c\", config[\"fasta_url\"], \"-O\", fasta_gz_path], check=True)\n",
210
+ "\n",
211
+ "print(f\"Extracting {fasta_filename}...\")\n",
212
+ "subprocess.run([\"gunzip\", \"-f\", fasta_gz_path], check=True)"
213
+ ]
214
+ },
215
+ {
216
+ "cell_type": "code",
217
+ "execution_count": null,
218
+ "metadata": {},
219
+ "outputs": [],
220
+ "source": [
221
+ "# Download bigwig files\n",
222
+ "for bigwig_url in config[\"bigwig_url_list\"]:\n",
223
+ " filename = extract_filename_from_url(bigwig_url)\n",
224
+ " filepath = os.path.join(config[\"data_cache_dir\"], filename)\n",
225
+ " print(f\"Downloading {filename}...\")\n",
226
+ " subprocess.run([\"wget\", \"-c\", bigwig_url, \"-O\", filepath], check=True)"
227
+ ]
228
+ },
229
+ {
230
+ "cell_type": "markdown",
231
+ "metadata": {},
232
+ "source": [
233
+ "## Data Splits Definition"
234
+ ]
235
+ },
236
+ {
237
+ "cell_type": "code",
238
+ "execution_count": null,
239
+ "metadata": {},
240
+ "outputs": [],
241
+ "source": [
242
+ "chrom_splits = {\n",
243
+ " \"train\": [f\"chr{i}\" for i in range(1, 21)] + ['chrX', 'chrY'],\n",
244
+ " \"val\": ['chr22'],\n",
245
+ " \"test\": ['chr21']\n",
246
+ "}"
247
+ ]
248
+ },
249
+ {
250
+ "cell_type": "markdown",
251
+ "metadata": {},
252
+ "source": [
253
+ "# 3. \ud83e\udde0 Model and tokenizer setup\n",
254
  " \n",
255
  "In this section, we set up the model and tokenizer. \n",
256
  " \n",
 
  "This linear head is trained for regression on a set of genomic tracks, \n",
  "allowing the model to make predictions for each track at single nucleotide resolution.\n",
  " \n",
+ "The following code wraps the HuggingFace model together with this regression head for the end-to-end task.\n"
  ]
  },
  {
  "cell_type": "code",
+ "execution_count": null,
  "metadata": {},
  "outputs": [],
  "source": [
 
  },
  {
  "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
  "source": [
  "# Load tokenizer\n",
  "tokenizer = AutoTokenizer.from_pretrained(config[\"model_name\"], trust_remote_code=True)\n",
 
  "\n",
  "print(f\"Model loaded: {config['model_name']}\")\n",
  "print(f\"Number of bigwig tracks: {len(config['bigwig_file_ids'])}\")\n",
+ "print(f\"Model parameters: {sum(p.numel() for p in model.parameters()):,}\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Scaling functions for targets\n",
+ "def compute_chromosome_stats(track_data: np.ndarray) -> dict:\n",
+ "    \"\"\"\n",
+ "    Compute minimal statistics needed for weighted mean computation.\n",
+ "    \n",
+ "    Args:\n",
+ "        track_data: numpy array of track values for a chromosome\n",
+ "    \n",
+ "    Returns:\n",
+ "        Dictionary with statistics: sum, mean, total_count\n",
+ "    \"\"\"\n",
+ "    track_data = track_data.astype(np.float32)\n",
+ "    \n",
+ "    # Compute statistics\n",
+ "    sum_all = np.sum(track_data)\n",
+ "    total_count = track_data.size\n",
+ "    mean_all = sum_all / total_count if total_count > 0 else 0.0\n",
+ "    \n",
+ "    return {\n",
+ "        \"sum\": sum_all,\n",
+ "        \"mean\": mean_all,\n",
+ "        \"total_count\": total_count,\n",
+ "    }\n",
+ "\n",
+ "\n",
+ "def aggregate_file_statistics(chr_stats_list: List[dict]) -> dict:\n",
+ "    \"\"\"\n",
+ "    Aggregate chromosome-level statistics into file-level statistics.\n",
+ "    \n",
+ "    Args:\n",
+ "        chr_stats_list: List of dictionaries, each containing chromosome-level statistics\n",
+ "    \n",
+ "    Returns:\n",
+ "        Dictionary with aggregated file-level statistics (only mean)\n",
+ "    \"\"\"\n",
+ "    # Convert to arrays for easier computation\n",
+ "    total_counts = np.array([s[\"total_count\"] for s in chr_stats_list], dtype=np.int64)\n",
+ "    means = np.array([s[\"mean\"] for s in chr_stats_list], dtype=np.float32)\n",
+ "    sums = np.array([s[\"sum\"] for s in chr_stats_list], dtype=np.float32)\n",
+ "    \n",
+ "    # Aggregate total count\n",
+ "    total_count = np.sum(total_counts)\n",
+ "    \n",
+ "    # Weighted mean: mean = sum(mean_chr * total_count_chr) / sum(total_count_chr)\n",
+ "    mean = np.sum(means * total_counts) / total_count if total_count > 0 else 0.0\n",
+ "    \n",
+ "    return {\n",
+ "        \"total_count\": total_count,\n",
+ "        \"sum\": np.sum(sums),\n",
+ "        \"mean\": mean,\n",
+ "    }\n",
+ "\n",
+ "\n",
+ "def get_track_means(bigwig_tracks_list: List[pyBigWig.pyBigWig]) -> np.ndarray:\n",
+ "    \"\"\"\n",
+ "    Get track means for normalization.\n",
+ "    Computes statistics per chromosome and aggregates using weighted averaging.\n",
+ "    \n",
+ "    Args:\n",
+ "        bigwig_tracks_list: List of pyBigWig file objects\n",
+ "    \n",
+ "    Returns:\n",
+ "        Array of track means, one per bigwig file\n",
+ "    \"\"\"\n",
+ "    track_means = []\n",
+ "    \n",
+ "    for bigwig_track in bigwig_tracks_list:\n",
+ "        chrom_lengths = bigwig_track.chroms()\n",
+ "        all_chr_stats = []\n",
+ "        \n",
+ "        # Compute statistics for each chromosome\n",
+ "        for chrom_name, chrom_length in chrom_lengths.items():\n",
+ "            try:\n",
+ "                # Get chromosome data as numpy array\n",
+ "                bw_array = np.array(\n",
+ "                    bigwig_track.values(chrom_name, 0, chrom_length, numpy=True),\n",
+ "                    dtype=np.float32\n",
+ "                )\n",
+ "                # Replace NaN with 0\n",
+ "                bw_array = np.nan_to_num(bw_array, nan=0.0)\n",
+ "                \n",
+ "                # Compute chromosome-level statistics\n",
+ "                chr_stats = compute_chromosome_stats(bw_array)\n",
+ "                all_chr_stats.append(chr_stats)\n",
+ "            except Exception as e:\n",
+ "                # Skip chromosomes that fail to load\n",
+ "                print(f\"Warning: Failed to load chromosome {chrom_name}: {e}\")\n",
+ "                continue\n",
+ "        \n",
+ "        if not all_chr_stats:\n",
+ "            raise ValueError(f\"No valid chromosomes found for bigwig track\")\n",
+ "        \n",
+ "        # Aggregate chromosome-level stats into file-level stats\n",
+ "        file_stats = aggregate_file_statistics(all_chr_stats)\n",
+ "        \n",
+ "        # Use the weighted mean for normalization\n",
+ "        track_means.append(file_stats[\"mean\"])\n",
+ "    \n",
+ "    return np.array(track_means, dtype=np.float32)\n",
+ "\n",
+ "\n",
+ "def create_targets_scaling_fn(bigwig_path_list: List[str]) -> Callable[[torch.Tensor], torch.Tensor]:\n",
+ "    \"\"\"\n",
+ "    Build a scaling function based on track means computed from bigwig files.\n",
+ "    \n",
+ "    Opens bigwig files, computes track statistics, and creates a transform function.\n",
+ "    The statistics are computed once and reused for all calls to the returned transform function.\n",
+ "    \n",
+ "    Args:\n",
+ "        bigwig_path_list: List of paths to bigwig files\n",
+ "    \n",
+ "    Returns:\n",
+ "        Transform function that scales input tensors\n",
+ "    \"\"\"\n",
+ "    # Open bigwig files and compute track statistics\n",
+ "    print(\"Computing track statistics (this may take a while)...\")\n",
+ "    bw_list = [\n",
+ "        pyBigWig.open(bigwig_path)\n",
+ "        for bigwig_path in bigwig_path_list\n",
+ "    ]\n",
+ "    track_means = get_track_means(bw_list)\n",
+ "    print(f\"Computed track means: {track_means}\")\n",
+ "    print(f\"Track means shape: {track_means.shape}\")\n",
+ "    \n",
+ "    # Create tensor from computed means\n",
+ "    track_means_tensor = torch.tensor(track_means, dtype=torch.float32)\n",
+ "    \n",
+ "    def transform_fn(x: torch.Tensor) -> torch.Tensor:\n",
+ "        \"\"\"\n",
+ "        x: torch.Tensor, shape (seq_len, num_tracks) or (batch, seq_len, num_tracks)\n",
+ "        \"\"\"\n",
+ "        # Move constants to correct device then normalize\n",
+ "        means = track_means_tensor.to(x.device)\n",
+ "        scaled = x / means\n",
+ "\n",
+ "        # Smooth clipping: if > 10, apply formula\n",
+ "        clipped = torch.where(\n",
+ "            scaled > 10.0,\n",
+ "            2.0 * torch.sqrt(scaled * 10.0) - 10.0,\n",
+ "            scaled,\n",
+ "        )\n",
+ "        return clipped\n",
+ "    \n",
+ "    return transform_fn"
  ]
  },
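
A quick sanity check of the smooth clipping defined above: the two branches agree at the threshold, since scaled = 10 gives 2 * sqrt(10 * 10) - 10 = 10, and growth beyond it is compressed, e.g. scaled = 40 maps to 2 * sqrt(40 * 10) - 10 = 30 and scaled = 1000 maps to 2 * sqrt(1000 * 10) - 10 = 190. This keeps large signal peaks in range without the hard discontinuity of a plain clamp.
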
  {
  "cell_type": "markdown",
  "metadata": {},
  "source": [
+ "# 4. 🔄 Data loading\n",
  "\n",
+ "Create PyTorch datasets and data loaders that efficiently sample random genomic windows from the reference genome and extract corresponding BigWig signal values. The dataset handles sequence tokenization, target scaling, and chromosome-based train/val/test splits."
  ]
  },
  {
 
  "metadata": {},
  "outputs": [],
  "source": [
+ "# Process-local cache for BigWig file handles (one per worker process)\n",
+ "# This allows safe multi-worker DataLoader usage\n",
+ "_bigwig_cache = {}  # Maps (process_id, file_path) -> pyBigWig handle\n",
  "\n",
  "\n",
+ "def _get_bigwig_handle(bigwig_path: str) -> pyBigWig.pyBigWig:\n",
+ "    \"\"\"Get or create a BigWig file handle for the current process.\"\"\"\n",
+ "    process_id = os.getpid()\n",
+ "    cache_key = (process_id, bigwig_path)\n",
+ "    \n",
+ "    if cache_key not in _bigwig_cache:\n",
+ "        _bigwig_cache[cache_key] = pyBigWig.open(bigwig_path)\n",
+ "    \n",
+ "    return _bigwig_cache[cache_key]\n",
+ "\n",
+ "\n",
+ "class GenomeBigWigDataset(Dataset):\n",
+ "    \"\"\"\n",
+ "    Random genomic windows from a reference genome + bigWig signal.\n",
+ "\n",
+ "    Each sample:\n",
+ "      - picks a chromosome (from `chroms`),\n",
+ "      - picks a random window of length `sequence_length`,\n",
+ "      - returns (sequence, signal, chrom, start, end).\n",
+ "\n",
+ "    This dataset is compatible with multi-worker DataLoaders. BigWig files\n",
+ "    are opened lazily using a process-local cache, ensuring each worker process\n",
+ "    has its own file handles and avoiding concurrent access issues.\n",
+ "\n",
+ "    Args\n",
+ "    ----\n",
+ "    fasta_path : str\n",
+ "        Path to the reference genome FASTA (e.g. hg38.fa).\n",
+ "    bigwig_path_list : list[str]\n",
+ "        Paths to the bigWig files (e.g. ENCFF884LDL.bigWig).\n",
+ "    chroms : List[str]\n",
+ "        Chromosome names as they appear in the bigWig (e.g. [\"chr1\", \"chr2\", ...]).\n",
+ "        Samples are drawn randomly from entire chromosomes in this list.\n",
+ "    sequence_length : int\n",
+ "        Length of each random window (in bp).\n",
+ "    num_samples : int\n",
+ "        Number of samples the dataset will provide (len(dataset)).\n",
+ "    tokenizer : AutoTokenizer\n",
+ "        Tokenizer to use for tokenization.\n",
+ "    transform_fn : Callable\n",
+ "        Function to transform/scale bigwig targets.\n",
+ "    keep_target_center_fraction : float\n",
+ "        Fraction of center sequence to keep for target prediction (crops edges to focus on center).\n",
+ "    \"\"\"\n",
+ "\n",
+ "    def __init__(\n",
+ "        self,\n",
+ "        fasta_path: str,\n",
+ "        bigwig_path_list: list[str],\n",
+ "        chroms: List[str],\n",
+ "        sequence_length: int,\n",
+ "        num_samples: int,\n",
+ "        tokenizer: AutoTokenizer,\n",
+ "        transform_fn: Callable[[torch.Tensor], torch.Tensor],\n",
+ "        keep_target_center_fraction: float = 1.0,\n",
+ "    ):\n",
+ "        super().__init__()\n",
+ "\n",
+ "        self.fasta = Fasta(fasta_path, as_raw=True, sequence_always_upper=True)\n",
+ "        # Store paths instead of opening files immediately (for multi-worker compatibility)\n",
+ "        self.bigwig_path_list = bigwig_path_list\n",
+ "        self.sequence_length = sequence_length\n",
+ "        self.num_samples = num_samples\n",
+ "        self.tokenizer = tokenizer\n",
+ "        self.transform_fn = transform_fn  # Use pre-computed transform function\n",
+ "        self.keep_target_center_fraction = keep_target_center_fraction\n",
+ "        self.chroms = chroms\n",
+ "\n",
+ "        # Get chromosome lengths from first BigWig file (lazy, cached per process)\n",
+ "        # We need this for validation, so open temporarily\n",
+ "        bw_handle = _get_bigwig_handle(bigwig_path_list[0])\n",
+ "        bw_chrom_lengths = bw_handle.chroms()  # dict: chrom -> length\n",
+ "\n",
+ "        self.valid_chroms = []\n",
+ "        self.chrom_lengths = {}\n",
+ "\n",
+ "        for c in chroms:\n",
+ "            if c not in bw_chrom_lengths or c not in self.fasta:\n",
+ "                continue\n",
+ "\n",
+ "            fa_len = len(self.fasta[c])\n",
+ "            bw_len = bw_chrom_lengths[c]\n",
+ "            L = min(fa_len, bw_len)\n",
+ "\n",
+ "            if L > self.sequence_length:\n",
+ "                self.valid_chroms.append(c)\n",
+ "                self.chrom_lengths[c] = L\n",
+ "\n",
+ "        if not self.valid_chroms:\n",
+ "            raise ValueError(\"No valid chromosomes after intersecting FASTA and bigWig.\")\n",
+ "\n",
+ "    def __len__(self):\n",
+ "        return self.num_samples\n",
+ "\n",
+ "    def __getitem__(self, idx):\n",
+ "\n",
+ "        # Sample from entire chromosomes\n",
+ "        chrom = random.choice(self.valid_chroms)\n",
+ "        chrom_len = self.chrom_lengths[chrom]\n",
+ "        max_start = chrom_len - self.sequence_length\n",
+ "        start = random.randint(0, max_start)\n",
+ "        end = start + self.sequence_length\n",
+ "\n",
+ "        # Sequence\n",
+ "        seq = self.fasta[chrom][start:end]  # string slice\n",
+ "        # Tokenize with padding and truncation to ensure consistent lengths for batching\n",
+ "        tokenized = self.tokenizer(\n",
+ "            seq,\n",
+ "            padding=\"max_length\",\n",
+ "            truncation=True,\n",
+ "            max_length=self.sequence_length,\n",
+ "            return_tensors=\"pt\",\n",
+ "        )\n",
+ "        tokens = tokenized[\"input_ids\"][0]  # Shape: (max_length,)\n",
+ "\n",
+ "        # Signal from bigWig tracks (numpy array) -> torch tensor\n",
+ "        # Get BigWig handles lazily (cached per worker process)\n",
+ "        bigwig_targets = np.array([\n",
+ "            _get_bigwig_handle(bw_path).values(chrom, start, end, numpy=True)\n",
+ "            for bw_path in self.bigwig_path_list\n",
+ "        ])  # shape (num_tracks, seq_len)\n",
+ "        # Transpose to (seq_len, num_tracks)\n",
+ "        bigwig_targets = bigwig_targets.T\n",
+ "        # pyBigWig returns NaN where no data; turn NaN into 0\n",
+ "        bigwig_targets = torch.tensor(bigwig_targets, dtype=torch.float32)\n",
+ "        bigwig_targets = torch.nan_to_num(bigwig_targets, nan=0.0)\n",
+ "        \n",
+ "        # Crop targets to center fraction\n",
+ "        if self.keep_target_center_fraction < 1.0:\n",
+ "            seq_len = bigwig_targets.shape[0]  # First dimension is sequence length\n",
+ "            target_offset = int(seq_len * (1 - self.keep_target_center_fraction) // 2)\n",
+ "            target_length = seq_len - 2 * target_offset\n",
+ "            bigwig_targets = bigwig_targets[target_offset:target_offset + target_length, :]\n",
+ "\n",
+ "        # Apply scaling to targets\n",
+ "        bigwig_targets = self.transform_fn(bigwig_targets)\n",
+ "\n",
+ "        sample = {\n",
+ "            \"tokens\": tokens,\n",
+ "            \"bigwig_targets\": bigwig_targets,\n",
+ "            \"chrom\": chrom,\n",
+ "            \"start\": start,\n",
+ "            \"end\": end,\n",
+ "        }\n",
+ "        return sample"
  ]
  },
  {
 
  "metadata": {},
  "outputs": [],
  "source": [
+ "# Create scaling function\n",
+ "targets_transform_fn = create_targets_scaling_fn(bigwig_path_list)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Create datasets & dataloaders\n",
+ "create_dataset_fn = functools.partial(\n",
+ "    GenomeBigWigDataset,\n",
+ "    fasta_path=fasta_path,\n",
+ "    bigwig_path_list=bigwig_path_list,\n",
+ "    sequence_length=config[\"sequence_length\"],\n",
+ "    tokenizer=tokenizer,\n",
+ "    transform_fn=targets_transform_fn,\n",
+ "    keep_target_center_fraction=config[\"keep_target_center_fraction\"],\n",
+ ")\n",
  "\n",
+ "train_dataset = create_dataset_fn(\n",
+ "    chroms=chrom_splits[\"train\"],\n",
+ "    num_samples=config[\"num_steps_training\"] * config[\"batch_size\"],\n",
  ")\n",
  "\n",
+ "val_dataset = create_dataset_fn(\n",
+ "    chroms=chrom_splits[\"val\"],\n",
+ "    num_samples=config[\"num_validation_samples\"],\n",
+ ")\n",
  "\n",
+ "test_dataset = create_dataset_fn(\n",
+ "    chroms=chrom_splits[\"test\"],\n",
+ "    num_samples=config[\"num_test_samples\"],\n",
+ ")\n",
  "\n",
+ "# Create dataloaders\n",
+ "train_loader = DataLoader(\n",
+ "    train_dataset,\n",
+ "    batch_size=config[\"batch_size\"],\n",
+ "    shuffle=True,\n",
+ "    num_workers=config[\"num_workers\"],\n",
+ ")\n",
+ "\n",
+ "val_loader = DataLoader(\n",
+ "    val_dataset,\n",
+ "    batch_size=config[\"batch_size\"],\n",
+ "    shuffle=False,\n",
+ "    num_workers=config[\"num_workers\"],\n",
+ ")\n",
  "\n",
+ "test_loader = DataLoader(\n",
+ "    test_dataset,\n",
+ "    batch_size=config[\"batch_size\"],\n",
+ "    shuffle=False,\n",
+ "    num_workers=config[\"num_workers\"],\n",
+ ")\n",
+ "\n",
+ "print(f\"Train samples: {len(train_dataset)}\")\n",
+ "print(f\"Val samples: {len(val_dataset)}\")\n",
+ "print(f\"Test samples: {len(test_dataset)}\")"
  ]
  },
  {
  "cell_type": "markdown",
  "metadata": {},
  "source": [
+ "# 5. ⚙️ Optimizer setup\n",
  "\n",
+ "Configure the AdamW optimizer with learning rate and weight decay hyperparameters. This optimizer will update the model parameters during training to minimize the loss function.\n",
+ "\n"
  ]
  },
  {
  "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
  "source": [
  "# Training setup\n",
  "print(f\"Training configuration:\")\n",
 
  "cell_type": "markdown",
  "metadata": {},
  "source": [
+ "# 6. 📊 Metrics setup\n",
  "\n",
  "Set up evaluation metrics to track model performance during training and validation. We use Pearson correlation coefficients to measure how well the predicted BigWig signals match the ground truth signals."
  ]
  },
  {
  "cell_type": "code",
+ "execution_count": null,
  "metadata": {},
  "outputs": [],
  "source": [
 
  },
  {
  "cell_type": "code",
+ "execution_count": null,
  "metadata": {},
  "outputs": [],
  "source": [
 
  "cell_type": "markdown",
  "metadata": {},
  "source": [
+ "# 7. 📉 Loss functions\n",
  "\n",
  "Define the Poisson-Multinomial loss function that captures both the scale (total signal) and shape (distribution) of BigWig tracks. This loss is specifically designed for count-based genomic signal data."
  ]
  },
  {
  "cell_type": "code",
+ "execution_count": null,
  "metadata": {},
  "outputs": [],
  "source": [
 
  "cell_type": "markdown",
  "metadata": {},
  "source": [
+ "# 8. 🏃 Training loop\n",
  "\n",
  "Run the main training loop that iterates through batches, computes gradients, and updates model parameters. The loop includes periodic validation checks and real-time metric visualization to monitor training progress."
  ]
  },
  {
  "cell_type": "code",
+ "execution_count": null,
  "metadata": {},
  "outputs": [],
  "source": [
 
  },
  {
  "cell_type": "code",
+ "execution_count": null,
  "metadata": {},
  "outputs": [],
  "source": [
 
  "cell_type": "markdown",
  "metadata": {},
  "source": [
+ "# 9. 🧪 Test evaluation\n",
  "\n",
  "Evaluate the fine-tuned model on the held-out test set to assess final performance. This provides an unbiased estimate of how well the model generalizes to unseen genomic regions."
  ]
 
  },
  "nbformat": 4,
  "nbformat_minor": 2
+ }