ybornachot committed
Commit a30b8d8 · 1 Parent(s): 3ffcb7a

feat: link to HF dataset to abstract data pipeline

Files changed (1)
  1. notebooks/03_fine_tuning.ipynb +569 -501
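Per the commit message, the hunks below drop the notebook's manual data plumbing (wget downloads of the hg38 FASTA and ENCODE BigWig files, the hand-rolled GenomeBigWigDataset, and the on-the-fly target-scaling statistics) in favor of a Hugging Face dataset, and the updated install cell now pulls in the datasets package. As a rough orientation only, that kind of replacement data path generally takes the shape sketched here; the dataset repo id and column names are placeholders, since the actual values used by the updated notebook are not visible in the hunks shown on this page.

# Hypothetical sketch: load a prepared sequence/track dataset from the Hugging Face Hub
# instead of downloading FASTA + BigWig files and sampling windows locally.
# The repo id and column names are placeholders, not the ones this commit wires in.
from datasets import load_dataset
from torch.utils.data import DataLoader

ds = load_dataset("InstaDeepAI/placeholder-bigwig-tracks", split="train")  # placeholder repo id
ds = ds.with_format("torch")  # columns such as "tokens" / "bigwig_targets" come back as torch tensors

train_loader = DataLoader(ds, batch_size=8, shuffle=True)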
notebooks/03_fine_tuning.ipynb CHANGED
@@ -4,13 +4,13 @@
4
  "cell_type": "markdown",
5
  "metadata": {},
6
  "source": [
7
- "# \ud83e\uddec Fine-Tuning a Model on BigWig Tracks Prediction\n",
8
  "\n",
9
  "This notebook demonstrates a **simplified fine-tuning setup** that enables training of a pre-trained Nucleotide Transformer v3 (NTv3) model to predict BigWig signal tracks directly from DNA sequences. The streamlined approach leverages a pre-trained NTv3 backbone as a feature extractor and adds a custom prediction head that outputs single-nucleotide resolution signal values for various genomic tracks (e.g., ChIP-seq, ATAC-seq, RNA-seq).\n",
10
  "\n",
11
- "**\u26a1 Key Advantage**: This simplified pipeline achieves **close performance to more complex training approaches** while enabling **relatively fast fine-tuning in approximately one hour**. The setup is designed for rapid experimentation and iteration, making it ideal for adapting pre-trained models to your specific genomic tracks or experimental conditions without the overhead of complex distributed training infrastructure.\n",
12
  "\n",
13
- "**\ud83d\udd27 Main Simplifications**: Compared to the full supervised tracks pipeline, this notebook simplifies several aspects to enable faster iteration:\n",
14
  "\n",
15
  "- **Data splits**: Uses simple chromosome-based train/val/test splits (e.g., assigning entire chromosomes to each split) instead of more complex region-based splits\n",
16
  "- **Random sequence sampling**: The dataset randomly samples sequences from chromosomes/regions on-the-fly, rather than using pre-computed sliding windows\n",
@@ -30,24 +30,417 @@
30
  "\n",
31
  "If you're interested in using pre-trained models for inference without fine-tuning, or exploring different model architectures, please refer to other notebooks in this collection. This notebook focuses specifically on the simplified fine-tuning process, which is useful when you want to quickly adapt a pre-trained model to your specific genomic tracks or improve performance on particular cell types or experimental conditions.\n",
32
  "\n",
33
- "\ud83d\udcdd Note for Google Colab users: This notebook is compatible with Colab! For faster training, make sure to enable GPU: Runtime \u2192 Change runtime type \u2192 GPU (T4 or better recommended).\n"
34
  ]
35
  },
36
  {
37
  "cell_type": "code",
38
- "execution_count": null,
39
  "metadata": {},
40
- "outputs": [],
41
  "source": [
42
  "# Install dependencies\n",
43
- "!pip install pyfaidx pyBigWig torchmetrics transformers plotly"
44
  ]
45
  },
46
  {
47
  "cell_type": "markdown",
48
  "metadata": {},
49
  "source": [
50
- "# 1. \ud83d\udce6 Imports + Configuration\n",
51
  "\n",
52
  "## Configuration Parameters\n",
53
  "\n",
@@ -81,43 +474,24 @@
81
  },
82
  {
83
  "cell_type": "code",
84
- "execution_count": null,
85
  "metadata": {},
86
- "outputs": [],
87
- "source": [
88
- "# 0. Imports\n",
89
- "import random\n",
90
- "import functools\n",
91
- "from typing import List, Dict, Callable\n",
92
- "import os\n",
93
- "import subprocess\n",
94
- "\n",
95
- "import torch\n",
96
- "import torch.nn as nn\n",
97
- "import torch.nn.functional as F\n",
98
- "from torch.utils.data import Dataset, DataLoader\n",
99
- "from torch.optim import AdamW\n",
100
- "from transformers import AutoConfig, AutoModelForMaskedLM, AutoTokenizer\n",
101
- "import numpy as np\n",
102
- "import pyBigWig\n",
103
- "from pyfaidx import Fasta\n",
104
- "from torchmetrics import PearsonCorrCoef\n",
105
- "import plotly.graph_objects as go\n",
106
- "from IPython.display import display\n",
107
- "from tqdm import tqdm"
108
- ]
109
- },
110
- {
111
- "cell_type": "code",
112
- "execution_count": null,
113
- "metadata": {},
114
- "outputs": [],
115
  "source": [
116
  "config = {\n",
117
  " # Model\n",
118
  " \"model_name\": \"InstaDeepAI/NTv3_8M_pre\",\n",
119
  " \n",
120
- " # Data\n",
121
  " \"data_cache_dir\": \"./data\",\n",
122
  " \"fasta_url\": \"https://hgdownload.gi.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz\",\n",
123
  " \"bigwig_url_list\": [\n",
@@ -152,25 +526,8 @@
152
  "\n",
153
  "os.makedirs(config[\"data_cache_dir\"], exist_ok=True)\n",
154
  "\n",
155
- "# Extract filenames from URLs\n",
156
- "def extract_filename_from_url(url: str) -> str:\n",
157
- " \"\"\"Extract filename from URL, handling query parameters.\"\"\"\n",
158
- " # Remove query parameters if present\n",
159
- " url_clean = url.split('?')[0]\n",
160
- " # Get the last part of the URL path\n",
161
- " return url_clean.split('/')[-1]\n",
162
- "\n",
163
- "# Create paths for downloaded files\n",
164
- "fasta_path = os.path.join(config[\"data_cache_dir\"], extract_filename_from_url(config[\"fasta_url\"]).replace('.gz', ''))\n",
165
- "bigwig_path_list = [\n",
166
- " os.path.join(config[\"data_cache_dir\"], extract_filename_from_url(url))\n",
167
- " for url in config[\"bigwig_url_list\"]\n",
168
- "]\n",
169
- "\n",
170
  "# Create bigwig_file_ids from filenames (without extension)\n",
171
  "config[\"bigwig_file_ids\"] = [\n",
172
- " # os.path.splitext(extract_filename_from_url(url))[0]\n",
173
- " # for url in config[\"bigwig_url_list\"]\n",
174
  " \"ENCSR325NFE\",\n",
175
  " \"ENCSR962OTG\",\n",
176
  " \"ENCSR619DQO_P\",\n",
@@ -190,67 +547,7 @@
190
  "cell_type": "markdown",
191
  "metadata": {},
192
  "source": [
193
- "# 2. \ud83d\udce5 Genome & Tracks Data Download\n",
194
- "\n",
195
- "Download the reference genome FASTA file and BigWig signal tracks from public repositories. These files contain the genomic sequences and experimental signal data (e.g., ChIP-seq, ATAC-seq) that we'll use for training."
196
- ]
197
- },
198
- {
199
- "cell_type": "code",
200
- "execution_count": null,
201
- "metadata": {},
202
- "outputs": [],
203
- "source": [
204
- "# Download fasta file\n",
205
- "fasta_filename = extract_filename_from_url(config[\"fasta_url\"])\n",
206
- "fasta_gz_path = os.path.join(config[\"data_cache_dir\"], fasta_filename)\n",
207
- "\n",
208
- "print(f\"Downloading {fasta_filename}...\")\n",
209
- "subprocess.run([\"wget\", \"-c\", config[\"fasta_url\"], \"-O\", fasta_gz_path], check=True)\n",
210
- "\n",
211
- "print(f\"Extracting {fasta_filename}...\")\n",
212
- "subprocess.run([\"gunzip\", \"-f\", fasta_gz_path], check=True)"
213
- ]
214
- },
215
- {
216
- "cell_type": "code",
217
- "execution_count": null,
218
- "metadata": {},
219
- "outputs": [],
220
- "source": [
221
- "# Download bigwig files\n",
222
- "for bigwig_url in config[\"bigwig_url_list\"]:\n",
223
- " filename = extract_filename_from_url(bigwig_url)\n",
224
- " filepath = os.path.join(config[\"data_cache_dir\"], filename)\n",
225
- " print(f\"Downloading {filename}...\")\n",
226
- " subprocess.run([\"wget\", \"-c\", bigwig_url, \"-O\", filepath], check=True)"
227
- ]
228
- },
229
- {
230
- "cell_type": "markdown",
231
- "metadata": {},
232
- "source": [
233
- "## Data Splits Definition"
234
- ]
235
- },
236
- {
237
- "cell_type": "code",
238
- "execution_count": null,
239
- "metadata": {},
240
- "outputs": [],
241
- "source": [
242
- "chrom_splits = {\n",
243
- " \"train\": [f\"chr{i}\" for i in range(1, 21)] + ['chrX', 'chrY'],\n",
244
- " \"val\": ['chr22'],\n",
245
- " \"test\": ['chr21']\n",
246
- "}"
247
- ]
248
- },
249
- {
250
- "cell_type": "markdown",
251
- "metadata": {},
252
- "source": [
253
- "# 3. \ud83e\udde0 Model and tokenizer setup\n",
254
  " \n",
255
  "In this section, we set up the model and tokenizer. \n",
256
  " \n",
@@ -260,12 +557,12 @@
260
  "This linear head is trained for regression on a set of genomic tracks, \n",
261
  "allowing the model to make predictions for each track at single nucleotide resolution.\n",
262
  " \n",
263
- "The following code wraps the HuggingFace model together with this regression head for the end-to-end task.\n"
264
  ]
265
  },
266
  {
267
  "cell_type": "code",
268
- "execution_count": null,
269
  "metadata": {},
270
  "outputs": [],
271
  "source": [
@@ -333,9 +630,19 @@
333
  },
334
  {
335
  "cell_type": "code",
336
- "execution_count": null,
337
  "metadata": {},
338
- "outputs": [],
339
  "source": [
340
  "# Load tokenizer\n",
341
  "tokenizer = AutoTokenizer.from_pretrained(config[\"model_name\"], trust_remote_code=True)\n",
@@ -351,341 +658,84 @@
351
  "\n",
352
  "print(f\"Model loaded: {config['model_name']}\")\n",
353
  "print(f\"Number of bigwig tracks: {len(config['bigwig_file_ids'])}\")\n",
354
- "print(f\"Model parameters: {sum(p.numel() for p in model.parameters()):,}\")"
355
- ]
356
- },
357
- {
358
- "cell_type": "code",
359
- "execution_count": null,
360
- "metadata": {},
361
- "outputs": [],
362
- "source": [
363
- "# Scaling functions for targets\n",
364
- "def compute_chromosome_stats(track_data: np.ndarray) -> dict:\n",
365
- " \"\"\"\n",
366
- " Compute minimal statistics needed for weighted mean computation.\n",
367
- " \n",
368
- " Args:\n",
369
- " track_data: numpy array of track values for a chromosome\n",
370
- " \n",
371
- " Returns:\n",
372
- " Dictionary with statistics: sum, mean, total_count\n",
373
- " \"\"\"\n",
374
- " track_data = track_data.astype(np.float32)\n",
375
- " \n",
376
- " # Compute statistics\n",
377
- " sum_all = np.sum(track_data)\n",
378
- " total_count = track_data.size\n",
379
- " mean_all = sum_all / total_count if total_count > 0 else 0.0\n",
380
- " \n",
381
- " return {\n",
382
- " \"sum\": sum_all,\n",
383
- " \"mean\": mean_all,\n",
384
- " \"total_count\": total_count,\n",
385
- " }\n",
386
- "\n",
387
- "\n",
388
- "def aggregate_file_statistics(chr_stats_list: List[dict]) -> dict:\n",
389
- " \"\"\"\n",
390
- " Aggregate chromosome-level statistics into file-level statistics.\n",
391
- " \n",
392
- " Args:\n",
393
- " chr_stats_list: List of dictionaries, each containing chromosome-level statistics\n",
394
- " \n",
395
- " Returns:\n",
396
- " Dictionary with aggregated file-level statistics (only mean)\n",
397
- " \"\"\"\n",
398
- " # Convert to arrays for easier computation\n",
399
- " total_counts = np.array([s[\"total_count\"] for s in chr_stats_list], dtype=np.int64)\n",
400
- " means = np.array([s[\"mean\"] for s in chr_stats_list], dtype=np.float32)\n",
401
- " sums = np.array([s[\"sum\"] for s in chr_stats_list], dtype=np.float32)\n",
402
- " \n",
403
- " # Aggregate total count\n",
404
- " total_count = np.sum(total_counts)\n",
405
- " \n",
406
- " # Weighted mean: mean = sum(mean_chr * total_count_chr) / sum(total_count_chr)\n",
407
- " mean = np.sum(means * total_counts) / total_count if total_count > 0 else 0.0\n",
408
- " \n",
409
- " return {\n",
410
- " \"total_count\": total_count,\n",
411
- " \"sum\": np.sum(sums),\n",
412
- " \"mean\": mean,\n",
413
- " }\n",
414
- "\n",
415
- "\n",
416
- "def get_track_means(bigwig_tracks_list: List[pyBigWig.pyBigWig]) -> np.ndarray:\n",
417
- " \"\"\"\n",
418
- " Get track means for normalization.\n",
419
- " Computes statistics per chromosome and aggregates using weighted averaging,\n",
420
- " \n",
421
- " Args:\n",
422
- " bigwig_tracks_list: List of pyBigWig file objects\n",
423
- " \n",
424
- " Returns:\n",
425
- " Array of track means, one per bigwig file\n",
426
- " \"\"\"\n",
427
- " track_means = []\n",
428
- " \n",
429
- " for bigwig_track in bigwig_tracks_list:\n",
430
- " chrom_lengths = bigwig_track.chroms()\n",
431
- " all_chr_stats = []\n",
432
- " \n",
433
- " # Compute statistics for each chromosome\n",
434
- " for chrom_name, chrom_length in chrom_lengths.items():\n",
435
- " try:\n",
436
- " # Get chromosome data as numpy array\n",
437
- " bw_array = np.array(\n",
438
- " bigwig_track.values(chrom_name, 0, chrom_length, numpy=True),\n",
439
- " dtype=np.float32\n",
440
- " )\n",
441
- " # Replace NaN with 0\n",
442
- " bw_array = np.nan_to_num(bw_array, nan=0.0)\n",
443
- " \n",
444
- " # Compute chromosome-level statistics\n",
445
- " chr_stats = compute_chromosome_stats(bw_array)\n",
446
- " all_chr_stats.append(chr_stats)\n",
447
- " except Exception as e:\n",
448
- " # Skip chromosomes that fail to load\n",
449
- " print(f\"Warning: Failed to load chromosome {chrom_name}: {e}\")\n",
450
- " continue\n",
451
- " \n",
452
- " if not all_chr_stats:\n",
453
- " raise ValueError(f\"No valid chromosomes found for bigwig track\")\n",
454
- " \n",
455
- " # Aggregate chromosome-level stats into file-level stats\n",
456
- " file_stats = aggregate_file_statistics(all_chr_stats)\n",
457
- " \n",
458
- " # Use the weighted mean for normalization\n",
459
- " track_means.append(file_stats[\"mean\"])\n",
460
- " \n",
461
- " return np.array(track_means, dtype=np.float32)\n",
462
- "\n",
463
- "\n",
464
- "def create_targets_scaling_fn(bigwig_path_list: List[str]) -> Callable[[torch.Tensor], torch.Tensor]:\n",
465
- " \"\"\"\n",
466
- " Build a scaling function based on track means computed from bigwig files.\n",
467
- " \n",
468
- " Opens bigwig files, computes track statistics, and creates a transform function.\n",
469
- " The statistics are computed once and reused for all calls to the returned transform function.\n",
470
- " \n",
471
- " Args:\n",
472
- " bigwig_path_list: List of paths to bigwig files\n",
473
- " \n",
474
- " Returns:\n",
475
- " Transform function that scales input tensors\n",
476
- " \"\"\"\n",
477
- " # Open bigwig files and compute track statistics\n",
478
- " print(\"Computing track statistics (this may take a while)...\")\n",
479
- " bw_list = [\n",
480
- " pyBigWig.open(bigwig_path)\n",
481
- " for bigwig_path in bigwig_path_list\n",
482
- " ]\n",
483
- " track_means = get_track_means(bw_list)\n",
484
- " print(f\"Computed track means: {track_means}\")\n",
485
- " print(f\"Track means shape: {track_means.shape}\")\n",
486
- " \n",
487
- " # Create tensor from computed means\n",
488
- " track_means_tensor = torch.tensor(track_means, dtype=torch.float32)\n",
489
- " \n",
490
- " def transform_fn(x: torch.Tensor) -> torch.Tensor:\n",
491
- " \"\"\"\n",
492
- " x: torch.Tensor, shape (seq_len, num_tracks) or (batch, seq_len, num_tracks)\n",
493
- " \"\"\"\n",
494
- " # Move constants to correct device then normalize\n",
495
- " means = track_means_tensor.to(x.device)\n",
496
- " scaled = x / means\n",
497
- "\n",
498
- " # Smooth clipping: if > 10, apply formula\n",
499
- " clipped = torch.where(\n",
500
- " scaled > 10.0,\n",
501
- " 2.0 * torch.sqrt(scaled * 10.0) - 10.0,\n",
502
- " scaled,\n",
503
- " )\n",
504
- " return clipped\n",
505
- " \n",
506
- " return transform_fn"
507
  ]
508
  },
509
  {
510
  "cell_type": "markdown",
511
  "metadata": {},
512
  "source": [
513
- "# 4. \ud83d\udd04 Data loading\n",
514
  "\n",
515
- "Create PyTorch datasets and data loaders that efficiently sample random genomic windows from the reference genome and extract corresponding BigWig signal values. The dataset handles sequence tokenization, target scaling, and chromosome-based train/val/test splits."
516
  ]
517
  },
518
  {
519
  "cell_type": "code",
520
- "execution_count": null,
521
  "metadata": {},
522
- "outputs": [],
523
  "source": [
524
- "# Process-local cache for BigWig file handles (one per worker process)\n",
525
- "# This allows safe multi-worker DataLoader usage\n",
526
- "_bigwig_cache = {} # Maps (process_id, file_path) -> pyBigWig handle\n",
527
- "\n",
528
- "\n",
529
- "def _get_bigwig_handle(bigwig_path: str) -> pyBigWig.pyBigWig:\n",
530
- " \"\"\"Get or create a BigWig file handle for the current process.\"\"\"\n",
531
- " process_id = os.getpid()\n",
532
- " cache_key = (process_id, bigwig_path)\n",
533
- " \n",
534
- " if cache_key not in _bigwig_cache:\n",
535
- " _bigwig_cache[cache_key] = pyBigWig.open(bigwig_path)\n",
536
- " \n",
537
- " return _bigwig_cache[cache_key]\n",
538
- "\n",
539
- "\n",
540
- "class GenomeBigWigDataset(Dataset):\n",
541
- " \"\"\"\n",
542
- " Random genomic windows from a reference genome + bigWig signal.\n",
543
- "\n",
544
- " Each sample:\n",
545
- " - picks a chromosome/region (from `chroms` or `regions`),\n",
546
- " - picks a random window of length `sequence_length`,\n",
547
- " - returns (sequence, signal, chrom, start, end).\n",
548
- "\n",
549
- " This dataset is compatible with multi-worker DataLoaders. BigWig files\n",
550
- " are opened lazily using a process-local cache, ensuring each worker process\n",
551
- " has its own file handles and avoiding concurrent access issues.\n",
552
- "\n",
553
- " Args\n",
554
- " ----\n",
555
- " fasta_path : str\n",
556
- " Path to the reference genome FASTA (e.g. hg38.fna).\n",
557
- " bigwig_path_list : str\n",
558
- " Path to the bigWig file (e.g. ENCFF884LDL.bigWig).\n",
559
- " chroms : List[str]\n",
560
- " Chromosome names as they appear in the bigWig (e.g. [\"chr1\", \"chr2\", ...]).\n",
561
- " Used for backward compatibility or when regions=None.\n",
562
- " sequence_length : int\n",
563
- " Length of each random window (in bp).\n",
564
- " num_samples : int\n",
565
- " Number of samples the dataset will provide (len(dataset)).\n",
566
- " tokenizer : AutoTokenizer\n",
567
- " Tokenizer to use for tokenization.\n",
568
- " transform_fn : Callable\n",
569
- " Function to transform/scaling bigwig targets.\n",
570
- " keep_target_center_fraction : float\n",
571
- " Fraction of center sequence to keep for target prediction (crops edges to focus on center).\n",
572
- " regions : List[tuple[str, int, int]] | None\n",
573
- " Optional list of regions as (chromosome, start, end) tuples.\n",
574
- " If provided, samples are drawn randomly from within these regions only.\n",
575
- " This matches the JAX pipeline approach using BED file splits.\n",
576
- " If None, samples from entire chromosomes in `chroms`.\n",
577
- " \"\"\"\n",
578
- "\n",
579
- " def __init__(\n",
580
- " self,\n",
581
- " fasta_path: str,\n",
582
- " bigwig_path_list: list[str],\n",
583
- " chroms: List[str],\n",
584
- " sequence_length: int,\n",
585
- " num_samples: int,\n",
586
- " tokenizer: AutoTokenizer,\n",
587
- " transform_fn: Callable[[torch.Tensor], torch.Tensor],\n",
588
- " keep_target_center_fraction: float = 1.0,\n",
589
- " ):\n",
590
- " super().__init__()\n",
591
- "\n",
592
- " self.fasta = Fasta(fasta_path, as_raw=True, sequence_always_upper=True)\n",
593
- " # Store paths instead of opening files immediately (for multi-worker compatibility)\n",
594
- " self.bigwig_path_list = bigwig_path_list\n",
595
- " self.sequence_length = sequence_length\n",
596
- " self.num_samples = num_samples\n",
597
- " self.tokenizer = tokenizer\n",
598
- " self.transform_fn = transform_fn # Use pre-computed transform function\n",
599
- " self.keep_target_center_fraction = keep_target_center_fraction\n",
600
- " self.chroms = chroms\n",
601
- "\n",
602
- " # Get chromosome lengths from first BigWig file (lazy, cached per process)\n",
603
- " # We need this for validation, so open temporarily\n",
604
- " bw_handle = _get_bigwig_handle(bigwig_path_list[0])\n",
605
- " bw_chrom_lengths = bw_handle.chroms() # dict: chrom -> length\n",
606
- "\n",
607
- " self.valid_chroms = []\n",
608
- " self.chrom_lengths = {}\n",
609
- "\n",
610
- " for c in chroms:\n",
611
- " if c not in bw_chrom_lengths or c not in self.fasta:\n",
612
- " continue\n",
613
- "\n",
614
- " fa_len = len(self.fasta[c])\n",
615
- " bw_len = bw_chrom_lengths[c]\n",
616
- " L = min(fa_len, bw_len)\n",
617
- "\n",
618
- " if L > self.sequence_length:\n",
619
- " self.valid_chroms.append(c)\n",
620
- " self.chrom_lengths[c] = L\n",
621
- "\n",
622
- " if not self.valid_chroms:\n",
623
- " raise ValueError(\"No valid chromosomes after intersecting FASTA and bigWig.\")\n",
624
- "\n",
625
- " def __len__(self):\n",
626
- " return self.num_samples\n",
627
- "\n",
628
- " def __getitem__(self, idx):\n",
629
- "\n",
630
- " # Sample from entire chromosomes\n",
631
- " chrom = random.choice(self.valid_chroms)\n",
632
- " chrom_len = self.chrom_lengths[chrom]\n",
633
- " max_start = chrom_len - self.sequence_length\n",
634
- " start = random.randint(0, max_start)\n",
635
- " end = start + self.sequence_length\n",
636
- "\n",
637
- " # Sequence\n",
638
- " seq = self.fasta[chrom][start:end] # string slice\n",
639
- " # Tokenize with padding and truncation to ensure consistent lengths for batching\n",
640
- " tokenized = self.tokenizer(\n",
641
- " seq,\n",
642
- " padding=\"max_length\",\n",
643
- " truncation=True,\n",
644
- " max_length=self.sequence_length,\n",
645
- " return_tensors=\"pt\",\n",
646
- " )\n",
647
- " tokens = tokenized[\"input_ids\"][0] # Shape: (max_length,)\n",
648
- "\n",
649
- " # Signal from bigWig tracks (numpy array) -> torch tensor\n",
650
- " # Get BigWig handles lazily (cached per worker process)\n",
651
- " bigwig_targets = np.array([\n",
652
- " _get_bigwig_handle(bw_path).values(chrom, start, end, numpy=True)\n",
653
- " for bw_path in self.bigwig_path_list\n",
654
- " ]) # shape (num_tracks, seq_len)\n",
655
- " # Transpose to (seq_len, num_tracks)\n",
656
- " bigwig_targets = bigwig_targets.T\n",
657
- " # pyBigWig returns NaN where no data; turn NaN into 0\n",
658
- " bigwig_targets = torch.tensor(bigwig_targets, dtype=torch.float32)\n",
659
- " bigwig_targets = torch.nan_to_num(bigwig_targets, nan=0.0)\n",
660
- " \n",
661
- " # Crop targets to center fraction\n",
662
- " if self.keep_target_center_fraction < 1.0:\n",
663
- " seq_len = bigwig_targets.shape[0] # First dimension is sequence length\n",
664
- " target_offset = int(seq_len * (1 - self.keep_target_center_fraction) // 2)\n",
665
- " target_length = seq_len - 2 * target_offset\n",
666
- " bigwig_targets = bigwig_targets[target_offset:target_offset + target_length, :]\n",
667
  "\n",
668
- " # Apply scaling to targets\n",
669
- " bigwig_targets = self.transform_fn(bigwig_targets)\n",
 
670
  "\n",
671
- " sample = {\n",
672
- " \"tokens\": tokens,\n",
673
- " \"bigwig_targets\": bigwig_targets,\n",
674
- " \"chrom\": chrom,\n",
675
- " \"start\": start,\n",
676
- " \"end\": end,\n",
677
- " }\n",
678
- " return sample"
679
- ]
680
- },
681
- {
682
- "cell_type": "code",
683
- "execution_count": null,
684
- "metadata": {},
685
- "outputs": [],
686
- "source": [
687
- "# Create scaling function\n",
688
- "targets_transform_fn = create_targets_scaling_fn(bigwig_path_list)"
689
  ]
690
  },
691
  {
@@ -694,74 +744,92 @@
694
  "metadata": {},
695
  "outputs": [],
696
  "source": [
697
- "# Create datasets & dataloaders\n",
698
- "create_dataset_fn = functools.partial(\n",
699
- " GenomeBigWigDataset,\n",
700
- " fasta_path=fasta_path,\n",
701
- " bigwig_path_list=bigwig_path_list,\n",
702
- " sequence_length=config[\"sequence_length\"],\n",
703
- " tokenizer=tokenizer,\n",
704
- " transform_fn=targets_transform_fn,\n",
705
- " keep_target_center_fraction=config[\"keep_target_center_fraction\"],\n",
706
- ")\n",
707
- "\n",
708
- "train_dataset = create_dataset_fn(\n",
709
- " chroms=chrom_splits[\"train\"],\n",
710
- " num_samples=config[\"num_steps_training\"] * config[\"batch_size\"],\n",
711
- ")\n",
712
- "\n",
713
- "val_dataset = create_dataset_fn(\n",
714
- " chroms=chrom_splits[\"val\"],\n",
715
- " num_samples=config[\"num_validation_samples\"],\n",
716
- ")\n",
717
  "\n",
718
- "test_dataset = create_dataset_fn(\n",
719
- " chroms=chrom_splits[\"test\"],\n",
720
- " num_samples=config[\"num_test_samples\"],\n",
721
  ")\n",
722
  "\n",
723
- "# Create dataloaders\n",
724
- "train_loader = DataLoader(\n",
725
- " train_dataset,\n",
726
- " batch_size=config[\"batch_size\"],\n",
727
- " shuffle=True,\n",
728
- " num_workers=config[\"num_workers\"],\n",
729
- ")\n",
730
  "\n",
731
- "val_loader = DataLoader(\n",
732
- " val_dataset,\n",
733
- " batch_size=config[\"batch_size\"],\n",
734
- " shuffle=False,\n",
735
- " num_workers=config[\"num_workers\"],\n",
736
- ")\n",
737
  "\n",
738
- "test_loader = DataLoader(\n",
739
- " test_dataset,\n",
740
- " batch_size=config[\"batch_size\"],\n",
741
- " shuffle=False,\n",
742
- " num_workers=config[\"num_workers\"],\n",
743
- ")\n",
744
  "\n",
745
- "print(f\"Train samples: {len(train_dataset)}\")\n",
746
- "print(f\"Val samples: {len(val_dataset)}\")\n",
747
- "print(f\"Test samples: {len(test_dataset)}\")"
748
  ]
749
  },
750
  {
751
  "cell_type": "markdown",
752
  "metadata": {},
753
  "source": [
754
- "# 5. \u2699\ufe0f Optimizer setup\n",
755
  "\n",
756
- "Configure the AdamW optimizer with learning rate and weight decay hyperparameters. This optimizer will update the model parameters during training to minimize the loss function.\n",
757
- "\n"
758
  ]
759
  },
760
  {
761
  "cell_type": "code",
762
- "execution_count": null,
763
  "metadata": {},
764
- "outputs": [],
765
  "source": [
766
  "# Training setup\n",
767
  "print(f\"Training configuration:\")\n",
@@ -785,14 +853,14 @@
785
  "cell_type": "markdown",
786
  "metadata": {},
787
  "source": [
788
- "# 6. \ud83d\udcca Metrics setup\n",
789
  "\n",
790
  "Set up evaluation metrics to track model performance during training and validation. We use Pearson correlation coefficients to measure how well the predicted BigWig signals match the ground truth signals."
791
  ]
792
  },
793
  {
794
  "cell_type": "code",
795
- "execution_count": null,
796
  "metadata": {},
797
  "outputs": [],
798
  "source": [
@@ -865,7 +933,7 @@
865
  },
866
  {
867
  "cell_type": "code",
868
- "execution_count": null,
869
  "metadata": {},
870
  "outputs": [],
871
  "source": [
@@ -878,14 +946,14 @@
878
  "cell_type": "markdown",
879
  "metadata": {},
880
  "source": [
881
- "# 7. \ud83d\udcc9 Loss functions\n",
882
  "\n",
883
  "Define the Poisson-Multinomial loss function that captures both the scale (total signal) and shape (distribution) of BigWig tracks. This loss is specifically designed for count-based genomic signal data."
884
  ]
885
  },
886
  {
887
  "cell_type": "code",
888
- "execution_count": null,
889
  "metadata": {},
890
  "outputs": [],
891
  "source": [
@@ -960,14 +1028,14 @@
960
  "cell_type": "markdown",
961
  "metadata": {},
962
  "source": [
963
- "# 8. \ud83c\udfc3 Training loop\n",
964
  "\n",
965
  "Run the main training loop that iterates through batches, computes gradients, and updates model parameters. The loop includes periodic validation checks and real-time metric visualization to monitor training progress."
966
  ]
967
  },
968
  {
969
  "cell_type": "code",
970
- "execution_count": null,
971
  "metadata": {},
972
  "outputs": [],
973
  "source": [
@@ -1035,7 +1103,7 @@
1035
  },
1036
  {
1037
  "cell_type": "code",
1038
- "execution_count": null,
1039
  "metadata": {},
1040
  "outputs": [],
1041
  "source": [
@@ -1192,7 +1260,7 @@
1192
  "cell_type": "markdown",
1193
  "metadata": {},
1194
  "source": [
1195
- "# 9. \ud83e\uddea Test evaluation\n",
1196
  "\n",
1197
  "Evaluate the fine-tuned model on the held-out test set to assess final performance. This provides an unbiased estimate of how well the model generalizes to unseen genomic regions."
1198
  ]
@@ -1270,4 +1338,4 @@
1270
  },
1271
  "nbformat": 4,
1272
  "nbformat_minor": 2
1273
- }
 
4
  "cell_type": "markdown",
5
  "metadata": {},
6
  "source": [
7
+ "# 🧬 Fine-Tuning a Model on BigWig Tracks Prediction\n",
8
  "\n",
9
  "This notebook demonstrates a **simplified fine-tuning setup** that enables training of a pre-trained Nucleotide Transformer v3 (NTv3) model to predict BigWig signal tracks directly from DNA sequences. The streamlined approach leverages a pre-trained NTv3 backbone as a feature extractor and adds a custom prediction head that outputs single-nucleotide resolution signal values for various genomic tracks (e.g., ChIP-seq, ATAC-seq, RNA-seq).\n",
10
  "\n",
11
+ "**⚡ Key Advantage**: This simplified pipeline achieves **close performance to more complex training approaches** while enabling **relatively fast fine-tuning in approximately one hour**. The setup is designed for rapid experimentation and iteration, making it ideal for adapting pre-trained models to your specific genomic tracks or experimental conditions without the overhead of complex distributed training infrastructure.\n",
12
  "\n",
13
+ "**🔧 Main Simplifications**: Compared to the full supervised tracks pipeline, this notebook simplifies several aspects to enable faster iteration:\n",
14
  "\n",
15
  "- **Data splits**: Uses simple chromosome-based train/val/test splits (e.g., assigning entire chromosomes to each split) instead of more complex region-based splits\n",
16
  "- **Random sequence sampling**: The dataset randomly samples sequences from chromosomes/regions on-the-fly, rather than using pre-computed sliding windows\n",
 
30
  "\n",
31
  "If you're interested in using pre-trained models for inference without fine-tuning, or exploring different model architectures, please refer to other notebooks in this collection. This notebook focuses specifically on the simplified fine-tuning process, which is useful when you want to quickly adapt a pre-trained model to your specific genomic tracks or improve performance on particular cell types or experimental conditions.\n",
32
  "\n",
33
+ "📝 Note for Google Colab users: This notebook is compatible with Colab! For faster training, make sure to enable GPU: Runtime Change runtime type GPU (T4 or better recommended).\n"
34
+ ]
35
+ },
36
+ {
37
+ "cell_type": "markdown",
38
+ "metadata": {},
39
+ "source": [
40
+ "# 0. 📦 Imports"
41
  ]
42
  },
43
  {
44
  "cell_type": "code",
45
+ "execution_count": 1,
46
  "metadata": {},
47
+ "outputs": [
48
+ {
49
+ "name": "stdout",
50
+ "output_type": "stream",
51
+ "text": [
52
+ "Collecting datasets\n",
53
+ " Downloading datasets-4.4.1-py3-none-any.whl.metadata (19 kB)\n",
54
+ "Requirement already satisfied: transformers in /home/y-bornachot/venvs/ntv3-env/lib/python3.12/site-packages (4.57.3)\n",
55
+ "Requirement already satisfied: torchmetrics in /home/y-bornachot/venvs/ntv3-env/lib/python3.12/site-packages (1.8.2)\n",
56
+ "Requirement already satisfied: plotly in /home/y-bornachot/venvs/ntv3-env/lib/python3.12/site-packages (6.5.0)\n",
57
+ "Requirement already satisfied: filelock in /home/y-bornachot/venvs/ntv3-env/lib/python3.12/site-packages (from datasets) (3.20.0)\n",
58
+ "Requirement already satisfied: numpy>=1.17 in /home/y-bornachot/venvs/ntv3-env/lib/python3.12/site-packages (from datasets) (2.3.5)\n",
59
+ "Collecting pyarrow>=21.0.0 (from datasets)\n",
60
+ " Downloading pyarrow-22.0.0-cp312-cp312-manylinux_2_28_x86_64.whl.metadata (3.2 kB)\n",
61
+ "Collecting dill<0.4.1,>=0.3.0 (from datasets)\n",
62
+ " Downloading dill-0.4.0-py3-none-any.whl.metadata (10 kB)\n",
63
+ "Collecting pandas (from datasets)\n",
64
+ " Downloading pandas-2.3.3-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.metadata (91 kB)\n",
65
+ "\u001b[2K \u001b[38;2;114;156;31m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m91.2/91.2 kB\u001b[0m \u001b[31m3.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
66
+ "\u001b[?25hRequirement already satisfied: requests>=2.32.2 in /home/y-bornachot/venvs/ntv3-env/lib/python3.12/site-packages (from datasets) (2.32.5)\n",
67
+ "Collecting httpx<1.0.0 (from datasets)\n",
68
+ " Using cached httpx-0.28.1-py3-none-any.whl.metadata (7.1 kB)\n",
69
+ "Requirement already satisfied: tqdm>=4.66.3 in /home/y-bornachot/venvs/ntv3-env/lib/python3.12/site-packages (from datasets) (4.67.1)\n",
70
+ "Collecting xxhash (from datasets)\n",
71
+ " Downloading xxhash-3.6.0-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl.metadata (13 kB)\n",
72
+ "Collecting multiprocess<0.70.19 (from datasets)\n",
73
+ " Downloading multiprocess-0.70.18-py312-none-any.whl.metadata (7.5 kB)\n",
74
+ "Collecting fsspec<=2025.10.0,>=2023.1.0 (from fsspec[http]<=2025.10.0,>=2023.1.0->datasets)\n",
75
+ " Downloading fsspec-2025.10.0-py3-none-any.whl.metadata (10 kB)\n",
76
+ "Requirement already satisfied: huggingface-hub<2.0,>=0.25.0 in /home/y-bornachot/venvs/ntv3-env/lib/python3.12/site-packages (from datasets) (0.36.0)\n",
77
+ "Requirement already satisfied: packaging in /home/y-bornachot/venvs/ntv3-env/lib/python3.12/site-packages (from datasets) (25.0)\n",
78
+ "Requirement already satisfied: pyyaml>=5.1 in /home/y-bornachot/venvs/ntv3-env/lib/python3.12/site-packages (from datasets) (6.0.3)\n",
79
+ "Requirement already satisfied: regex!=2019.12.17 in /home/y-bornachot/venvs/ntv3-env/lib/python3.12/site-packages (from transformers) (2025.11.3)\n",
80
+ "Requirement already satisfied: tokenizers<=0.23.0,>=0.22.0 in /home/y-bornachot/venvs/ntv3-env/lib/python3.12/site-packages (from transformers) (0.22.1)\n",
81
+ "Requirement already satisfied: safetensors>=0.4.3 in /home/y-bornachot/venvs/ntv3-env/lib/python3.12/site-packages (from transformers) (0.7.0)\n",
82
+ "Requirement already satisfied: torch>=2.0.0 in /home/y-bornachot/venvs/ntv3-env/lib/python3.12/site-packages (from torchmetrics) (2.9.1)\n",
83
+ "Requirement already satisfied: lightning-utilities>=0.8.0 in /home/y-bornachot/venvs/ntv3-env/lib/python3.12/site-packages (from torchmetrics) (0.15.2)\n",
84
+ "Requirement already satisfied: narwhals>=1.15.1 in /home/y-bornachot/venvs/ntv3-env/lib/python3.12/site-packages (from plotly) (2.13.0)\n",
85
+ "Collecting aiohttp!=4.0.0a0,!=4.0.0a1 (from fsspec[http]<=2025.10.0,>=2023.1.0->datasets)\n",
86
+ " Downloading aiohttp-3.13.2-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl.metadata (8.1 kB)\n",
87
+ "Collecting anyio (from httpx<1.0.0->datasets)\n",
88
+ " Downloading anyio-4.12.0-py3-none-any.whl.metadata (4.3 kB)\n",
89
+ "Requirement already satisfied: certifi in /home/y-bornachot/venvs/ntv3-env/lib/python3.12/site-packages (from httpx<1.0.0->datasets) (2025.11.12)\n",
90
+ "Collecting httpcore==1.* (from httpx<1.0.0->datasets)\n",
91
+ " Using cached httpcore-1.0.9-py3-none-any.whl.metadata (21 kB)\n",
92
+ "Requirement already satisfied: idna in /home/y-bornachot/venvs/ntv3-env/lib/python3.12/site-packages (from httpx<1.0.0->datasets) (3.11)\n",
93
+ "Collecting h11>=0.16 (from httpcore==1.*->httpx<1.0.0->datasets)\n",
94
+ " Using cached h11-0.16.0-py3-none-any.whl.metadata (8.3 kB)\n",
95
+ "Requirement already satisfied: typing-extensions>=3.7.4.3 in /home/y-bornachot/venvs/ntv3-env/lib/python3.12/site-packages (from huggingface-hub<2.0,>=0.25.0->datasets) (4.15.0)\n",
96
+ "Requirement already satisfied: hf-xet<2.0.0,>=1.1.3 in /home/y-bornachot/venvs/ntv3-env/lib/python3.12/site-packages (from huggingface-hub<2.0,>=0.25.0->datasets) (1.2.0)\n",
97
+ "Requirement already satisfied: setuptools in /home/y-bornachot/venvs/ntv3-env/lib/python3.12/site-packages (from lightning-utilities>=0.8.0->torchmetrics) (80.9.0)\n",
98
+ "Requirement already satisfied: charset_normalizer<4,>=2 in /home/y-bornachot/venvs/ntv3-env/lib/python3.12/site-packages (from requests>=2.32.2->datasets) (3.4.4)\n",
99
+ "Requirement already satisfied: urllib3<3,>=1.21.1 in /home/y-bornachot/venvs/ntv3-env/lib/python3.12/site-packages (from requests>=2.32.2->datasets) (2.6.1)\n",
100
+ "Requirement already satisfied: sympy>=1.13.3 in /home/y-bornachot/venvs/ntv3-env/lib/python3.12/site-packages (from torch>=2.0.0->torchmetrics) (1.14.0)\n",
101
+ "Requirement already satisfied: networkx>=2.5.1 in /home/y-bornachot/venvs/ntv3-env/lib/python3.12/site-packages (from torch>=2.0.0->torchmetrics) (3.6.1)\n",
102
+ "Requirement already satisfied: jinja2 in /home/y-bornachot/venvs/ntv3-env/lib/python3.12/site-packages (from torch>=2.0.0->torchmetrics) (3.1.6)\n",
103
+ "Requirement already satisfied: nvidia-cuda-nvrtc-cu12==12.8.93 in /home/y-bornachot/venvs/ntv3-env/lib/python3.12/site-packages (from torch>=2.0.0->torchmetrics) (12.8.93)\n",
104
+ "Requirement already satisfied: nvidia-cuda-runtime-cu12==12.8.90 in /home/y-bornachot/venvs/ntv3-env/lib/python3.12/site-packages (from torch>=2.0.0->torchmetrics) (12.8.90)\n",
105
+ "Requirement already satisfied: nvidia-cuda-cupti-cu12==12.8.90 in /home/y-bornachot/venvs/ntv3-env/lib/python3.12/site-packages (from torch>=2.0.0->torchmetrics) (12.8.90)\n",
106
+ "Requirement already satisfied: nvidia-cudnn-cu12==9.10.2.21 in /home/y-bornachot/venvs/ntv3-env/lib/python3.12/site-packages (from torch>=2.0.0->torchmetrics) (9.10.2.21)\n",
107
+ "Requirement already satisfied: nvidia-cublas-cu12==12.8.4.1 in /home/y-bornachot/venvs/ntv3-env/lib/python3.12/site-packages (from torch>=2.0.0->torchmetrics) (12.8.4.1)\n",
108
+ "Requirement already satisfied: nvidia-cufft-cu12==11.3.3.83 in /home/y-bornachot/venvs/ntv3-env/lib/python3.12/site-packages (from torch>=2.0.0->torchmetrics) (11.3.3.83)\n",
109
+ "Requirement already satisfied: nvidia-curand-cu12==10.3.9.90 in /home/y-bornachot/venvs/ntv3-env/lib/python3.12/site-packages (from torch>=2.0.0->torchmetrics) (10.3.9.90)\n",
110
+ "Requirement already satisfied: nvidia-cusolver-cu12==11.7.3.90 in /home/y-bornachot/venvs/ntv3-env/lib/python3.12/site-packages (from torch>=2.0.0->torchmetrics) (11.7.3.90)\n",
111
+ "Requirement already satisfied: nvidia-cusparse-cu12==12.5.8.93 in /home/y-bornachot/venvs/ntv3-env/lib/python3.12/site-packages (from torch>=2.0.0->torchmetrics) (12.5.8.93)\n",
112
+ "Requirement already satisfied: nvidia-cusparselt-cu12==0.7.1 in /home/y-bornachot/venvs/ntv3-env/lib/python3.12/site-packages (from torch>=2.0.0->torchmetrics) (0.7.1)\n",
113
+ "Requirement already satisfied: nvidia-nccl-cu12==2.27.5 in /home/y-bornachot/venvs/ntv3-env/lib/python3.12/site-packages (from torch>=2.0.0->torchmetrics) (2.27.5)\n",
114
+ "Requirement already satisfied: nvidia-nvshmem-cu12==3.3.20 in /home/y-bornachot/venvs/ntv3-env/lib/python3.12/site-packages (from torch>=2.0.0->torchmetrics) (3.3.20)\n",
115
+ "Requirement already satisfied: nvidia-nvtx-cu12==12.8.90 in /home/y-bornachot/venvs/ntv3-env/lib/python3.12/site-packages (from torch>=2.0.0->torchmetrics) (12.8.90)\n",
116
+ "Requirement already satisfied: nvidia-nvjitlink-cu12==12.8.93 in /home/y-bornachot/venvs/ntv3-env/lib/python3.12/site-packages (from torch>=2.0.0->torchmetrics) (12.8.93)\n",
117
+ "Requirement already satisfied: nvidia-cufile-cu12==1.13.1.3 in /home/y-bornachot/venvs/ntv3-env/lib/python3.12/site-packages (from torch>=2.0.0->torchmetrics) (1.13.1.3)\n",
118
+ "Requirement already satisfied: triton==3.5.1 in /home/y-bornachot/venvs/ntv3-env/lib/python3.12/site-packages (from torch>=2.0.0->torchmetrics) (3.5.1)\n",
119
+ "Requirement already satisfied: python-dateutil>=2.8.2 in /home/y-bornachot/venvs/ntv3-env/lib/python3.12/site-packages (from pandas->datasets) (2.9.0.post0)\n",
120
+ "Collecting pytz>=2020.1 (from pandas->datasets)\n",
121
+ " Downloading pytz-2025.2-py2.py3-none-any.whl.metadata (22 kB)\n",
122
+ "Collecting tzdata>=2022.7 (from pandas->datasets)\n",
123
+ " Downloading tzdata-2025.3-py2.py3-none-any.whl.metadata (1.4 kB)\n",
124
+ "Collecting aiohappyeyeballs>=2.5.0 (from aiohttp!=4.0.0a0,!=4.0.0a1->fsspec[http]<=2025.10.0,>=2023.1.0->datasets)\n",
125
+ " Downloading aiohappyeyeballs-2.6.1-py3-none-any.whl.metadata (5.9 kB)\n",
126
+ "Collecting aiosignal>=1.4.0 (from aiohttp!=4.0.0a0,!=4.0.0a1->fsspec[http]<=2025.10.0,>=2023.1.0->datasets)\n",
127
+ " Downloading aiosignal-1.4.0-py3-none-any.whl.metadata (3.7 kB)\n",
128
+ "Collecting attrs>=17.3.0 (from aiohttp!=4.0.0a0,!=4.0.0a1->fsspec[http]<=2025.10.0,>=2023.1.0->datasets)\n",
129
+ " Downloading attrs-25.4.0-py3-none-any.whl.metadata (10 kB)\n",
130
+ "Collecting frozenlist>=1.1.1 (from aiohttp!=4.0.0a0,!=4.0.0a1->fsspec[http]<=2025.10.0,>=2023.1.0->datasets)\n",
131
+ " Downloading frozenlist-1.8.0-cp312-cp312-manylinux1_x86_64.manylinux_2_28_x86_64.manylinux_2_5_x86_64.whl.metadata (20 kB)\n",
132
+ "Collecting multidict<7.0,>=4.5 (from aiohttp!=4.0.0a0,!=4.0.0a1->fsspec[http]<=2025.10.0,>=2023.1.0->datasets)\n",
133
+ " Downloading multidict-6.7.0-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl.metadata (5.3 kB)\n",
134
+ "Collecting propcache>=0.2.0 (from aiohttp!=4.0.0a0,!=4.0.0a1->fsspec[http]<=2025.10.0,>=2023.1.0->datasets)\n",
135
+ " Downloading propcache-0.4.1-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl.metadata (13 kB)\n",
136
+ "Collecting yarl<2.0,>=1.17.0 (from aiohttp!=4.0.0a0,!=4.0.0a1->fsspec[http]<=2025.10.0,>=2023.1.0->datasets)\n",
137
+ " Downloading yarl-1.22.0-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl.metadata (75 kB)\n",
138
+ "\u001b[2K \u001b[38;2;114;156;31m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m75.1/75.1 kB\u001b[0m \u001b[31m23.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
139
+ "\u001b[?25hRequirement already satisfied: six>=1.5 in /home/y-bornachot/venvs/ntv3-env/lib/python3.12/site-packages (from python-dateutil>=2.8.2->pandas->datasets) (1.17.0)\n",
140
+ "Requirement already satisfied: mpmath<1.4,>=1.1.0 in /home/y-bornachot/venvs/ntv3-env/lib/python3.12/site-packages (from sympy>=1.13.3->torch>=2.0.0->torchmetrics) (1.3.0)\n",
141
+ "Requirement already satisfied: MarkupSafe>=2.0 in /home/y-bornachot/venvs/ntv3-env/lib/python3.12/site-packages (from jinja2->torch>=2.0.0->torchmetrics) (3.0.3)\n",
142
+ "Downloading datasets-4.4.1-py3-none-any.whl (511 kB)\n",
143
+ "\u001b[2K \u001b[38;2;114;156;31m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m511.6/511.6 kB\u001b[0m \u001b[31m32.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
144
+ "\u001b[?25hDownloading dill-0.4.0-py3-none-any.whl (119 kB)\n",
145
+ "\u001b[2K \u001b[38;2;114;156;31m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m119.7/119.7 kB\u001b[0m \u001b[31m19.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
146
+ "\u001b[?25hDownloading fsspec-2025.10.0-py3-none-any.whl (200 kB)\n",
147
+ "\u001b[2K \u001b[38;2;114;156;31m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m201.0/201.0 kB\u001b[0m \u001b[31m16.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
148
+ "\u001b[?25hUsing cached httpx-0.28.1-py3-none-any.whl (73 kB)\n",
149
+ "Using cached httpcore-1.0.9-py3-none-any.whl (78 kB)\n",
150
+ "Downloading multiprocess-0.70.18-py312-none-any.whl (150 kB)\n",
151
+ "\u001b[2K \u001b[38;2;114;156;31m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m150.3/150.3 kB\u001b[0m \u001b[31m18.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
152
+ "\u001b[?25hDownloading pyarrow-22.0.0-cp312-cp312-manylinux_2_28_x86_64.whl (47.7 MB)\n",
153
+ "\u001b[2K \u001b[38;2;114;156;31m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m47.7/47.7 MB\u001b[0m \u001b[31m33.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0mm eta \u001b[36m0:00:01\u001b[0m[36m0:00:01\u001b[0m\n",
154
+ "\u001b[?25hDownloading pandas-2.3.3-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (12.4 MB)\n",
155
+ "\u001b[2K \u001b[38;2;114;156;31m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m12.4/12.4 MB\u001b[0m \u001b[31m43.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0mm eta \u001b[36m0:00:01\u001b[0m0:01\u001b[0m01\u001b[0m\n",
156
+ "\u001b[?25hDownloading xxhash-3.6.0-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl (193 kB)\n",
157
+ "\u001b[2K \u001b[38;2;114;156;31m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m193.9/193.9 kB\u001b[0m \u001b[31m19.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
158
+ "\u001b[?25hDownloading aiohttp-3.13.2-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl (1.8 MB)\n",
159
+ "\u001b[2K \u001b[38;2;114;156;31m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.8/1.8 MB\u001b[0m \u001b[31m38.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m31m44.4 MB/s\u001b[0m eta \u001b[36m0:00:01\u001b[0m\n",
160
+ "\u001b[?25hDownloading pytz-2025.2-py2.py3-none-any.whl (509 kB)\n",
161
+ "\u001b[2K \u001b[38;2;114;156;31m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m509.2/509.2 kB\u001b[0m \u001b[31m31.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
162
+ "\u001b[?25hDownloading tzdata-2025.3-py2.py3-none-any.whl (348 kB)\n",
163
+ "\u001b[2K \u001b[38;2;114;156;31m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m348.5/348.5 kB\u001b[0m \u001b[31m40.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
164
+ "\u001b[?25hDownloading anyio-4.12.0-py3-none-any.whl (113 kB)\n",
165
+ "\u001b[2K \u001b[38;2;114;156;31m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m113.4/113.4 kB\u001b[0m \u001b[31m29.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
166
+ "\u001b[?25hDownloading aiohappyeyeballs-2.6.1-py3-none-any.whl (15 kB)\n",
167
+ "Downloading aiosignal-1.4.0-py3-none-any.whl (7.5 kB)\n",
168
+ "Downloading attrs-25.4.0-py3-none-any.whl (67 kB)\n",
169
+ "\u001b[2K \u001b[38;2;114;156;31m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m67.6/67.6 kB\u001b[0m \u001b[31m18.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
170
+ "\u001b[?25hDownloading frozenlist-1.8.0-cp312-cp312-manylinux1_x86_64.manylinux_2_28_x86_64.manylinux_2_5_x86_64.whl (242 kB)\n",
171
+ "\u001b[2K \u001b[38;2;114;156;31m━━━━━━━━━━━━━━━━━━━━━━━━━━��━━━━━━━━━━━━━\u001b[0m \u001b[32m242.4/242.4 kB\u001b[0m \u001b[31m25.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
172
+ "\u001b[?25hUsing cached h11-0.16.0-py3-none-any.whl (37 kB)\n",
173
+ "Downloading multidict-6.7.0-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl (256 kB)\n",
174
+ "\u001b[2K \u001b[38;2;114;156;31m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m256.1/256.1 kB\u001b[0m \u001b[31m23.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
175
+ "\u001b[?25hDownloading propcache-0.4.1-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl (221 kB)\n",
176
+ "\u001b[2K \u001b[38;2;114;156;31m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m221.6/221.6 kB\u001b[0m \u001b[31m28.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
177
+ "\u001b[?25hDownloading yarl-1.22.0-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl (377 kB)\n",
178
+ "\u001b[2K \u001b[38;2;114;156;31m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m377.3/377.3 kB\u001b[0m \u001b[31m31.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
179
+ "\u001b[?25hInstalling collected packages: pytz, xxhash, tzdata, pyarrow, propcache, multidict, h11, fsspec, frozenlist, dill, attrs, anyio, aiohappyeyeballs, yarl, pandas, multiprocess, httpcore, aiosignal, httpx, aiohttp, datasets\n",
180
+ " Attempting uninstall: fsspec\n",
181
+ " Found existing installation: fsspec 2025.12.0\n",
182
+ " Uninstalling fsspec-2025.12.0:\n",
183
+ " Successfully uninstalled fsspec-2025.12.0\n",
184
+ "\u001b[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.\n",
185
+ "genomix-research 0.1.0 requires absl-py==2.1.0, which is not installed.\n",
186
+ "genomix-research 0.1.0 requires aiobotocore==2.21.1, which is not installed.\n",
187
+ "genomix-research 0.1.0 requires aioitertools==0.12.0, which is not installed.\n",
188
+ "genomix-research 0.1.0 requires antlr4-python3-runtime==4.9.3, which is not installed.\n",
189
+ "genomix-research 0.1.0 requires argon2-cffi==23.1.0, which is not installed.\n",
190
+ "genomix-research 0.1.0 requires argon2-cffi-bindings==21.2.0, which is not installed.\n",
191
+ "genomix-research 0.1.0 requires array-record==0.8.1, which is not installed.\n",
192
+ "genomix-research 0.1.0 requires arrow==1.3.0, which is not installed.\n",
193
+ "genomix-research 0.1.0 requires astunparse==1.6.3, which is not installed.\n",
194
+ "genomix-research 0.1.0 requires async-lru==2.0.5, which is not installed.\n",
195
+ "genomix-research 0.1.0 requires babel==2.17.0, which is not installed.\n",
196
+ "genomix-research 0.1.0 requires beautifulsoup4==4.13.3, which is not installed.\n",
197
+ "genomix-research 0.1.0 requires biopython==1.85, which is not installed.\n",
198
+ "genomix-research 0.1.0 requires bleach==6.2.0, which is not installed.\n",
199
+ "genomix-research 0.1.0 requires boto3==1.37.1, which is not installed.\n",
200
+ "genomix-research 0.1.0 requires botocore==1.37.1, which is not installed.\n",
201
+ "genomix-research 0.1.0 requires bravado==11.1.0, which is not installed.\n",
202
+ "genomix-research 0.1.0 requires bravado-core==5.16.1, which is not installed.\n",
203
+ "genomix-research 0.1.0 requires bx-python==0.13.0, which is not installed.\n",
204
+ "genomix-research 0.1.0 requires cachetools==5.5.2, which is not installed.\n",
205
+ "genomix-research 0.1.0 requires cffi==1.17.1, which is not installed.\n",
206
+ "genomix-research 0.1.0 requires cfgv==3.4.0, which is not installed.\n",
207
+ "genomix-research 0.1.0 requires chex==0.1.88, which is not installed.\n",
208
+ "genomix-research 0.1.0 requires click==8.1.8, which is not installed.\n",
209
+ "genomix-research 0.1.0 requires cloudpickle==3.1.1, which is not installed.\n",
210
+ "genomix-research 0.1.0 requires defusedxml==0.7.1, which is not installed.\n",
211
+ "genomix-research 0.1.0 requires distlib==0.3.9, which is not installed.\n",
212
+ "genomix-research 0.1.0 requires distrax>=0.1.5, which is not installed.\n",
213
+ "genomix-research 0.1.0 requires dm-tree==0.1.9, which is not installed.\n",
214
+ "genomix-research 0.1.0 requires etils==1.12.1, which is not installed.\n",
215
+ "genomix-research 0.1.0 requires fastjsonschema==2.21.1, which is not installed.\n",
216
+ "genomix-research 0.1.0 requires flatbuffers==25.2.10, which is not installed.\n",
217
+ "genomix-research 0.1.0 requires flax==0.10.4, which is not installed.\n",
218
+ "genomix-research 0.1.0 requires fqdn==1.5.1, which is not installed.\n",
219
+ "genomix-research 0.1.0 requires future==1.0.0, which is not installed.\n",
220
+ "genomix-research 0.1.0 requires gast==0.6.0, which is not installed.\n",
221
+ "genomix-research 0.1.0 requires gcsfs==2025.3.0, which is not installed.\n",
222
+ "genomix-research 0.1.0 requires gitdb==4.0.12, which is not installed.\n",
223
+ "genomix-research 0.1.0 requires gitpython==3.1.44, which is not installed.\n",
224
+ "genomix-research 0.1.0 requires google-api-core==2.24.1, which is not installed.\n",
225
+ "genomix-research 0.1.0 requires google-api-python-client==2.165.0, which is not installed.\n",
226
+ "genomix-research 0.1.0 requires google-auth==2.38.0, which is not installed.\n",
227
+ "genomix-research 0.1.0 requires google-auth-httplib2==0.2.0, which is not installed.\n",
228
+ "genomix-research 0.1.0 requires google-auth-oauthlib==1.2.1, which is not installed.\n",
229
+ "genomix-research 0.1.0 requires google-cloud-core==2.4.2, which is not installed.\n",
230
+ "genomix-research 0.1.0 requires google-cloud-storage==3.1.0, which is not installed.\n",
231
+ "genomix-research 0.1.0 requires google-crc32c==1.6.0, which is not installed.\n",
232
+ "genomix-research 0.1.0 requires google-pasta==0.2.0, which is not installed.\n",
233
+ "genomix-research 0.1.0 requires google-resumable-media==2.7.2, which is not installed.\n",
234
+ "genomix-research 0.1.0 requires googleapis-common-protos==1.69.1, which is not installed.\n",
235
+ "genomix-research 0.1.0 requires grain==0.2.11, which is not installed.\n",
236
+ "genomix-research 0.1.0 requires grigri==0.0.2, which is not installed.\n",
237
+ "genomix-research 0.1.0 requires grpcio==1.71.0, which is not installed.\n",
238
+ "genomix-research 0.1.0 requires h5py==3.13.0, which is not installed.\n",
239
+ "genomix-research 0.1.0 requires httplib2==0.22.0, which is not installed.\n",
240
+ "genomix-research 0.1.0 requires humanize==4.12.1, which is not installed.\n",
241
+ "genomix-research 0.1.0 requires hydra-core==1.3.2, which is not installed.\n",
242
+ "genomix-research 0.1.0 requires identify==2.6.9, which is not installed.\n",
243
+ "genomix-research 0.1.0 requires importlib-resources==6.5.2, which is not installed.\n",
244
+ "genomix-research 0.1.0 requires iniconfig==2.0.0, which is not installed.\n",
245
+ "genomix-research 0.1.0 requires isoduration==20.11.0, which is not installed.\n",
246
+ "genomix-research 0.1.0 requires jax==0.5.3, which is not installed.\n",
247
+ "genomix-research 0.1.0 requires jaxlib==0.5.3, which is not installed.\n",
248
+ "genomix-research 0.1.0 requires jaxtyping==0.2.38, which is not installed.\n",
249
+ "genomix-research 0.1.0 requires jmespath==1.0.1, which is not installed.\n",
250
+ "genomix-research 0.1.0 requires json5==0.10.0, which is not installed.\n",
251
+ "genomix-research 0.1.0 requires jsonpointer==3.0.0, which is not installed.\n",
252
+ "genomix-research 0.1.0 requires jsonref==1.1.0, which is not installed.\n",
253
+ "genomix-research 0.1.0 requires jsonschema==4.23.0, which is not installed.\n",
254
+ "genomix-research 0.1.0 requires jsonschema-specifications==2024.10.1, which is not installed.\n",
255
+ "genomix-research 0.1.0 requires jupyter==1.1.1, which is not installed.\n",
256
+ "genomix-research 0.1.0 requires jupyter-console==6.6.3, which is not installed.\n",
257
+ "genomix-research 0.1.0 requires jupyter-events==0.12.0, which is not installed.\n",
258
+ "genomix-research 0.1.0 requires jupyter-lsp==2.2.5, which is not installed.\n",
259
+ "genomix-research 0.1.0 requires jupyter-server==2.15.0, which is not installed.\n",
260
+ "genomix-research 0.1.0 requires jupyter-server-terminals==0.5.3, which is not installed.\n",
261
+ "genomix-research 0.1.0 requires jupyterlab==4.3.6, which is not installed.\n",
262
+ "genomix-research 0.1.0 requires jupyterlab-pygments==0.3.0, which is not installed.\n",
263
+ "genomix-research 0.1.0 requires jupyterlab-server==2.27.3, which is not installed.\n",
264
+ "genomix-research 0.1.0 requires keras>=3.11.3, which is not installed.\n",
265
+ "genomix-research 0.1.0 requires libclang==18.1.1, which is not installed.\n",
266
+ "genomix-research 0.1.0 requires markdown==3.7, which is not installed.\n",
267
+ "genomix-research 0.1.0 requires markdown-it-py==3.0.0, which is not installed.\n",
268
+ "genomix-research 0.1.0 requires mdurl==0.1.2, which is not installed.\n",
269
+ "genomix-research 0.1.0 requires mistune==3.1.3, which is not installed.\n",
270
+ "genomix-research 0.1.0 requires ml-dtypes==0.5.1, which is not installed.\n",
271
+ "genomix-research 0.1.0 requires monotonic==1.6, which is not installed.\n",
272
+ "genomix-research 0.1.0 requires more-itertools==10.6.0, which is not installed.\n",
273
+ "genomix-research 0.1.0 requires msgpack==1.1.0, which is not installed.\n",
274
+ "genomix-research 0.1.0 requires namex==0.0.8, which is not installed.\n",
275
+ "genomix-research 0.1.0 requires natsort==8.4.0, which is not installed.\n",
276
+ "genomix-research 0.1.0 requires nbclient==0.10.2, which is not installed.\n",
277
+ "genomix-research 0.1.0 requires nbconvert==7.16.6, which is not installed.\n",
278
+ "genomix-research 0.1.0 requires nbformat==5.10.4, which is not installed.\n",
279
+ "genomix-research 0.1.0 requires ncls==0.0.68, which is not installed.\n",
280
+ "genomix-research 0.1.0 requires neptune==1.13.0, which is not installed.\n",
281
+ "genomix-research 0.1.0 requires nodeenv==1.9.1, which is not installed.\n",
282
+ "genomix-research 0.1.0 requires notebook==7.3.3, which is not installed.\n",
283
+ "genomix-research 0.1.0 requires notebook-shim==0.2.4, which is not installed.\n",
284
+ "genomix-research 0.1.0 requires oauthlib==3.2.2, which is not installed.\n",
285
+ "genomix-research 0.1.0 requires omegaconf==2.3.0, which is not installed.\n",
286
+ "genomix-research 0.1.0 requires opt-einsum==3.4.0, which is not installed.\n",
287
+ "genomix-research 0.1.0 requires optax==0.2.4, which is not installed.\n",
288
+ "genomix-research 0.1.0 requires optree==0.14.1, which is not installed.\n",
289
+ "genomix-research 0.1.0 requires orbax==0.1.9, which is not installed.\n",
290
+ "genomix-research 0.1.0 requires orbax-checkpoint==0.11.8, which is not installed.\n",
291
+ "genomix-research 0.1.0 requires overrides==7.7.0, which is not installed.\n",
292
+ "genomix-research 0.1.0 requires pandocfilters==1.5.1, which is not installed.\n",
293
+ "genomix-research 0.1.0 requires pluggy==1.5.0, which is not installed.\n",
294
+ "genomix-research 0.1.0 requires pre-commit==4.1.0, which is not installed.\n",
295
+ "genomix-research 0.1.0 requires prometheus-client==0.21.1, which is not installed.\n",
296
+ "genomix-research 0.1.0 requires proto-plus==1.26.0, which is not installed.\n",
297
+ "genomix-research 0.1.0 requires protobuf==4.25.7, which is not installed.\n",
298
+ "genomix-research 0.1.0 requires pyasn1==0.6.1, which is not installed.\n",
299
+ "genomix-research 0.1.0 requires pyasn1-modules==0.4.1, which is not installed.\n",
300
+ "genomix-research 0.1.0 requires pycparser==2.22, which is not installed.\n",
301
+ "genomix-research 0.1.0 requires pyjwt==2.10.1, which is not installed.\n",
302
+ "genomix-research 0.1.0 requires pyranges==0.1.4, which is not installed.\n",
303
+ "genomix-research 0.1.0 requires pysam==0.23.0, which is not installed.\n",
304
+ "genomix-research 0.1.0 requires pytest==8.3.5, which is not installed.\n",
305
+ "genomix-research 0.1.0 requires pytest-randomly>=3.16.0, which is not installed.\n",
306
+ "genomix-research 0.1.0 requires pytest-split>=0.10.0, which is not installed.\n",
307
+ "genomix-research 0.1.0 requires python-json-logger==3.3.0, which is not installed.\n",
308
+ "genomix-research 0.1.0 requires ray[default]>=2.49.0, which is not installed.\n",
309
+ "genomix-research 0.1.0 requires referencing==0.36.2, which is not installed.\n",
310
+ "genomix-research 0.1.0 requires requests-oauthlib==2.0.0, which is not installed.\n",
311
+ "genomix-research 0.1.0 requires rfc3339-validator==0.1.4, which is not installed.\n",
312
+ "genomix-research 0.1.0 requires rfc3986-validator==0.1.1, which is not installed.\n",
313
+ "genomix-research 0.1.0 requires rfc3987==1.3.8, which is not installed.\n",
314
+ "genomix-research 0.1.0 requires rich==13.9.4, which is not installed.\n",
315
+ "genomix-research 0.1.0 requires rpds-py==0.23.1, which is not installed.\n",
316
+ "genomix-research 0.1.0 requires rsa==4.7, which is not installed.\n",
317
+ "genomix-research 0.1.0 requires s3fs==2025.3.0, which is not installed.\n",
318
+ "genomix-research 0.1.0 requires s3transfer==0.11.3, which is not installed.\n",
319
+ "genomix-research 0.1.0 requires scikit-learn>=1.6.1, which is not installed.\n",
320
+ "genomix-research 0.1.0 requires scipy==1.15.2, which is not installed.\n",
321
+ "genomix-research 0.1.0 requires seaborn>=0.13.2, which is not installed.\n",
322
+ "genomix-research 0.1.0 requires send2trash==1.8.3, which is not installed.\n",
323
+ "genomix-research 0.1.0 requires simplejson==3.20.1, which is not installed.\n",
324
+ "genomix-research 0.1.0 requires smmap==5.0.2, which is not installed.\n",
325
+ "genomix-research 0.1.0 requires sniffio==1.3.1, which is not installed.\n",
326
+ "genomix-research 0.1.0 requires sorted-nearest==0.0.39, which is not installed.\n",
327
+ "genomix-research 0.1.0 requires soupsieve==2.6, which is not installed.\n",
328
+ "genomix-research 0.1.0 requires swagger-spec-validator==3.0.4, which is not installed.\n",
329
+ "genomix-research 0.1.0 requires tabulate==0.9.0, which is not installed.\n",
330
+ "genomix-research 0.1.0 requires tenacity>=9.1.2, which is not installed.\n",
331
+ "genomix-research 0.1.0 requires tensorboard==2.19.0, which is not installed.\n",
332
+ "genomix-research 0.1.0 requires tensorboard-data-server==0.7.2, which is not installed.\n",
333
+ "genomix-research 0.1.0 requires tensorboard-plugin-profile==2.20.6, which is not installed.\n",
334
+ "genomix-research 0.1.0 requires tensorflow==2.19.0, which is not installed.\n",
335
+ "genomix-research 0.1.0 requires tensorflow-io==0.37.1, which is not installed.\n",
336
+ "genomix-research 0.1.0 requires tensorflow-io-gcs-filesystem==0.37.1, which is not installed.\n",
337
+ "genomix-research 0.1.0 requires tensorstore==0.1.71, which is not installed.\n",
338
+ "genomix-research 0.1.0 requires termcolor==2.5.0, which is not installed.\n",
339
+ "genomix-research 0.1.0 requires terminado==0.18.1, which is not installed.\n",
340
+ "genomix-research 0.1.0 requires tinycss2==1.4.0, which is not installed.\n",
341
+ "genomix-research 0.1.0 requires toolz==1.0.0, which is not installed.\n",
342
+ "genomix-research 0.1.0 requires treescope==0.1.9, which is not installed.\n",
343
+ "genomix-research 0.1.0 requires types-python-dateutil==2.9.0.20241206, which is not installed.\n",
344
+ "genomix-research 0.1.0 requires umap-learn>=0.5.9.post2, which is not installed.\n",
345
+ "genomix-research 0.1.0 requires uri-template==1.3.0, which is not installed.\n",
346
+ "genomix-research 0.1.0 requires uritemplate==4.1.1, which is not installed.\n",
347
+ "genomix-research 0.1.0 requires virtualenv==20.29.3, which is not installed.\n",
348
+ "genomix-research 0.1.0 requires wadler-lindig==0.1.4, which is not installed.\n",
349
+ "genomix-research 0.1.0 requires waffle==0.4.0, which is not installed.\n",
350
+ "genomix-research 0.1.0 requires webcolors==24.11.1, which is not installed.\n",
351
+ "genomix-research 0.1.0 requires webencodings==0.5.1, which is not installed.\n",
352
+ "genomix-research 0.1.0 requires websocket-client==1.8.0, which is not installed.\n",
353
+ "genomix-research 0.1.0 requires werkzeug==3.1.3, which is not installed.\n",
354
+ "genomix-research 0.1.0 requires wheel==0.45.1, which is not installed.\n",
355
+ "genomix-research 0.1.0 requires wrapt==1.17.2, which is not installed.\n",
356
+ "genomix-research 0.1.0 requires zipp==3.21.0, which is not installed.\n",
357
+ "genomix-research 0.1.0 requires aiohappyeyeballs==2.5.0, but you have aiohappyeyeballs 2.6.1 which is incompatible.\n",
358
+ "genomix-research 0.1.0 requires aiohttp==3.11.13, but you have aiohttp 3.13.2 which is incompatible.\n",
359
+ "genomix-research 0.1.0 requires aiosignal==1.3.2, but you have aiosignal 1.4.0 which is incompatible.\n",
360
+ "genomix-research 0.1.0 requires anyio==4.9.0, but you have anyio 4.12.0 which is incompatible.\n",
361
+ "genomix-research 0.1.0 requires asttokens==3.0.0, but you have asttokens 3.0.1 which is incompatible.\n",
362
+ "genomix-research 0.1.0 requires attrs==25.1.0, but you have attrs 25.4.0 which is incompatible.\n",
363
+ "genomix-research 0.1.0 requires certifi==2025.1.31, but you have certifi 2025.11.12 which is incompatible.\n",
364
+ "genomix-research 0.1.0 requires charset-normalizer==3.4.1, but you have charset-normalizer 3.4.4 which is incompatible.\n",
365
+ "genomix-research 0.1.0 requires comm==0.2.2, but you have comm 0.2.3 which is incompatible.\n",
366
+ "genomix-research 0.1.0 requires debugpy==1.8.13, but you have debugpy 1.8.17 which is incompatible.\n",
367
+ "genomix-research 0.1.0 requires executing==2.2.0, but you have executing 2.2.1 which is incompatible.\n",
368
+ "genomix-research 0.1.0 requires filelock==3.17.0, but you have filelock 3.20.0 which is incompatible.\n",
369
+ "genomix-research 0.1.0 requires frozenlist==1.5.0, but you have frozenlist 1.8.0 which is incompatible.\n",
370
+ "genomix-research 0.1.0 requires fsspec==2025.3.0, but you have fsspec 2025.10.0 which is incompatible.\n",
371
+ "genomix-research 0.1.0 requires h11==0.14.0, but you have h11 0.16.0 which is incompatible.\n",
372
+ "genomix-research 0.1.0 requires httpcore==1.0.7, but you have httpcore 1.0.9 which is incompatible.\n",
373
+ "genomix-research 0.1.0 requires idna==3.10, but you have idna 3.11 which is incompatible.\n",
374
+ "genomix-research 0.1.0 requires ipykernel==6.29.5, but you have ipykernel 7.1.0 which is incompatible.\n",
375
+ "genomix-research 0.1.0 requires ipython==9.0.2, but you have ipython 9.8.0 which is incompatible.\n",
376
+ "genomix-research 0.1.0 requires ipywidgets==8.1.5, but you have ipywidgets 8.1.8 which is incompatible.\n",
377
+ "genomix-research 0.1.0 requires jupyter-core==5.7.2, but you have jupyter-core 5.9.1 which is incompatible.\n",
378
+ "genomix-research 0.1.0 requires jupyterlab-widgets==3.0.13, but you have jupyterlab-widgets 3.0.16 which is incompatible.\n",
379
+ "genomix-research 0.1.0 requires markupsafe==3.0.2, but you have markupsafe 3.0.3 which is incompatible.\n",
380
+ "genomix-research 0.1.0 requires matplotlib-inline==0.1.7, but you have matplotlib-inline 0.2.1 which is incompatible.\n",
381
+ "genomix-research 0.1.0 requires multidict==6.1.0, but you have multidict 6.7.0 which is incompatible.\n",
382
+ "genomix-research 0.1.0 requires numpy==2.1.3, but you have numpy 2.3.5 which is incompatible.\n",
383
+ "genomix-research 0.1.0 requires packaging==24.2, but you have packaging 25.0 which is incompatible.\n",
384
+ "genomix-research 0.1.0 requires pandas==2.2.3, but you have pandas 2.3.3 which is incompatible.\n",
385
+ "genomix-research 0.1.0 requires parso==0.8.4, but you have parso 0.8.5 which is incompatible.\n",
386
+ "genomix-research 0.1.0 requires pillow==11.1.0, but you have pillow 12.0.0 which is incompatible.\n",
387
+ "genomix-research 0.1.0 requires platformdirs==4.3.6, but you have platformdirs 4.5.1 which is incompatible.\n",
388
+ "genomix-research 0.1.0 requires prompt-toolkit==3.0.50, but you have prompt-toolkit 3.0.52 which is incompatible.\n",
389
+ "genomix-research 0.1.0 requires propcache==0.3.0, but you have propcache 0.4.1 which is incompatible.\n",
390
+ "genomix-research 0.1.0 requires psutil==7.0.0, but you have psutil 7.1.3 which is incompatible.\n",
391
+ "genomix-research 0.1.0 requires pygments==2.19.1, but you have pygments 2.19.2 which is incompatible.\n",
392
+ "genomix-research 0.1.0 requires pyparsing==3.2.1, but you have pyparsing 3.2.5 which is incompatible.\n",
393
+ "genomix-research 0.1.0 requires pytz==2025.1, but you have pytz 2025.2 which is incompatible.\n",
394
+ "genomix-research 0.1.0 requires pyyaml==6.0.2, but you have pyyaml 6.0.3 which is incompatible.\n",
395
+ "genomix-research 0.1.0 requires pyzmq==26.3.0, but you have pyzmq 27.1.0 which is incompatible.\n",
396
+ "genomix-research 0.1.0 requires regex==2024.11.6, but you have regex 2025.11.3 which is incompatible.\n",
397
+ "genomix-research 0.1.0 requires requests==2.32.3, but you have requests 2.32.5 which is incompatible.\n",
398
+ "genomix-research 0.1.0 requires setuptools==77.0.1, but you have setuptools 80.9.0 which is incompatible.\n",
399
+ "genomix-research 0.1.0 requires tornado==6.4.2, but you have tornado 6.5.2 which is incompatible.\n",
400
+ "genomix-research 0.1.0 requires typing-extensions==4.12.2, but you have typing-extensions 4.15.0 which is incompatible.\n",
401
+ "genomix-research 0.1.0 requires tzdata==2025.1, but you have tzdata 2025.3 which is incompatible.\n",
402
+ "genomix-research 0.1.0 requires urllib3==2.3.0, but you have urllib3 2.6.1 which is incompatible.\n",
403
+ "genomix-research 0.1.0 requires wcwidth==0.2.13, but you have wcwidth 0.2.14 which is incompatible.\n",
404
+ "genomix-research 0.1.0 requires widgetsnbextension==4.0.13, but you have widgetsnbextension 4.0.15 which is incompatible.\n",
405
+ "genomix-research 0.1.0 requires yarl==1.18.3, but you have yarl 1.22.0 which is incompatible.\u001b[0m\u001b[31m\n",
406
+ "\u001b[0mSuccessfully installed aiohappyeyeballs-2.6.1 aiohttp-3.13.2 aiosignal-1.4.0 anyio-4.12.0 attrs-25.4.0 datasets-4.4.1 dill-0.4.0 frozenlist-1.8.0 fsspec-2025.10.0 h11-0.16.0 httpcore-1.0.9 httpx-0.28.1 multidict-6.7.0 multiprocess-0.70.18 pandas-2.3.3 propcache-0.4.1 pyarrow-22.0.0 pytz-2025.2 tzdata-2025.3 xxhash-3.6.0 yarl-1.22.0\n"
407
+ ]
408
+ }
409
+ ],
410
  "source": [
411
  "# Install dependencies\n",
412
+ "!pip install datasets transformers torchmetrics plotly "
413
+ ]
414
+ },
415
+ {
416
+ "cell_type": "code",
417
+ "execution_count": 7,
418
+ "metadata": {},
419
+ "outputs": [],
420
+ "source": [
421
+ "# Imports\n",
422
+ "from typing import List, Dict\n",
423
+ "import os\n",
424
+ "\n",
425
+ "import torch\n",
426
+ "import torch.nn as nn\n",
427
+ "import torch.nn.functional as F\n",
428
+ "from torch.utils.data import DataLoader\n",
429
+ "from torch.optim import AdamW\n",
430
+ "from transformers import AutoConfig, AutoModelForMaskedLM, AutoTokenizer\n",
431
+ "from datasets import load_dataset\n",
432
+ "import numpy as np\n",
433
+ "from torchmetrics import PearsonCorrCoef\n",
434
+ "import plotly.graph_objects as go\n",
435
+ "from IPython.display import display\n",
436
+ "from tqdm import tqdm"
437
  ]
438
  },
439
  {
440
  "cell_type": "markdown",
441
  "metadata": {},
442
  "source": [
443
+ "# 1. ⚙️ Configuration\n",
444
  "\n",
445
  "## Configuration Parameters\n",
446
  "\n",
 
474
  },
475
  {
476
  "cell_type": "code",
477
+ "execution_count": 13,
478
  "metadata": {},
479
+ "outputs": [
480
+ {
481
+ "name": "stdout",
482
+ "output_type": "stream",
483
+ "text": [
484
+ "Using device: cpu\n"
485
+ ]
486
+ }
487
+ ],
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
488
  "source": [
489
  "config = {\n",
490
  " # Model\n",
491
  " \"model_name\": \"InstaDeepAI/NTv3_8M_pre\",\n",
492
  " \n",
493
+ " # Data - Hugging Face Dataset Configuration\n",
494
+ " \"dataset_name\": \"InstaDeepAI/bigwig_tracks\", # Hugging Face dataset name or path to script\n",
495
  " \"data_cache_dir\": \"./data\",\n",
496
  " \"fasta_url\": \"https://hgdownload.gi.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz\",\n",
497
  " \"bigwig_url_list\": [\n",
 
526
  "\n",
527
  "os.makedirs(config[\"data_cache_dir\"], exist_ok=True)\n",
528
  "\n",
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
529
  "# Create bigwig_file_ids from filenames (without extension)\n",
530
  "config[\"bigwig_file_ids\"] = [\n",
 
 
531
  " \"ENCSR325NFE\",\n",
532
  " \"ENCSR962OTG\",\n",
533
  " \"ENCSR619DQO_P\",\n",
 
547
  "cell_type": "markdown",
548
  "metadata": {},
549
  "source": [
550
+ "# 2. 🧠 Model and tokenizer setup\n",
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
551
  " \n",
552
  "In this section, we set up the model and tokenizer. \n",
553
  " \n",
 
557
  "This linear head is trained for regression on a set of genomic tracks, \n",
558
  "allowing the model to make predictions for each track at single nucleotide resolution.\n",
559
  " \n",
560
+ "The following code wraps the HuggingFace model together with this regression head for the end-to-end task."
561
  ]
562
  },
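The notebook's actual wrapper is defined in the next code cell. As a rough illustration of the idea described above, a backbone-plus-linear-head wrapper can be sketched as follows; the class name `BigWigHeadModel`, the `hidden_size` config attribute, the `output_hidden_states` argument, and the softplus activation are assumptions for illustration, not the notebook's exact implementation:

```python
import torch
import torch.nn as nn
from transformers import AutoModelForMaskedLM


class BigWigHeadModel(nn.Module):
    """Illustrative wrapper: NTv3 backbone + linear head giving one value per track at each position."""

    def __init__(self, model_name: str, num_tracks: int):
        super().__init__()
        self.backbone = AutoModelForMaskedLM.from_pretrained(model_name, trust_remote_code=True)
        hidden_size = self.backbone.config.hidden_size
        # Linear regression head; softplus keeps predicted signal values non-negative.
        self.head = nn.Sequential(nn.Linear(hidden_size, num_tracks), nn.Softplus())

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        outputs = self.backbone(input_ids=tokens, output_hidden_states=True)
        embeddings = outputs.hidden_states[-1]   # (batch, seq_len, hidden_size)
        return self.head(embeddings)             # (batch, seq_len, num_tracks)
```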
563
  {
564
  "cell_type": "code",
565
+ "execution_count": 9,
566
  "metadata": {},
567
  "outputs": [],
568
  "source": [
 
630
  },
631
  {
632
  "cell_type": "code",
633
+ "execution_count": 10,
634
  "metadata": {},
635
+ "outputs": [
636
+ {
637
+ "name": "stdout",
638
+ "output_type": "stream",
639
+ "text": [
640
+ "Model loaded: InstaDeepAI/NTv3_8M_pre\n",
641
+ "Number of bigwig tracks: 4\n",
642
+ "Model parameters: 7,694,015\n"
643
+ ]
644
+ }
645
+ ],
646
  "source": [
647
  "# Load tokenizer\n",
648
  "tokenizer = AutoTokenizer.from_pretrained(config[\"model_name\"], trust_remote_code=True)\n",
 
658
  "\n",
659
  "print(f\"Model loaded: {config['model_name']}\")\n",
660
  "print(f\"Number of bigwig tracks: {len(config['bigwig_file_ids'])}\")\n",
661
+ "print(f\"Model parameters: {sum(p.numel() for p in model.parameters()):,}\")\n"
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
662
  ]
663
  },
664
  {
665
  "cell_type": "markdown",
666
  "metadata": {},
667
  "source": [
668
+ "# 3. 📥 Dataset setup\n",
669
  "\n",
670
+ "Load the Hugging Face dataset and set up the data pipeline. The dataset automatically handles downloading FASTA and BigWig files, normalizing tracks, and sampling random genomic windows."
671
  ]
672
  },
673
  {
674
  "cell_type": "code",
675
+ "execution_count": 14,
676
  "metadata": {},
677
+ "outputs": [
678
+ {
679
+ "name": "stdout",
680
+ "output_type": "stream",
681
+ "text": [
682
+ "Loading dataset from InstaDeepAI/bigwig_tracks...\n"
683
+ ]
684
+ },
685
+ {
686
+ "data": {
687
+ "application/vnd.jupyter.widget-view+json": {
688
+ "model_id": "d9e36ca0c8e544339833c04f68f485aa",
689
+ "version_major": 2,
690
+ "version_minor": 0
691
+ },
692
+ "text/plain": [
693
+ "README.md: 0%| | 0.00/4.24k [00:00<?, ?B/s]"
694
+ ]
695
+ },
696
+ "metadata": {},
697
+ "output_type": "display_data"
698
+ },
699
+ {
700
+ "ename": "FileNotFoundError",
701
+ "evalue": "Couldn't find any data file at /home/y-bornachot/ntv3/notebooks/InstaDeepAI/bigwig_tracks. Couldn't find 'InstaDeepAI/bigwig_tracks' on the Hugging Face Hub either: FileNotFoundError: Unable to find 'hf://datasets/InstaDeepAI/bigwig_tracks@7fe68eaafda66223c3fe392f5fa2ad81173047a1/./data/chr1' with any supported extension ['.csv', '.tsv', '.json', '.jsonl', '.ndjson', '.parquet', '.geoparquet', '.gpq', '.arrow', '.txt', '.tar', '.xml', '.hdf5', '.h5', '.blp', '.bmp', '.dib', '.bufr', '.cur', '.pcx', '.dcx', '.dds', '.ps', '.eps', '.fit', '.fits', '.fli', '.flc', '.ftc', '.ftu', '.gbr', '.gif', '.grib', '.png', '.apng', '.jp2', '.j2k', '.jpc', '.jpf', '.jpx', '.j2c', '.icns', '.ico', '.im', '.iim', '.tif', '.tiff', '.jfif', '.jpe', '.jpg', '.jpeg', '.mpg', '.mpeg', '.msp', '.pcd', '.pxr', '.pbm', '.pgm', '.ppm', '.pnm', '.psd', '.bw', '.rgb', '.rgba', '.sgi', '.ras', '.tga', '.icb', '.vda', '.vst', '.webp', '.wmf', '.emf', '.xbm', '.xpm', '.BLP', '.BMP', '.DIB', '.BUFR', '.CUR', '.PCX', '.DCX', '.DDS', '.PS', '.EPS', '.FIT', '.FITS', '.FLI', '.FLC', '.FTC', '.FTU', '.GBR', '.GIF', '.GRIB', '.PNG', '.APNG', '.JP2', '.J2K', '.JPC', '.JPF', '.JPX', '.J2C', '.ICNS', '.ICO', '.IM', '.IIM', '.TIF', '.TIFF', '.JFIF', '.JPE', '.JPG', '.JPEG', '.MPG', '.MPEG', '.MSP', '.PCD', '.PXR', '.PBM', '.PGM', '.PPM', '.PNM', '.PSD', '.BW', '.RGB', '.RGBA', '.SGI', '.RAS', '.TGA', '.ICB', '.VDA', '.VST', '.WEBP', '.WMF', '.EMF', '.XBM', '.XPM', '.aiff', '.au', '.avr', '.caf', '.flac', '.htk', '.svx', '.mat4', '.mat5', '.mpc2k', '.ogg', '.paf', '.pvf', '.raw', '.rf64', '.sd2', '.sds', '.ircam', '.voc', '.w64', '.wav', '.nist', '.wavex', '.wve', '.xi', '.mp3', '.opus', '.3gp', '.3g2', '.avi', '.asf', '.flv', '.mp4', '.mov', '.m4v', '.mkv', '.webm', '.f4v', '.wmv', '.wma', '.ogm', '.mxf', '.nut', '.AIFF', '.AU', '.AVR', '.CAF', '.FLAC', '.HTK', '.SVX', '.MAT4', '.MAT5', '.MPC2K', '.OGG', '.PAF', '.PVF', '.RAW', '.RF64', '.SD2', '.SDS', '.IRCAM', '.VOC', '.W64', '.WAV', '.NIST', '.WAVEX', '.WVE', '.XI', '.MP3', '.OPUS', '.3GP', '.3G2', '.AVI', '.ASF', '.FLV', '.MP4', '.MOV', '.M4V', '.MKV', '.WEBM', '.F4V', '.WMV', '.WMA', '.OGM', '.MXF', '.NUT', '.pdf', '.PDF', '.nii', '.nii.gz', '.NII', '.NII.GZ', '.zip']",
702
+ "output_type": "error",
703
+ "traceback": [
704
+ "\u001b[31m---------------------------------------------------------------------------\u001b[39m",
705
+ "\u001b[31mFileNotFoundError\u001b[39m Traceback (most recent call last)",
706
+ "\u001b[36mCell\u001b[39m\u001b[36m \u001b[39m\u001b[32mIn[14]\u001b[39m\u001b[32m, line 16\u001b[39m\n\u001b[32m 9\u001b[39m num_samples = {\n\u001b[32m 10\u001b[39m \u001b[33m\"\u001b[39m\u001b[33mtrain\u001b[39m\u001b[33m\"\u001b[39m: config[\u001b[33m\"\u001b[39m\u001b[33mnum_steps_training\u001b[39m\u001b[33m\"\u001b[39m] * config[\u001b[33m\"\u001b[39m\u001b[33mbatch_size\u001b[39m\u001b[33m\"\u001b[39m],\n\u001b[32m 11\u001b[39m \u001b[33m\"\u001b[39m\u001b[33mval\u001b[39m\u001b[33m\"\u001b[39m: config[\u001b[33m\"\u001b[39m\u001b[33mnum_validation_samples\u001b[39m\u001b[33m\"\u001b[39m],\n\u001b[32m 12\u001b[39m \u001b[33m\"\u001b[39m\u001b[33mtest\u001b[39m\u001b[33m\"\u001b[39m: config[\u001b[33m\"\u001b[39m\u001b[33mnum_test_samples\u001b[39m\u001b[33m\"\u001b[39m],\n\u001b[32m 13\u001b[39m }\n\u001b[32m 15\u001b[39m \u001b[38;5;28mprint\u001b[39m(\u001b[33mf\u001b[39m\u001b[33m\"\u001b[39m\u001b[33mLoading dataset from \u001b[39m\u001b[38;5;132;01m{\u001b[39;00mconfig[\u001b[33m'\u001b[39m\u001b[33mdataset_name\u001b[39m\u001b[33m'\u001b[39m]\u001b[38;5;132;01m}\u001b[39;00m\u001b[33m...\u001b[39m\u001b[33m\"\u001b[39m)\n\u001b[32m---> \u001b[39m\u001b[32m16\u001b[39m dataset = \u001b[43mload_dataset\u001b[49m\u001b[43m(\u001b[49m\n\u001b[32m 17\u001b[39m \u001b[43m \u001b[49m\u001b[43mconfig\u001b[49m\u001b[43m[\u001b[49m\u001b[33;43m\"\u001b[39;49m\u001b[33;43mdataset_name\u001b[39;49m\u001b[33;43m\"\u001b[39;49m\u001b[43m]\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 18\u001b[39m \u001b[43m \u001b[49m\u001b[43mdata_files\u001b[49m\u001b[43m=\u001b[49m\u001b[43mchrom_splits\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 19\u001b[39m \u001b[43m \u001b[49m\u001b[43mnum_samples\u001b[49m\u001b[43m=\u001b[49m\u001b[43mnum_samples\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 20\u001b[39m \u001b[43m \u001b[49m\u001b[43mfasta_url\u001b[49m\u001b[43m=\u001b[49m\u001b[43mconfig\u001b[49m\u001b[43m[\u001b[49m\u001b[33;43m\"\u001b[39;49m\u001b[33;43mfasta_url\u001b[39;49m\u001b[33;43m\"\u001b[39;49m\u001b[43m]\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 21\u001b[39m \u001b[43m \u001b[49m\u001b[43mbigwig_urls\u001b[49m\u001b[43m=\u001b[49m\u001b[43mconfig\u001b[49m\u001b[43m[\u001b[49m\u001b[33;43m\"\u001b[39;49m\u001b[33;43mbigwig_url_list\u001b[39;49m\u001b[33;43m\"\u001b[39;49m\u001b[43m]\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 22\u001b[39m \u001b[43m \u001b[49m\u001b[43msequence_length\u001b[49m\u001b[43m=\u001b[49m\u001b[43mconfig\u001b[49m\u001b[43m[\u001b[49m\u001b[33;43m\"\u001b[39;49m\u001b[33;43msequence_length\u001b[39;49m\u001b[33;43m\"\u001b[39;49m\u001b[43m]\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 23\u001b[39m \u001b[43m \u001b[49m\u001b[43mdata_dir\u001b[49m\u001b[43m=\u001b[49m\u001b[43mconfig\u001b[49m\u001b[43m[\u001b[49m\u001b[33;43m\"\u001b[39;49m\u001b[33;43mdata_cache_dir\u001b[39;49m\u001b[33;43m\"\u001b[39;49m\u001b[43m]\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 24\u001b[39m \u001b[43m)\u001b[49m\n",
707
+ "\u001b[36mFile \u001b[39m\u001b[32m~/venvs/ntv3-env/lib/python3.12/site-packages/datasets/load.py:1397\u001b[39m, in \u001b[36mload_dataset\u001b[39m\u001b[34m(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, verification_mode, keep_in_memory, save_infos, revision, token, streaming, num_proc, storage_options, **config_kwargs)\u001b[39m\n\u001b[32m 1392\u001b[39m verification_mode = VerificationMode(\n\u001b[32m 1393\u001b[39m (verification_mode \u001b[38;5;129;01mor\u001b[39;00m VerificationMode.BASIC_CHECKS) \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m save_infos \u001b[38;5;28;01melse\u001b[39;00m VerificationMode.ALL_CHECKS\n\u001b[32m 1394\u001b[39m )\n\u001b[32m 1396\u001b[39m \u001b[38;5;66;03m# Create a dataset builder\u001b[39;00m\n\u001b[32m-> \u001b[39m\u001b[32m1397\u001b[39m builder_instance = \u001b[43mload_dataset_builder\u001b[49m\u001b[43m(\u001b[49m\n\u001b[32m 1398\u001b[39m \u001b[43m \u001b[49m\u001b[43mpath\u001b[49m\u001b[43m=\u001b[49m\u001b[43mpath\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 1399\u001b[39m \u001b[43m \u001b[49m\u001b[43mname\u001b[49m\u001b[43m=\u001b[49m\u001b[43mname\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 1400\u001b[39m \u001b[43m \u001b[49m\u001b[43mdata_dir\u001b[49m\u001b[43m=\u001b[49m\u001b[43mdata_dir\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 1401\u001b[39m \u001b[43m \u001b[49m\u001b[43mdata_files\u001b[49m\u001b[43m=\u001b[49m\u001b[43mdata_files\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 1402\u001b[39m \u001b[43m \u001b[49m\u001b[43mcache_dir\u001b[49m\u001b[43m=\u001b[49m\u001b[43mcache_dir\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 1403\u001b[39m \u001b[43m \u001b[49m\u001b[43mfeatures\u001b[49m\u001b[43m=\u001b[49m\u001b[43mfeatures\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 1404\u001b[39m \u001b[43m \u001b[49m\u001b[43mdownload_config\u001b[49m\u001b[43m=\u001b[49m\u001b[43mdownload_config\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 1405\u001b[39m \u001b[43m \u001b[49m\u001b[43mdownload_mode\u001b[49m\u001b[43m=\u001b[49m\u001b[43mdownload_mode\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 1406\u001b[39m \u001b[43m \u001b[49m\u001b[43mrevision\u001b[49m\u001b[43m=\u001b[49m\u001b[43mrevision\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 1407\u001b[39m \u001b[43m \u001b[49m\u001b[43mtoken\u001b[49m\u001b[43m=\u001b[49m\u001b[43mtoken\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 1408\u001b[39m \u001b[43m \u001b[49m\u001b[43mstorage_options\u001b[49m\u001b[43m=\u001b[49m\u001b[43mstorage_options\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 1409\u001b[39m \u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mconfig_kwargs\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 1410\u001b[39m \u001b[43m\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 1412\u001b[39m \u001b[38;5;66;03m# Return iterable dataset in case of streaming\u001b[39;00m\n\u001b[32m 1413\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m streaming:\n",
708
+ "\u001b[36mFile \u001b[39m\u001b[32m~/venvs/ntv3-env/lib/python3.12/site-packages/datasets/load.py:1137\u001b[39m, in \u001b[36mload_dataset_builder\u001b[39m\u001b[34m(path, name, data_dir, data_files, cache_dir, features, download_config, download_mode, revision, token, storage_options, **config_kwargs)\u001b[39m\n\u001b[32m 1135\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m features \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m:\n\u001b[32m 1136\u001b[39m features = _fix_for_backward_compatible_features(features)\n\u001b[32m-> \u001b[39m\u001b[32m1137\u001b[39m dataset_module = \u001b[43mdataset_module_factory\u001b[49m\u001b[43m(\u001b[49m\n\u001b[32m 1138\u001b[39m \u001b[43m \u001b[49m\u001b[43mpath\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 1139\u001b[39m \u001b[43m \u001b[49m\u001b[43mrevision\u001b[49m\u001b[43m=\u001b[49m\u001b[43mrevision\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 1140\u001b[39m \u001b[43m \u001b[49m\u001b[43mdownload_config\u001b[49m\u001b[43m=\u001b[49m\u001b[43mdownload_config\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 1141\u001b[39m \u001b[43m \u001b[49m\u001b[43mdownload_mode\u001b[49m\u001b[43m=\u001b[49m\u001b[43mdownload_mode\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 1142\u001b[39m \u001b[43m \u001b[49m\u001b[43mdata_dir\u001b[49m\u001b[43m=\u001b[49m\u001b[43mdata_dir\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 1143\u001b[39m \u001b[43m \u001b[49m\u001b[43mdata_files\u001b[49m\u001b[43m=\u001b[49m\u001b[43mdata_files\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 1144\u001b[39m \u001b[43m \u001b[49m\u001b[43mcache_dir\u001b[49m\u001b[43m=\u001b[49m\u001b[43mcache_dir\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 1145\u001b[39m \u001b[43m\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 1146\u001b[39m \u001b[38;5;66;03m# Get dataset builder class\u001b[39;00m\n\u001b[32m 1147\u001b[39m builder_kwargs = dataset_module.builder_kwargs\n",
709
+ "\u001b[36mFile \u001b[39m\u001b[32m~/venvs/ntv3-env/lib/python3.12/site-packages/datasets/load.py:1032\u001b[39m, in \u001b[36mdataset_module_factory\u001b[39m\u001b[34m(path, revision, download_config, download_mode, data_dir, data_files, cache_dir, **download_kwargs)\u001b[39m\n\u001b[32m 1030\u001b[39m \u001b[38;5;28;01mraise\u001b[39;00m e1 \u001b[38;5;28;01mfrom\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[38;5;28;01mNone\u001b[39;00m\n\u001b[32m 1031\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28misinstance\u001b[39m(e1, \u001b[38;5;167;01mFileNotFoundError\u001b[39;00m):\n\u001b[32m-> \u001b[39m\u001b[32m1032\u001b[39m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mFileNotFoundError\u001b[39;00m(\n\u001b[32m 1033\u001b[39m \u001b[33mf\u001b[39m\u001b[33m\"\u001b[39m\u001b[33mCouldn\u001b[39m\u001b[33m'\u001b[39m\u001b[33mt find any data file at \u001b[39m\u001b[38;5;132;01m{\u001b[39;00mrelative_to_absolute_path(path)\u001b[38;5;132;01m}\u001b[39;00m\u001b[33m. \u001b[39m\u001b[33m\"\u001b[39m\n\u001b[32m 1034\u001b[39m \u001b[33mf\u001b[39m\u001b[33m\"\u001b[39m\u001b[33mCouldn\u001b[39m\u001b[33m'\u001b[39m\u001b[33mt find \u001b[39m\u001b[33m'\u001b[39m\u001b[38;5;132;01m{\u001b[39;00mpath\u001b[38;5;132;01m}\u001b[39;00m\u001b[33m'\u001b[39m\u001b[33m on the Hugging Face Hub either: \u001b[39m\u001b[38;5;132;01m{\u001b[39;00m\u001b[38;5;28mtype\u001b[39m(e1).\u001b[34m__name__\u001b[39m\u001b[38;5;132;01m}\u001b[39;00m\u001b[33m: \u001b[39m\u001b[38;5;132;01m{\u001b[39;00me1\u001b[38;5;132;01m}\u001b[39;00m\u001b[33m\"\u001b[39m\n\u001b[32m 1035\u001b[39m ) \u001b[38;5;28;01mfrom\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[38;5;28;01mNone\u001b[39;00m\n\u001b[32m 1036\u001b[39m \u001b[38;5;28;01mraise\u001b[39;00m e1 \u001b[38;5;28;01mfrom\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[38;5;28;01mNone\u001b[39;00m\n\u001b[32m 1037\u001b[39m \u001b[38;5;28;01melse\u001b[39;00m:\n",
710
+ "\u001b[31mFileNotFoundError\u001b[39m: Couldn't find any data file at /home/y-bornachot/ntv3/notebooks/InstaDeepAI/bigwig_tracks. Couldn't find 'InstaDeepAI/bigwig_tracks' on the Hugging Face Hub either: FileNotFoundError: Unable to find 'hf://datasets/InstaDeepAI/bigwig_tracks@7fe68eaafda66223c3fe392f5fa2ad81173047a1/./data/chr1' with any supported extension ['.csv', '.tsv', '.json', '.jsonl', '.ndjson', '.parquet', '.geoparquet', '.gpq', '.arrow', '.txt', '.tar', '.xml', '.hdf5', '.h5', '.blp', '.bmp', '.dib', '.bufr', '.cur', '.pcx', '.dcx', '.dds', '.ps', '.eps', '.fit', '.fits', '.fli', '.flc', '.ftc', '.ftu', '.gbr', '.gif', '.grib', '.png', '.apng', '.jp2', '.j2k', '.jpc', '.jpf', '.jpx', '.j2c', '.icns', '.ico', '.im', '.iim', '.tif', '.tiff', '.jfif', '.jpe', '.jpg', '.jpeg', '.mpg', '.mpeg', '.msp', '.pcd', '.pxr', '.pbm', '.pgm', '.ppm', '.pnm', '.psd', '.bw', '.rgb', '.rgba', '.sgi', '.ras', '.tga', '.icb', '.vda', '.vst', '.webp', '.wmf', '.emf', '.xbm', '.xpm', '.BLP', '.BMP', '.DIB', '.BUFR', '.CUR', '.PCX', '.DCX', '.DDS', '.PS', '.EPS', '.FIT', '.FITS', '.FLI', '.FLC', '.FTC', '.FTU', '.GBR', '.GIF', '.GRIB', '.PNG', '.APNG', '.JP2', '.J2K', '.JPC', '.JPF', '.JPX', '.J2C', '.ICNS', '.ICO', '.IM', '.IIM', '.TIF', '.TIFF', '.JFIF', '.JPE', '.JPG', '.JPEG', '.MPG', '.MPEG', '.MSP', '.PCD', '.PXR', '.PBM', '.PGM', '.PPM', '.PNM', '.PSD', '.BW', '.RGB', '.RGBA', '.SGI', '.RAS', '.TGA', '.ICB', '.VDA', '.VST', '.WEBP', '.WMF', '.EMF', '.XBM', '.XPM', '.aiff', '.au', '.avr', '.caf', '.flac', '.htk', '.svx', '.mat4', '.mat5', '.mpc2k', '.ogg', '.paf', '.pvf', '.raw', '.rf64', '.sd2', '.sds', '.ircam', '.voc', '.w64', '.wav', '.nist', '.wavex', '.wve', '.xi', '.mp3', '.opus', '.3gp', '.3g2', '.avi', '.asf', '.flv', '.mp4', '.mov', '.m4v', '.mkv', '.webm', '.f4v', '.wmv', '.wma', '.ogm', '.mxf', '.nut', '.AIFF', '.AU', '.AVR', '.CAF', '.FLAC', '.HTK', '.SVX', '.MAT4', '.MAT5', '.MPC2K', '.OGG', '.PAF', '.PVF', '.RAW', '.RF64', '.SD2', '.SDS', '.IRCAM', '.VOC', '.W64', '.WAV', '.NIST', '.WAVEX', '.WVE', '.XI', '.MP3', '.OPUS', '.3GP', '.3G2', '.AVI', '.ASF', '.FLV', '.MP4', '.MOV', '.M4V', '.MKV', '.WEBM', '.F4V', '.WMV', '.WMA', '.OGM', '.MXF', '.NUT', '.pdf', '.PDF', '.nii', '.nii.gz', '.NII', '.NII.GZ', '.zip']"
711
+ ]
712
+ }
713
+ ],
714
  "source": [
715
+ "# Chromosomes split definition\n",
716
+ "chrom_splits = {\n",
717
+ " \"train\": [f\"chr{i}\" for i in range(1, 21)] + ['chrX', 'chrY'],\n",
718
+ " \"val\": ['chr22'],\n",
719
+ " \"test\": ['chr21']\n",
720
+ "}\n",
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
721
  "\n",
722
+ "# Number of desired samples per split\n",
723
+ "num_samples = {\n",
724
+ " \"train\": config[\"num_steps_training\"] * config[\"batch_size\"],\n",
725
+ " \"val\": config[\"num_validation_samples\"],\n",
726
+ " \"test\": config[\"num_test_samples\"],\n",
727
+ "}\n",
728
  "\n",
729
+ "print(f\"Loading dataset from {config['dataset_name']}...\")\n",
730
+ "dataset = load_dataset(\n",
731
+ " config[\"dataset_name\"],\n",
732
+ " data_files=chrom_splits,\n",
733
+ " num_samples=num_samples,\n",
734
+ " fasta_url=config[\"fasta_url\"],\n",
735
+ " bigwig_urls=config[\"bigwig_url_list\"],\n",
736
+ " sequence_length=config[\"sequence_length\"],\n",
737
+ " data_dir=config[\"data_cache_dir\"],\n",
738
+ ")"
 
 
 
 
 
 
 
 
739
  ]
740
  },
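Once the dataset loads, it helps to sanity-check one raw example before tokenization. This is only a sketch: the column names `sequence` and `bigwig_targets` follow the columns used later in this notebook, and the expected target shape is an assumption about the dataset script's output.

```python
import numpy as np

# Quick sanity check of one raw training example before tokenization.
example = dataset["train"][0]
print(example["sequence"][:60], "...")                  # raw DNA string
targets = np.asarray(example["bigwig_targets"])
print("targets shape:", targets.shape)                  # expected (sequence_length, num_tracks)
print("per-track mean signal:", targets.mean(axis=0))
```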
741
  {
 
744
  "metadata": {},
745
  "outputs": [],
746
  "source": [
747
+ "# Tokenization function\n",
748
+ "def tokenize_examples(examples):\n",
749
+ " \"\"\"Tokenize sequences and prepare targets.\"\"\"\n",
750
+ " sequences = examples[\"sequence\"]\n",
751
+ " \n",
752
+ " # Tokenize sequences\n",
753
+ " tokenized = tokenizer(\n",
754
+ " sequences,\n",
755
+ " max_length=config[\"sequence_length\"],\n",
756
+ " padding=\"max_length\",\n",
757
+ " truncation=True,\n",
758
+ " return_tensors=None,\n",
759
+ " )\n",
760
+ " \n",
761
+ " # Crop targets to center fraction if needed\n",
762
+ " if config[\"keep_target_center_fraction\"] < 1.0:\n",
763
+ " seq_len = examples[\"bigwig_targets\"].shape[0]\n",
764
+ " target_offset = int(seq_len * (1 - config[\"keep_target_center_fraction\"]) // 2)\n",
765
+ " target_length = seq_len - 2 * target_offset\n",
766
+ " examples[\"bigwig_targets\"] = examples[\"bigwig_targets\"][target_offset:target_offset + target_length, :]\n",
767
+ " \n",
768
+ " return {\n",
769
+ " \"tokens\": tokenized[\"input_ids\"],\n",
770
+ " \"bigwig_targets\": examples[\"bigwig_targets\"],\n",
771
+ " }\n",
772
  "\n",
773
+ "# Apply tokenization\n",
774
+ "print(\"Tokenizing sequences...\")\n",
775
+ "dataset = dataset.map(\n",
776
+ " tokenize_examples,\n",
777
+ " batched=True,\n",
778
+ " remove_columns=[\"sequence\"], # Remove original sequence after tokenization\n",
779
  ")\n",
780
  "\n",
781
+ "# Format for PyTorch\n",
782
+ "dataset = dataset.with_format(\"torch\")\n",
 
 
 
 
 
783
  "\n",
784
+ "dataloaders = {}\n",
785
+ "for split_name in chrom_splits.keys():\n",
786
+ " dataloaders[split_name] = DataLoader(\n",
787
+ " dataset[split_name],\n",
788
+ " batch_size=config[\"batch_size\"],\n",
789
+ " shuffle=(split_name == \"train\"),\n",
790
+ " num_workers=config[\"num_workers\"],\n",
791
+ " )\n",
792
  "\n",
793
+ "# Extract DataLoaders\n",
794
+ "train_loader = dataloaders[\"train\"]\n",
795
+ "val_loader = dataloaders[\"val\"]\n",
796
+ "test_loader = dataloaders[\"test\"]\n",
 
 
797
  "\n",
798
+ "print(f\"\\nData pipeline created successfully!\")\n",
799
+ "print(f\"Train batches: {len(train_loader)}\")\n",
800
+ "print(f\"Val batches: {len(val_loader)}\")\n",
801
+ "print(f\"Test batches: {len(test_loader)}\")"
802
  ]
803
  },
804
  {
805
  "cell_type": "markdown",
806
  "metadata": {},
807
  "source": [
808
+ "# 4. ⚙️ Optimizer setup\n",
809
  "\n",
810
+ "Configure the AdamW optimizer with learning rate and weight decay hyperparameters. This optimizer will update the model parameters during training to minimize the loss function."
 
811
  ]
812
  },
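The optimizer itself is created in the following code cell. A minimal version of what the description above calls for looks like the sketch below; the config keys `learning_rate` and `weight_decay` are assumed names and may not match the notebook's config exactly.

```python
from torch.optim import AdamW

# Minimal sketch: AdamW over all trainable parameters.
# "learning_rate" and "weight_decay" are assumed config keys; defaults are fallbacks only.
optimizer = AdamW(
    model.parameters(),
    lr=config.get("learning_rate", 1e-5),
    weight_decay=config.get("weight_decay", 0.01),
)
```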
813
  {
814
  "cell_type": "code",
815
+ "execution_count": 15,
816
  "metadata": {},
817
+ "outputs": [
818
+ {
819
+ "name": "stdout",
820
+ "output_type": "stream",
821
+ "text": [
822
+ "Training configuration:\n",
823
+ " Batch size: 32\n",
824
+ " Total training steps: 19932\n",
825
+ " Log metrics every: 40 steps\n",
826
+ " Validate every: 400 steps\n",
827
+ "\n",
828
+ "Optimizer setup:\n",
829
+ " Learning rate: 1e-05\n"
830
+ ]
831
+ }
832
+ ],
833
  "source": [
834
  "# Training setup\n",
835
  "print(f\"Training configuration:\")\n",
 
853
  "cell_type": "markdown",
854
  "metadata": {},
855
  "source": [
856
+ "# 5. 📊 Metrics setup\n",
857
  "\n",
858
  "Set up evaluation metrics to track model performance during training and validation. We use Pearson correlation coefficients to measure how well the predicted BigWig signals match the ground truth signals."
859
  ]
860
  },
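The notebook's metric code follows in the next cell. One simple way to get a Pearson correlation per BigWig track with torchmetrics' `PearsonCorrCoef` (imported earlier) is to keep one metric object per track and flatten predictions and targets track by track; the helper names below are illustrative, not the notebook's own functions.

```python
from torchmetrics import PearsonCorrCoef

num_tracks = len(config["bigwig_file_ids"])
# One Pearson correlation metric per BigWig track, updated batch by batch.
pearson_per_track = [PearsonCorrCoef() for _ in range(num_tracks)]


def update_metrics(predictions, targets):
    """predictions, targets: (batch, seq_len, num_tracks) tensors on CPU."""
    for track_idx, metric in enumerate(pearson_per_track):
        metric.update(
            predictions[..., track_idx].reshape(-1),
            targets[..., track_idx].reshape(-1),
        )


def compute_metrics():
    return {
        f"pearson_{track_id}": metric.compute().item()
        for track_id, metric in zip(config["bigwig_file_ids"], pearson_per_track)
    }
```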
861
  {
862
  "cell_type": "code",
863
+ "execution_count": 16,
864
  "metadata": {},
865
  "outputs": [],
866
  "source": [
 
933
  },
934
  {
935
  "cell_type": "code",
936
+ "execution_count": 17,
937
  "metadata": {},
938
  "outputs": [],
939
  "source": [
 
946
  "cell_type": "markdown",
947
  "metadata": {},
948
  "source": [
949
+ "# 6. 📉 Loss function\n",
950
  "\n",
951
  "Define the Poisson-Multinomial loss function that captures both the scale (total signal) and shape (distribution) of BigWig tracks. This loss is specifically designed for count-based genomic signal data."
952
  ]
953
  },
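The notebook's own implementation is in the cell that follows. To make the scale/shape decomposition concrete, the sketch below shows one common way to combine a Poisson term on per-track total counts with a multinomial term on the positional profile; the function name, the `multinomial_weight` hyperparameter, and the equal default weighting are assumptions.

```python
import torch
import torch.nn.functional as F


def poisson_multinomial_loss(predictions, targets, multinomial_weight=1.0, eps=1e-7):
    """Sketch of a Poisson + multinomial loss over (batch, seq_len, num_tracks) tensors.

    The Poisson term compares total predicted vs. observed counts per track (scale);
    the multinomial term compares the positional distribution of the signal (shape).
    """
    pred_total = predictions.sum(dim=1)      # (batch, num_tracks)
    target_total = targets.sum(dim=1)        # (batch, num_tracks)

    # Poisson negative log-likelihood on the per-track totals.
    poisson_term = F.poisson_nll_loss(pred_total, target_total, log_input=False)

    # Multinomial negative log-likelihood on the normalized positional profile.
    pred_profile = predictions / (pred_total.unsqueeze(1) + eps)
    multinomial_term = -(targets * torch.log(pred_profile + eps)).sum(dim=1).mean()

    return poisson_term + multinomial_weight * multinomial_term
```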
954
  {
955
  "cell_type": "code",
956
+ "execution_count": 18,
957
  "metadata": {},
958
  "outputs": [],
959
  "source": [
 
1028
  "cell_type": "markdown",
1029
  "metadata": {},
1030
  "source": [
1031
+ "# 7. 🏃 Training loop\n",
1032
  "\n",
1033
  "Run the main training loop that iterates through batches, computes gradients, and updates model parameters. The loop includes periodic validation checks and real-time metric visualization to monitor training progress."
1034
  ]
1035
  },
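The full loop in the next cell also handles periodic validation and live plotting. Stripped to its essentials, one training pass over the objects built earlier (`model`, `optimizer`, `train_loader`, `device`) looks roughly like the sketch below; the batch keys `tokens` and `bigwig_targets` match the tokenization step above, while the loss call, the model's calling convention, and the fixed logging interval are simplifications taken from the sketches in this notebook rather than the notebook's exact code.

```python
model.train()
for step, batch in enumerate(tqdm(train_loader, total=config["num_steps_training"])):
    tokens = batch["tokens"].to(device)
    targets = batch["bigwig_targets"].to(device).float()

    predictions = model(tokens)                            # (batch, seq_len, num_tracks)
    loss = poisson_multinomial_loss(predictions, targets)  # as sketched in the loss section

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if step % 100 == 0:  # placeholder logging interval
        print(f"step {step}: loss = {loss.item():.4f}")
    if step + 1 >= config["num_steps_training"]:
        break
```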
1036
  {
1037
  "cell_type": "code",
1038
+ "execution_count": 19,
1039
  "metadata": {},
1040
  "outputs": [],
1041
  "source": [
 
1103
  },
1104
  {
1105
  "cell_type": "code",
1106
+ "execution_count": 18,
1107
  "metadata": {},
1108
  "outputs": [],
1109
  "source": [
 
1260
  "cell_type": "markdown",
1261
  "metadata": {},
1262
  "source": [
1263
+ "# 9. 🧪 Test evaluation\n",
1264
  "\n",
1265
  "Evaluate the fine-tuned model on the held-out test set to assess final performance. This provides an unbiased estimate of how well the model generalizes to unseen genomic regions."
1266
  ]
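A minimal evaluation pass over `test_loader` can reuse the per-track Pearson helpers sketched in the metrics section above; this is an illustrative sketch, not the notebook's exact evaluation code.

```python
model.eval()
with torch.no_grad():
    for batch in tqdm(test_loader, desc="Test evaluation"):
        tokens = batch["tokens"].to(device)
        targets = batch["bigwig_targets"].to(device).float()
        predictions = model(tokens)
        update_metrics(predictions.cpu(), targets.cpu())  # per-track Pearson, as sketched earlier

print(compute_metrics())  # one Pearson correlation per BigWig track
```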
 
1338
  },
1339
  "nbformat": 4,
1340
  "nbformat_minor": 2
1341
+ }