🧬 NTv3 Post-Trained Functional Track Prediction

This notebook demonstrates how to use the NTv3 post-trained model to predict functional tracks and genome annotations directly from a DNA sequence.

The pipeline abstracts away the underlying steps: pre-processing the input, running inference with the model, and plotting the predictions for each track.

If you're interested in exploring the intermediate probabilities, please refer to the track-prediction notebook.

🔗 Quick links:
• View notebook on Hugging Face
• Open directly in Google Colab

0) 📦 Imports + setup

Install dependencies:

pip -q install "transformers>=4.55" "huggingface_hub>=0.23" safetensors torch pyfaidx requests seaborn matplotlib igv_notebook pyBigWig

Import required libraries:

import re
import time
import os
import torch
import requests
import numpy as np
import pyBigWig
from transformers import pipeline, AutoConfig

1) 📦 Configuration

Set your NTv3 model and genomic window here:

# Define the model and genomic window
model_name = "InstaDeepAI/NTv3_650M_pos"

species = "human"  # used to condition the model on the species
assembly = "hg38"  # used to fetch the chromosome sequence

chrom = "chr19"
start = 6_700_000
end = 6_831_072
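
The window above spans 131,072 bp, and the next step crops sequences to a multiple of 128 bp. A minimal sketch of a sanity check for a new window is shown here. The helper name `describe_window` is hypothetical and not part of the NTv3 code, and the window is assumed to be start-inclusive and end-exclusive, which matches the 131,072 bp length reported later in this notebook.

```python
# Hypothetical helper, not part of the NTv3 pipeline.
BP_PER_TOKEN = 128  # value taken from the cropping step in section 2


def describe_window(start: int, end: int, bp_per_token: int = BP_PER_TOKEN) -> dict:
    """Summarize a window, assuming start-inclusive / end-exclusive coordinates."""
    if end <= start:
        raise ValueError(f"end ({end}) must be greater than start ({start})")
    length = end - start
    # divmod gives the number of whole tokens and the leftover base pairs
    tokens, remainder = divmod(length, bp_per_token)
    return {"length": length, "tokens": tokens, "remainder": remainder}


info = describe_window(6_700_000, 6_831_072)
print(info)
```

A non-zero `remainder` means that many trailing base pairs would be dropped by the cropping step.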

2) 📄 Fetch chromosome sequence for the chosen window

# Get the sequence from the UCSC API
url = f"https://api.genome.ucsc.edu/getData/sequence?genome={assembly};chrom={chrom};start={start};end={end}"
seq = requests.get(url).json()["dna"].upper()
print(f"Original sequence length: {len(seq)}")

# Crop to multiple of 128 (the pipeline will crop again, but this is a no-op once divisible)
seq = seq[:int(len(seq) // 128) * 128]
print(f"Cropped sequence length: {len(seq)}, {len(seq) / 128} transformer tokens")
Output:
Original sequence length: 131072
Cropped sequence length: 131072, 1024.0 transformer tokens
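
The fetch above assumes the request succeeds and that the response contains a "dna" field. A more defensive variant could look like the sketch below. It uses only the Python standard library instead of `requests`, so it can run outside the notebook environment. The function names, the timeout value, and the length check are assumptions added for illustration; the URL format and the "dna" key are taken from the cell above.

```python
import json
import urllib.request

UCSC_API = "https://api.genome.ucsc.edu/getData/sequence"


def build_ucsc_url(assembly: str, chrom: str, start: int, end: int) -> str:
    """Build the same query URL as the notebook cell above."""
    return f"{UCSC_API}?genome={assembly};chrom={chrom};start={start};end={end}"


def fetch_sequence(assembly: str, chrom: str, start: int, end: int,
                   timeout: float = 30.0) -> str:
    """Fetch an upper-cased DNA sequence; raise if the response looks wrong."""
    url = build_ucsc_url(assembly, chrom, start, end)
    # urlopen raises urllib.error.HTTPError on 4xx/5xx responses
    with urllib.request.urlopen(url, timeout=timeout) as response:
        payload = json.load(response)
    if "dna" not in payload:
        raise RuntimeError(f"No 'dna' field in response for {chrom}:{start}-{end}")
    seq = payload["dna"].upper()
    if len(seq) != end - start:
        raise RuntimeError(f"Expected {end - start} bp, got {len(seq)} bp")
    return seq


def crop_to_multiple(seq: str, multiple: int = 128) -> str:
    """Drop trailing bases so the length is divisible by `multiple`."""
    return seq[: len(seq) // multiple * multiple]


# Usage (requires network access):
# seq = crop_to_multiple(fetch_sequence("hg38", "chr19", 6_700_000, 6_831_072))
```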

3) ⚔ Functional track prediction pipeline (pre-processing, inference, plotting)

# Build NTv3 tracks pipeline
ntv3_tracks = pipeline(
    "ntv3-tracks",
    model=model_name,
    trust_remote_code=True,
    device=0 if torch.cuda.is_available() else -1,
)

# Select tracks to plot
tracks_to_plot = {
    "K562 RNA-seq": "ENCSR056HPM",
    "K562 DNAse": "ENCSR921NMD",
    "K562 H3k4me3": "ENCSR000DWD",
    "K562 CTCF": "ENCSR000AKO",
    "HepG2 RNA-seq": "ENCSR561FEE_P",
    "HepG2 DNAse": "ENCSR000EJV",
    "HepG2 H3k4me3": "ENCSR000AMP",
    "HepG2 CTCF": "ENCSR000BIE",
}
elements_to_plot = ["protein_coding_gene", "exon", "intron", "splice_donor", "splice_acceptor"]

# Run pipeline: DNA -> NTv3 -> Tracks -> plot
start_time = time.time()

ntv3_predictions = ntv3_tracks(
    {"chrom": chrom, "start": start, "end": end, "species": species},
    plot=True,
    tracks_to_plot=tracks_to_plot,
    elements_to_plot=elements_to_plot,
)

end_time = time.time()

print(f"Inference + decoding time: {end_time - start_time:.2f} seconds")
Output:
Device set to use cpu
Running on device: cpu
Inference + decoding time: 38.32 seconds
Output tracks plot

The pipeline performs all the necessary steps: running inference with the model and plotting the predictions for the specified tracks and genomic elements.

4) 📁 Save as BigWig file

# Load config to get track names and find indices for tracks_to_plot
cfg = AutoConfig.from_pretrained(model_name, trust_remote_code=True)
all_bigwig_names = cfg.bigwigs_per_file_assembly[assembly]

# Find indices of tracks we want to save
# Use display names (keys) for filenames, but track IDs (values) to find indices
track_data_list = []  # List of (display_name, track_id, index) tuples
for display_name, track_id in tracks_to_plot.items():
    try:
        idx = all_bigwig_names.index(track_id)
        track_data_list.append((display_name, track_id, idx))
    except ValueError:
        print(f"Warning: Track '{track_id}' ({display_name}) not found in config. Skipping...")

print(f"Found {len(track_data_list)} tracks to save from tracks_to_plot")

# Get predictions (shape: (49152, 7362))
bigwig_logits = ntv3_predictions.bigwig_tracks_logits
if isinstance(bigwig_logits, torch.Tensor):
    bigwig_logits = bigwig_logits.detach().cpu().numpy()

# Calculate genomic coordinates for the center 37.5% region
# The predictions cover the center 37.5% of the input sequence
input_length = end - start
center_start_offset = int(input_length * 0.3125)  # (1 - 0.375) / 2 = 0.3125
center_length = int(input_length * 0.375)
center_start = start + center_start_offset
center_end = center_start + center_length

print(f"Input region: {chrom}:{start}-{end} (length: {input_length:,} bp)")
print(f"Prediction region: {chrom}:{center_start}-{center_end} (length: {center_length:,} bp)")
print(f"Number of positions: {bigwig_logits.shape[0]}")

# Create output directory
output_dir = "bigwig_outputs"
os.makedirs(output_dir, exist_ok=True)

# Save each track as a separate BigWig file
print(f"\nSaving BigWig files to '{output_dir}/' directory...")
for i, (display_name, track_id, track_idx) in enumerate(track_data_list):
    # Get track data (logits for this track)
    track_data = bigwig_logits[:, track_idx].astype(np.float32)
    
    # Create BigWig file using display name (key) for filename
    # Clean the display name for use as filename (replace spaces, special chars)
    track_clean_name = display_name.replace(" ", "_").replace("/", "_").replace("-", "_")
    bw_filename = os.path.join(output_dir, f"{track_clean_name}.bw")
    bw = pyBigWig.open(bw_filename, "w")
    
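    # Add header: a list of (chromosome, size) tuples.
    # `end` stands in for the true chromosome length here; it is large enough
    # because every entry written below ends before `end`.
    bw.addHeader([(chrom, end)])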
    
    # Add entries (intervals with values)
    # Each position in track_data corresponds to one base pair
    starts = np.arange(center_start, center_start + len(track_data), dtype=np.int64)
    ends = starts + 1
    values = track_data.tolist()
    
    bw.addEntries(
        chroms=[chrom] * len(starts),
        starts=starts.tolist(),
        ends=ends.tolist(),
        values=values
    )
    
    bw.close()
    
    print(f"  Saved {i + 1}/{len(track_data_list)}: {display_name} ({track_clean_name}.bw)")

print(f"\n✅ Successfully saved {len(track_data_list)} BigWig files to '{output_dir}/'")
print(f"   Files: {', '.join([name.replace(' ', '_').replace('/', '_').replace('-', '_') for name, _, _ in track_data_list])}")
Output:
Found 8 tracks to save from tracks_to_plot
Input region: chr19:6700000-6831072 (length: 131,072 bp)
Prediction region: chr19:6740960-6790112 (length: 49,152 bp)
Number of positions: 49152

Saving BigWig files to 'bigwig_outputs/' directory...
  Saved 1/8: K562 RNA-seq (K562_RNA_seq.bw)
  Saved 2/8: K562 DNAse (K562_DNAse.bw)
  Saved 3/8: K562 H3k4me3 (K562_H3k4me3.bw)
  Saved 4/8: K562 CTCF (K562_CTCF.bw)
  Saved 5/8: HepG2 RNA-seq (HepG2_RNA_seq.bw)
  Saved 6/8: HepG2 DNAse (HepG2_DNAse.bw)
  Saved 7/8: HepG2 H3k4me3 (HepG2_H3k4me3.bw)
  Saved 8/8: HepG2 CTCF (HepG2_CTCF.bw)

✅ Successfully saved 8 BigWig files to 'bigwig_outputs/'
   Files: K562_RNA_seq, K562_DNAse, K562_H3k4me3, K562_CTCF, HepG2_RNA_seq, HepG2_DNAse, HepG2_H3k4me3, HepG2_CTCF

This saves each selected functional track as a separate BigWig file that can be visualized in genome browsers. The files are saved with user-friendly display names (e.g., "K562_RNA_seq.bw").
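
The coordinate arithmetic in this step (predictions covering the center 37.5% of the input) can be isolated in a small helper. The sketch below is one possible formulation, not the pipeline's own code: `center_window` and `index_to_position` are hypothetical names, and the offset is computed with integer arithmetic as half of the uncovered length rather than with the `0.3125` float factor used above. For the window in this notebook both approaches give the same coordinates.

```python
# Hypothetical helpers for illustration; not part of the NTv3 pipeline.

def center_window(start: int, end: int, num: int = 3, den: int = 8) -> tuple:
    """Return (center_start, center_end) for the central num/den of [start, end).

    The default 3/8 equals the 37.5% stated in this notebook.
    """
    length = end - start
    center_length = length * num // den      # 37.5% of the input length
    offset = (length - center_length) // 2   # flank on the left side
    center_start = start + offset
    return center_start, center_start + center_length


def index_to_position(index: int, center_start: int) -> int:
    """Map a row index of the prediction array to a genomic coordinate,
    assuming one prediction per base pair as in the cell above."""
    return center_start + index


center_start, center_end = center_window(6_700_000, 6_831_072)
print(center_start, center_end, center_end - center_start)
```

For this notebook's window the helper reproduces the prediction region printed above (chr19:6740960-6790112, 49,152 bp).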

5) 🌐 Create an IGV Browser

import igv_notebook

igv_notebook.init()

# Build tracks array with all BigWig files we saved
tracks = []
for track_display_name, track_id in tracks_to_plot.items():
    # Clean the display name to match the filename we saved
    track_clean_name = track_display_name.replace(" ", "_").replace("/", "_").replace("-", "_")
    bigwig_path = os.path.join(output_dir, f"{track_clean_name}.bw")
    bigwig_track = {
        "name": track_display_name,
        "format": "bigwig",
        "url": bigwig_path,
        "height": 70,
        "autoscale": True,
        "displayMode": "EXPANDED",
    }
    tracks.append(bigwig_track)

config = {
    "genome": assembly,
    "locus": f"{chrom}:{center_start}-{center_end}",
    "tracks": tracks,
    "theme": "dark",
}

browser = igv_notebook.Browser(config)
browser  # <- just return the object, no .show()

This creates an interactive IGV browser visualization with a dark theme showing all the predicted functional tracks. The BigWig files can also be visualized in any genome browser.
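
Steps 4 and 5 both derive the BigWig filename from the display name with the same chain of `replace` calls. A shared helper keeps the two in sync, and the track list can skip files that were never written (for example, tracks not found in the config). This is a minimal sketch: `clean_track_name`, `build_igv_tracks`, and the `require_file` flag are hypothetical names, while the track dictionary fields are copied from the cell above.

```python
import os


def clean_track_name(display_name: str) -> str:
    """Replace spaces, slashes, and hyphens with underscores, as in steps 4 and 5."""
    cleaned = display_name
    for char in (" ", "/", "-"):
        cleaned = cleaned.replace(char, "_")
    return cleaned


def build_igv_tracks(display_names, output_dir: str, require_file: bool = True) -> list:
    """Build IGV track dicts, optionally skipping BigWig files that do not exist."""
    tracks = []
    for name in display_names:
        path = os.path.join(output_dir, f"{clean_track_name(name)}.bw")
        if require_file and not os.path.exists(path):
            continue  # this track was not saved in step 4
        tracks.append({
            "name": name,
            "format": "bigwig",
            "url": path,
            "height": 70,
            "autoscale": True,
            "displayMode": "EXPANDED",
        })
    return tracks


print(clean_track_name("K562 RNA-seq"))
```

In the notebook, `build_igv_tracks(tracks_to_plot.keys(), output_dir)` would take the place of the loop that fills `tracks`.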

📓 Full Notebook

To view and run the complete notebook interactively: