<div class="summary">
<h2>🧬 NTv3 Post-Trained Functional Track Prediction</h2>
<p>This notebook demonstrates how to use the NTv3 post-trained model to predict functional tracks and genome annotation directly from a DNA sequence.</p>
<p>The pipeline abstracts away the underlying steps: pre-processing the input, running inference with the model, and plotting the predictions for each track.</p>
<p>If you're interested in exploring the intermediate probabilities, please refer to the <a href="https://huggingface.co/spaces/InstaDeepAI/ntv3/blob/main/notebooks_tutorials/01_tracks_prediction.ipynb" target="_blank" rel="noopener noreferrer">track-prediction notebook</a>.</p>
<p>
<strong>🔗 Quick links:</strong><br>
• <a href="https://huggingface.co/spaces/InstaDeepAI/ntv3/blob/main/notebooks_pipelines/01_functional_track_prediction.ipynb" target="_blank" rel="noopener noreferrer">View notebook on Hugging Face</a><br>
• <a href="https://colab.research.google.com/github/InstaDeepAI/ntv3/blob/main/notebooks_pipelines/01_functional_track_prediction.ipynb" target="_blank" rel="noopener noreferrer">Open directly in Google Colab</a>
</p>
</div>
<div class="grid">
<div class="card" style="grid-column: span 12;">
<h2>0) 📦 Imports + setup</h2>
<p>Install dependencies:</p>
<div class="code"><pre><code class="language-bash">pip -q install "transformers>=4.55" "huggingface_hub>=0.23" safetensors torch pyfaidx requests seaborn matplotlib igv_notebook pyBigWig</code></pre></div>
<p style="margin-top: 20px;">Import required libraries:</p>
<div class="code"><pre><code class="language-python">import re
import time
import os
import torch
import requests
import numpy as np
import pyBigWig
from transformers import pipeline, AutoConfig</code></pre></div>
</div>
<div class="card" style="grid-column: span 12;">
<h2>1) 📦 Configuration</h2>
<p>Set your NTv3 model and genomic window here:</p>
<div class="code"><pre><code class="language-python"># Define the model and genomic window
model_name = "InstaDeepAI/NTv3_650M_pos"
species = "human" # used to condition the model on species
assembly = "hg38" # used to fetch the chromosome sequence
chrom = "chr19"
start = 6_700_000
end = 6_831_072</code></pre></div>
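<p style="margin-top: 20px;">The window above spans 131,072 bp, which is a multiple of the 128 bp the pipeline crops to. To target another locus, a small helper can derive start and end from a center coordinate. This is a minimal sketch: make_window is a hypothetical name, not part of the NTv3 code, and the default length is simply the one used in this example.</p>
<div class="code"><pre><code class="language-python">def make_window(center: int, length: int = 131_072, multiple: int = 128) -> tuple[int, int]:
    """Return (start, end) for a window of `length` bp centred on `center`."""
    # Hypothetical helper: reject lengths the pipeline would have to crop
    if length % multiple != 0:
        raise ValueError(f"length must be a multiple of {multiple}, got {length}")
    # Clamp at 0 so windows near the chromosome start stay valid
    start = max(0, center - length // 2)
    end = start + length
    return start, end


start, end = make_window(center=6_765_536)
print(start, end)  # 6700000 6831072</code></pre></div>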
</div>
<div class="card" style="grid-column: span 12;">
<h2>2) 📥 Fetch chromosome sequence for the chosen window</h2>
<div class="code"><pre><code class="language-python"># Get the sequence from the UCSC API
url = f"https://api.genome.ucsc.edu/getData/sequence?genome={assembly};chrom={chrom};start={start};end={end}"
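response = requests.get(url, timeout=30)
response.raise_for_status() # fail early on HTTP errors instead of a KeyError below
seq = response.json()["dna"].upper()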
print(f"Original sequence length: {len(seq)}")
# Crop to multiple of 128 (the pipeline will crop again, but this is a no-op once divisible)
seq = seq[: (len(seq) // 128) * 128]
print(f"Cropped sequence length: {len(seq)}, {len(seq) / 128} transformer tokens")</code></pre></div>
<div style="margin-top: 15px; padding: 12px 16px; background: rgba(0, 0, 0, 0.4); border: 1px solid var(--border); border-radius: 8px; font-family: var(--mono); font-size: 12px; color: rgba(255, 255, 255, 0.85); line-height: 1.6;">
<strong style="color: var(--muted);">Output:</strong><br>
Original sequence length: 131072<br>
Cropped sequence length: 131072, 1024.0 transformer tokens
</div>
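<p style="margin-top: 20px;">In the example above the crop is a no-op because 131,072 is already divisible by 128. The sketch below shows the same step on a toy sequence whose length is not a multiple of 128. crop_to_multiple is a hypothetical helper name introduced only for illustration.</p>
<div class="code"><pre><code class="language-python">def crop_to_multiple(seq: str, multiple: int = 128) -> str:
    """Drop trailing bases so that len(seq) is divisible by `multiple`."""
    usable = (len(seq) // multiple) * multiple
    return seq[:usable]


toy = "ACGT" * 100  # 400 bp, not divisible by 128
cropped = crop_to_multiple(toy)
# Integer division gives a whole number of 128 bp tokens
print(len(toy), len(cropped), len(cropped) // 128)  # 400 384 3</code></pre></div>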
</div>
<div class="card" style="grid-column: span 12;">
<h2>3) ⚡ Functional track prediction pipeline (pre-processing, inference, plotting)</h2>
<div class="code"><pre><code class="language-python"># Build NTv3 tracks pipeline
ntv3_tracks = pipeline(
    "ntv3-tracks",
    model=model_name,
    trust_remote_code=True,
    device=0 if torch.cuda.is_available() else -1,
)
# Select tracks to plot
tracks_to_plot = {
    "K562 RNA-seq": "ENCSR056HPM",
    "K562 DNAse": "ENCSR921NMD",
    "K562 H3k4me3": "ENCSR000DWD",
    "K562 CTCF": "ENCSR000AKO",
    "HepG2 RNA-seq": "ENCSR561FEE_P",
    "HepG2 DNAse": "ENCSR000EJV",
    "HepG2 H3k4me3": "ENCSR000AMP",
    "HepG2 CTCF": "ENCSR000BIE",
}
elements_to_plot = ["protein_coding_gene", "exon", "intron", "splice_donor", "splice_acceptor"]
# Run pipeline: DNA -> NTv3 -> Tracks -> plot
start_time = time.time()
ntv3_predictions = ntv3_tracks(
    # Reuse the window defined in the configuration step
    {"chrom": chrom, "start": start, "end": end, "species": species},
    plot=True,
    tracks_to_plot=tracks_to_plot,
    elements_to_plot=elements_to_plot,
)
end_time = time.time()
print(f"Inference + decoding time: {end_time - start_time:.2f} seconds")</code></pre></div>
<div style="margin-top: 15px; padding: 12px 16px; background: rgba(0, 0, 0, 0.4); border: 1px solid var(--border); border-radius: 8px; font-family: var(--mono); font-size: 12px; color: rgba(255, 255, 255, 0.85); line-height: 1.6;">
<strong style="color: var(--muted);">Output:</strong><br>
Device set to use cpu<br>
Running on device: cpu<br>
Inference + decoding time: 38.32 seconds
</div>
<div style="margin-top: 20px;">
<img src="assets/output_tracks.png" alt="Output tracks plot" style="width: 100%; height: auto; border-radius: 12px; border: 1px solid var(--border);" />
</div>
<p style="margin-top: 15px; color: var(--muted); font-size: 13px;">
The pipeline performs all the necessary steps: running inference with the model and plotting the predictions for the specified tracks and genomic elements.
</p>
</div>
<div class="card" style="grid-column: span 12;">
<h2>4) 📁 Save as BigWig file</h2>
<div class="code"><pre><code class="language-python"># Load config to get track names and find indices for tracks_to_plot
cfg = AutoConfig.from_pretrained(model_name, trust_remote_code=True)
all_bigwig_names = cfg.bigwigs_per_file_assembly[assembly]
# Find indices of tracks we want to save
# Use display names (keys) for filenames, but track IDs (values) to find indices
track_data_list = [] # List of (display_name, track_id, index) tuples
for display_name, track_id in tracks_to_plot.items():
    try:
        idx = all_bigwig_names.index(track_id)
        track_data_list.append((display_name, track_id, idx))
    except ValueError:
        print(f"Warning: Track '{track_id}' ({display_name}) not found in config. Skipping...")
print(f"Found {len(track_data_list)} tracks to save from tracks_to_plot")
# Get predictions (shape: (49152, 7362))
bigwig_logits = ntv3_predictions.bigwig_tracks_logits
if isinstance(bigwig_logits, torch.Tensor):
    bigwig_logits = bigwig_logits.detach().cpu().numpy()
# Calculate genomic coordinates for the center 37.5% region
# The predictions cover the center 37.5% of the input sequence
input_length = end - start
center_start_offset = int(input_length * 0.3125) # (1 - 0.375) / 2 = 0.3125
center_length = int(input_length * 0.375)
center_start = start + center_start_offset
center_end = center_start + center_length
print(f"Input region: {chrom}:{start}-{end} (length: {input_length:,} bp)")
print(f"Prediction region: {chrom}:{center_start}-{center_end} (length: {center_length:,} bp)")
print(f"Number of positions: {bigwig_logits.shape[0]}")
# Create output directory
output_dir = "bigwig_outputs"
os.makedirs(output_dir, exist_ok=True)
# Save each track as a separate BigWig file
print(f"\nSaving BigWig files to '{output_dir}/' directory...")
for i, (display_name, track_id, track_idx) in enumerate(track_data_list):
    # Get track data (logits for this track)
    track_data = bigwig_logits[:, track_idx].astype(np.float32)
    # Create BigWig file using display name (key) for filename
    # Clean the display name for use as filename (replace spaces, special chars)
    track_clean_name = display_name.replace(" ", "_").replace("/", "_").replace("-", "_")
    bw_filename = os.path.join(output_dir, f"{track_clean_name}.bw")
    bw = pyBigWig.open(bw_filename, "w")
    # Add header (chromosome name and size); `end` stands in for the real
    # chromosome length, which suffices here because every entry lies below it
    bw.addHeader([(chrom, end)])
    # Add entries (intervals with values)
    # Each position in track_data corresponds to one base pair
    starts = np.arange(center_start, center_start + len(track_data), dtype=np.int64)
    ends = starts + 1
    values = track_data.tolist()
    bw.addEntries(
        chroms=[chrom] * len(starts),
        starts=starts.tolist(),
        ends=ends.tolist(),
        values=values
    )
    bw.close()
    print(f" Saved {i + 1}/{len(track_data_list)}: {display_name} ({track_clean_name}.bw)")
print(f"\n✅ Successfully saved {len(track_data_list)} BigWig files to '{output_dir}/'")
print(f" Files: {', '.join([name.replace(' ', '_').replace('/', '_').replace('-', '_') for name, _, _ in track_data_list])}")</code></pre></div>
<div style="margin-top: 15px; padding: 12px 16px; background: rgba(0, 0, 0, 0.4); border: 1px solid var(--border); border-radius: 8px; font-family: var(--mono); font-size: 12px; color: rgba(255, 255, 255, 0.85); line-height: 1.6; white-space: pre-wrap;">
<strong style="color: var(--muted);">Output:</strong><br>Found 8 tracks to save from tracks_to_plot
Input region: chr19:6700000-6831072 (length: 131,072 bp)
Prediction region: chr19:6740960-6790112 (length: 49,152 bp)
Number of positions: 49152
Saving BigWig files to 'bigwig_outputs/' directory...
Saved 1/8: K562 RNA-seq (K562_RNA_seq.bw)
Saved 2/8: K562 DNAse (K562_DNAse.bw)
Saved 3/8: K562 H3k4me3 (K562_H3k4me3.bw)
Saved 4/8: K562 CTCF (K562_CTCF.bw)
Saved 5/8: HepG2 RNA-seq (HepG2_RNA_seq.bw)
Saved 6/8: HepG2 DNAse (HepG2_DNAse.bw)
Saved 7/8: HepG2 H3k4me3 (HepG2_H3k4me3.bw)
Saved 8/8: HepG2 CTCF (HepG2_CTCF.bw)
✅ Successfully saved 8 BigWig files to 'bigwig_outputs/'
Files: K562_RNA_seq, K562_DNAse, K562_H3k4me3, K562_CTCF, HepG2_RNA_seq, HepG2_DNAse, HepG2_H3k4me3, HepG2_CTCF
</div>
<p style="margin-top: 15px; color: var(--muted); font-size: 13px;">
This saves each selected functional track as a separate BigWig file that can be visualized in genome browsers. The files are saved with user-friendly display names (e.g., "K562_RNA_seq.bw").
</p>
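<p style="margin-top: 20px;">The coordinate arithmetic in the code above (predictions cover the central 37.5% of the input window) can be isolated into a small function and checked against the printed output. This is a sketch under that stated assumption; prediction_region and keep_fraction are hypothetical names, not part of the NTv3 API.</p>
<div class="code"><pre><code class="language-python">def prediction_region(start: int, end: int, keep_fraction: float = 0.375) -> tuple[int, int]:
    """Return (center_start, center_end) of the central `keep_fraction` of [start, end)."""
    input_length = end - start
    # Equal margins are trimmed from both sides: (1 - 0.375) / 2 = 0.3125 of the input each
    offset = int(input_length * (1 - keep_fraction) / 2)
    length = int(input_length * keep_fraction)
    center_start = start + offset
    return center_start, center_start + length


# Same window as in the configuration step; matches the "Prediction region" line above
print(prediction_region(6_700_000, 6_831_072))  # (6740960, 6790112)</code></pre></div>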
</div>
<div class="card" style="grid-column: span 12;">
<h2>5) 🌐 Create an IGV Browser</h2>
<div class="code"><pre><code class="language-python">import igv_notebook
igv_notebook.init()
# Build tracks array with all BigWig files we saved
tracks = []
for track_display_name, track_id in tracks_to_plot.items():
    # Clean the display name to match the filename we saved
    track_clean_name = track_display_name.replace(" ", "_").replace("/", "_").replace("-", "_")
    bigwig_path = os.path.join(output_dir, f"{track_clean_name}.bw")
    bigwig_track = {
        "name": track_display_name,
        "format": "bigwig",
        "url": bigwig_path,
        "height": 70,
        "autoscale": True,
        "displayMode": "EXPANDED",
    }
    tracks.append(bigwig_track)
config = {
    "genome": assembly,
    "locus": f"{chrom}:{center_start}-{center_end}",
    "tracks": tracks,
    "theme": "dark",
}
browser = igv_notebook.Browser(config)
browser # <- just return the object, no .show()</code></pre></div>
<p style="margin-top: 15px; color: var(--muted); font-size: 13px;">
This creates an interactive IGV browser with a dark theme, centered on the prediction region and showing all the predicted functional tracks. The BigWig files can also be loaded into any other genome browser that supports the BigWig format.
</p>
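<p style="margin-top: 20px;">The IGV track list must reproduce exactly the filenames written in step 4, and the same chain of replace calls appears in both places. One way to keep them in sync is a shared helper, sketched below. clean_track_name and bigwig_path are hypothetical names introduced for illustration, and the default directory is the one used in step 4.</p>
<div class="code"><pre><code class="language-python">import os


def clean_track_name(display_name: str) -> str:
    """Turn a display name into the filename stem used in step 4."""
    return display_name.replace(" ", "_").replace("/", "_").replace("-", "_")


def bigwig_path(display_name: str, output_dir: str = "bigwig_outputs") -> str:
    """Path of the BigWig file written for `display_name`."""
    return os.path.join(output_dir, f"{clean_track_name(display_name)}.bw")


print(clean_track_name("K562 RNA-seq"))  # K562_RNA_seq</code></pre></div>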
</div>
<div class="card" style="grid-column: span 12;">
<h2>📓 Full Notebook</h2>
<p>To view and run the complete notebook interactively:</p>
<ul>
<li><a href="https://huggingface.co/spaces/InstaDeepAI/ntv3/blob/main/notebooks_pipelines/01_functional_track_prediction.ipynb" target="_blank" rel="noopener noreferrer">View notebook on Hugging Face</a></li>
<li>Download and run in Jupyter, Google Colab, or any notebook environment</li>
</ul>
</div>
</div>