| <div class="summary"> | |
| <h2>🧬 NTv3 Post-Trained Functional Track Prediction</h2> | |
| <p>This notebook demonstrates how to use the NTv3 post-trained model to predict functional tracks and genome annotation directly from a DNA sequence.</p> | |
| <p>The pipeline handles the underlying steps for you: pre-processing the input, running inference with the model, and plotting the predictions for each track.</p> | |
| <p>If you're interested in exploring the intermediate probabilities, please refer to the <a href="https://huggingface.co/spaces/InstaDeepAI/ntv3/blob/main/notebooks_tutorials/01_tracks_prediction.ipynb" target="_blank" rel="noopener noreferrer">track-prediction notebook</a>.</p> | |
| <p> | |
| <strong>🔗 Quick links:</strong><br> | |
| • <a href="https://huggingface.co/spaces/InstaDeepAI/ntv3/blob/main/notebooks_pipelines/01_functional_track_prediction.ipynb" target="_blank" rel="noopener noreferrer">View notebook on Hugging Face</a><br> | |
| • <a href="https://colab.research.google.com/github/InstaDeepAI/ntv3/blob/main/notebooks_pipelines/01_functional_track_prediction.ipynb" target="_blank" rel="noopener noreferrer">Open directly in Google Colab</a> | |
| </p> | |
| </div> | |
| <div class="grid"> | |
| <div class="card" style="grid-column: span 12;"> | |
| <h2>0) 📦 Imports + setup</h2> | |
| <p>Install dependencies:</p> | |
| <div class="code"><pre><code class="language-bash">pip -q install "transformers>=4.55" "huggingface_hub>=0.23" safetensors torch pyfaidx requests seaborn matplotlib igv_notebook pyBigWig</code></pre></div> | |
| <p style="margin-top: 20px;">Import required libraries:</p> | |
| <div class="code"><pre><code class="language-python">import re | |
| import time | |
| import os | |
| import torch | |
| import requests | |
| import numpy as np | |
| import pyBigWig | |
| from transformers import pipeline, AutoConfig</code></pre></div> | |
| </div> | |
| <div class="card" style="grid-column: span 12;"> | |
| <h2>1) 📦 Configuration</h2> | |
| <p>Set your NTv3 model and genomic window here:</p> | |
| <div class="code"><pre><code class="language-python"># Define the model and genomic window | |
| model_name = "InstaDeepAI/NTv3_650M_pos" | |
| species = "human" # used to condition the model on species | |
| assembly = "hg38" # used to fetch the chromosome sequence | |
| chrom = "chr19" | |
| start = 6_700_000 | |
| end = 6_831_072</code></pre></div> | |
| </div> | |
| <div class="card" style="grid-column: span 12;"> | |
| <h2>2) 📥 Fetch chromosome sequence for the chosen window</h2> | |
| <div class="code"><pre><code class="language-python"># Get the sequence from the UCSC API | |
| url = f"https://api.genome.ucsc.edu/getData/sequence?genome={assembly};chrom={chrom};start={start};end={end}" | |
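| response = requests.get(url, timeout=60) | |
| response.raise_for_status()  # fail early on HTTP errors instead of on a missing "dna" key | |
| seq = response.json()["dna"].upper() | |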
| print(f"Original sequence length: {len(seq)}") | |
| # Crop to multiple of 128 (the pipeline will crop again, but this is a no-op once divisible) | |
| seq = seq[:int(len(seq) // 128) * 128] | |
| print(f"Cropped sequence length: {len(seq)}, {len(seq) / 128} transformer tokens")</code></pre></div> | |
| <div style="margin-top: 15px; padding: 12px 16px; background: rgba(0, 0, 0, 0.4); border: 1px solid var(--border); border-radius: 8px; font-family: var(--mono); font-size: 12px; color: rgba(255, 255, 255, 0.85); line-height: 1.6;"> | |
| <strong style="color: var(--muted);">Output:</strong><br> | |
| Original sequence length: 131072<br> | |
| Cropped sequence length: 131072, 1024.0 transformer tokens | |
| </div> | |
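| <p style="margin-top: 15px;">The cropping step above can be factored into a small helper. The sketch below is illustrative only: <code>crop_to_multiple</code> is a hypothetical name (it is not part of the NTv3 pipeline), and the factor of 128 is taken from the comment in the cell above.</p> | |

```python
def crop_to_multiple(seq: str, multiple: int = 128) -> str:
    """Trim the end of `seq` so that its length is divisible by `multiple`."""
    if multiple <= 0:
        raise ValueError("multiple must be a positive integer")
    usable_length = (len(seq) // multiple) * multiple
    return seq[:usable_length]


# A 131,072 bp window is already divisible by 128, so nothing is removed.
window = "A" * 131_072
cropped = crop_to_multiple(window)
print(len(cropped), len(cropped) // 128)  # 131072 1024

# A 300 bp sequence loses its last 44 bases (300 = 2 * 128 + 44).
print(len(crop_to_multiple("ACGT" * 75)))  # 256
```

| <p style="margin-top: 15px;">Using integer division for the token count avoids the trailing <code>.0</code> that appears when dividing with <code>/</code>.</p> | |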
| </div> | |
| <div class="card" style="grid-column: span 12;"> | |
| <h2>3) ⚡ Functional track prediction pipeline (pre-processing, inference, plotting)</h2> | |
| <div class="code"><pre><code class="language-python"># Build NTv3 tracks pipeline | |
| ntv3_tracks = pipeline( | |
|     "ntv3-tracks", | |
|     model=model_name, | |
|     trust_remote_code=True, | |
|     device=0 if torch.cuda.is_available() else -1, | |
| ) | |
| # Select tracks to plot | |
| tracks_to_plot = { | |
|     "K562 RNA-seq": "ENCSR056HPM", | |
|     "K562 DNAse": "ENCSR921NMD", | |
|     "K562 H3k4me3": "ENCSR000DWD", | |
|     "K562 CTCF": "ENCSR000AKO", | |
|     "HepG2 RNA-seq": "ENCSR561FEE_P", | |
|     "HepG2 DNAse": "ENCSR000EJV", | |
|     "HepG2 H3k4me3": "ENCSR000AMP", | |
|     "HepG2 CTCF": "ENCSR000BIE", | |
| } | |
| elements_to_plot = ["protein_coding_gene", "exon", "intron", "splice_donor", "splice_acceptor"] | |
| # Run pipeline: DNA -> NTv3 -> Tracks -> plot | |
| start_time = time.time() | |
| ntv3_predictions = ntv3_tracks( | |
|     # Reuse the window defined in the configuration cell | |
|     {"chrom": chrom, "start": start, "end": end, "species": species}, | |
|     plot=True, | |
|     tracks_to_plot=tracks_to_plot, | |
|     elements_to_plot=elements_to_plot, | |
| ) | |
| end_time = time.time() | |
| print(f"Inference + decoding time: {end_time - start_time:.2f} seconds")</code></pre></div> | |
| <div style="margin-top: 15px; padding: 12px 16px; background: rgba(0, 0, 0, 0.4); border: 1px solid var(--border); border-radius: 8px; font-family: var(--mono); font-size: 12px; color: rgba(255, 255, 255, 0.85); line-height: 1.6;"> | |
| <strong style="color: var(--muted);">Output:</strong><br> | |
| Device set to use cpu<br> | |
| Running on device: cpu<br> | |
| Inference + decoding time: 38.32 seconds | |
| </div> | |
| <div style="margin-top: 20px;"> | |
| <img src="assets/output_tracks.png" alt="Output tracks plot" style="width: 100%; height: auto; border-radius: 12px; border: 1px solid var(--border);" /> | |
| </div> | |
| <p style="margin-top: 15px; color: var(--muted); font-size: 13px;"> | |
| The pipeline performs all the necessary steps: running inference with the model and plotting the predictions for the specified tracks and genomic elements. | |
| </p> | |
| </div> | |
| <div class="card" style="grid-column: span 12;"> | |
| <h2>4) 📁 Save as BigWig file</h2> | |
| <div class="code"><pre><code class="language-python"># Load config to get track names and find indices for tracks_to_plot | |
| cfg = AutoConfig.from_pretrained(model_name, trust_remote_code=True) | |
| all_bigwig_names = cfg.bigwigs_per_file_assembly[assembly] | |
| # Find indices of tracks we want to save | |
| # Use display names (keys) for filenames, but track IDs (values) to find indices | |
| track_data_list = [] # List of (display_name, track_id, index) tuples | |
| for display_name, track_id in tracks_to_plot.items(): | |
|     try: | |
|         idx = all_bigwig_names.index(track_id) | |
|         track_data_list.append((display_name, track_id, idx)) | |
|     except ValueError: | |
|         print(f"Warning: Track '{track_id}' ({display_name}) not found in config. Skipping...") | |
| print(f"Found {len(track_data_list)} tracks to save from tracks_to_plot") | |
| # Get predictions (shape: (49152, 7362), i.e. positions x tracks) | |
| bigwig_logits = ntv3_predictions.bigwig_tracks_logits | |
| if isinstance(bigwig_logits, torch.Tensor): | |
| bigwig_logits = bigwig_logits.detach().cpu().numpy() | |
| # Calculate genomic coordinates for the center 37.5% region | |
| # The predictions cover the center 37.5% of the input sequence | |
| input_length = end - start | |
| center_start_offset = int(input_length * 0.3125) # (1 - 0.375) / 2 = 0.3125 | |
| center_length = int(input_length * 0.375) | |
| center_start = start + center_start_offset | |
| center_end = center_start + center_length | |
| print(f"Input region: {chrom}:{start}-{end} (length: {input_length:,} bp)") | |
| print(f"Prediction region: {chrom}:{center_start}-{center_end} (length: {center_length:,} bp)") | |
| print(f"Number of positions: {bigwig_logits.shape[0]}") | |
| # Create output directory | |
| output_dir = "bigwig_outputs" | |
| os.makedirs(output_dir, exist_ok=True) | |
| # Save each track as a separate BigWig file | |
| print(f"\nSaving BigWig files to '{output_dir}/' directory...") | |
| for i, (display_name, track_id, track_idx) in enumerate(track_data_list): | |
|     # Get track data (logits for this track) | |
|     track_data = bigwig_logits[:, track_idx].astype(np.float32) | |
|     # Create BigWig file using display name (key) for filename | |
|     # Clean the display name for use as filename (replace spaces, special chars) | |
|     track_clean_name = display_name.replace(" ", "_").replace("/", "_").replace("-", "_") | |
|     bw_filename = os.path.join(output_dir, f"{track_clean_name}.bw") | |
|     bw = pyBigWig.open(bw_filename, "w") | |
|     # Add header (chromosome and size) | |
|     # Note: `end` stands in for the chromosome length here; it only needs to | |
|     # be large enough to contain every entry written below | |
|     bw.addHeader([(chrom, end)]) | |
|     # Add entries (intervals with values) | |
|     # Each position in track_data corresponds to one base pair | |
|     starts = np.arange(center_start, center_start + len(track_data), dtype=np.int64) | |
|     ends = starts + 1 | |
|     values = track_data.tolist() | |
|     bw.addEntries( | |
|         chroms=[chrom] * len(starts), | |
|         starts=starts.tolist(), | |
|         ends=ends.tolist(), | |
|         values=values | |
|     ) | |
|     bw.close() | |
|     print(f" Saved {i + 1}/{len(track_data_list)}: {display_name} ({track_clean_name}.bw)") | |
| print(f"\n✅ Successfully saved {len(track_data_list)} BigWig files to '{output_dir}/'") | |
| print(f" Files: {', '.join([name.replace(' ', '_').replace('/', '_').replace('-', '_') for name, _, _ in track_data_list])}")</code></pre></div> | |
| <div style="margin-top: 15px; padding: 12px 16px; background: rgba(0, 0, 0, 0.4); border: 1px solid var(--border); border-radius: 8px; font-family: var(--mono); font-size: 12px; color: rgba(255, 255, 255, 0.85); line-height: 1.6; white-space: pre-wrap;"> | |
| <strong style="color: var(--muted);">Output:</strong><br>Found 8 tracks to save from tracks_to_plot | |
| Input region: chr19:6700000-6831072 (length: 131,072 bp) | |
| Prediction region: chr19:6740960-6790112 (length: 49,152 bp) | |
| Number of positions: 49152 | |
| Saving BigWig files to 'bigwig_outputs/' directory... | |
| Saved 1/8: K562 RNA-seq (K562_RNA_seq.bw) | |
| Saved 2/8: K562 DNAse (K562_DNAse.bw) | |
| Saved 3/8: K562 H3k4me3 (K562_H3k4me3.bw) | |
| Saved 4/8: K562 CTCF (K562_CTCF.bw) | |
| Saved 5/8: HepG2 RNA-seq (HepG2_RNA_seq.bw) | |
| Saved 6/8: HepG2 DNAse (HepG2_DNAse.bw) | |
| Saved 7/8: HepG2 H3k4me3 (HepG2_H3k4me3.bw) | |
| Saved 8/8: HepG2 CTCF (HepG2_CTCF.bw) | |
| ✅ Successfully saved 8 BigWig files to 'bigwig_outputs/' | |
| Files: K562_RNA_seq, K562_DNAse, K562_H3k4me3, K562_CTCF, HepG2_RNA_seq, HepG2_DNAse, HepG2_H3k4me3, HepG2_CTCF | |
| </div> | |
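| <p style="margin-top: 15px;">The prediction region printed above follows from simple arithmetic on the input window: the model output covers the center 37.5% (3/8) of the input, so 5/16 of the input is dropped on each side. The sketch below reproduces those coordinates; <code>prediction_window</code> is a hypothetical helper name, and exact fractions are used in place of floats purely to avoid rounding surprises.</p> | |

```python
from fractions import Fraction


def prediction_window(start: int, end: int, keep_fraction: Fraction = Fraction(3, 8)):
    """Return (center_start, center_end) for the centered sub-window that
    covers `keep_fraction` of the half-open interval [start, end)."""
    input_length = end - start
    center_length = int(input_length * keep_fraction)
    # Flank dropped on each side: (1 - keep_fraction) / 2 of the input length.
    left_flank = int(input_length * (1 - keep_fraction) / 2)
    center_start = start + left_flank
    return center_start, center_start + center_length


center_start, center_end = prediction_window(6_700_000, 6_831_072)
print(center_start, center_end, center_end - center_start)
# 6740960 6790112 49152
```

| <p style="margin-top: 15px;">For the 131,072 bp window used here, the flank is 40,960 bp and the predicted region is 49,152 bp, which matches the number of positions in the logits array.</p> | |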
| <p style="margin-top: 15px; color: var(--muted); font-size: 13px;"> | |
| This saves each selected functional track as a separate BigWig file that can be visualized in genome browsers. The files are saved with user-friendly display names (e.g., "K562_RNA_seq.bw"). | |
| </p> | |
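| <p style="margin-top: 15px;">The same display-name cleanup is repeated wherever a filename is built, so it may be convenient to keep it in one place. This is a minimal sketch under that assumption: <code>clean_track_name</code> and <code>bigwig_path</code> are hypothetical helpers, not functions provided by the notebook or by pyBigWig.</p> | |

```python
import os


def clean_track_name(display_name: str) -> str:
    """Replace spaces, slashes and hyphens with underscores, as in the cell above."""
    for char in (" ", "/", "-"):
        display_name = display_name.replace(char, "_")
    return display_name


def bigwig_path(display_name: str, output_dir: str = "bigwig_outputs") -> str:
    """Build the output path for a track from its display name."""
    return os.path.join(output_dir, f"{clean_track_name(display_name)}.bw")


print(clean_track_name("K562 RNA-seq"))  # K562_RNA_seq
print(os.path.basename(bigwig_path("HepG2 H3k4me3")))  # HepG2_H3k4me3.bw
```

| <p style="margin-top: 15px;">Building the path in a single helper keeps the names used when saving and when loading the files into a browser from drifting apart.</p> | |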
| </div> | |
| <div class="card" style="grid-column: span 12;"> | |
| <h2>5) 🌐 Create an IGV Browser</h2> | |
| <div class="code"><pre><code class="language-python">import igv_notebook | |
| igv_notebook.init() | |
| # Build tracks array with all BigWig files we saved | |
| tracks = [] | |
| for track_display_name, track_id in tracks_to_plot.items(): | |
|     # Clean the display name to match the filename we saved | |
|     track_clean_name = track_display_name.replace(" ", "_").replace("/", "_").replace("-", "_") | |
|     bigwig_path = os.path.join(output_dir, f"{track_clean_name}.bw") | |
|     bigwig_track = { | |
|         "name": track_display_name, | |
|         "format": "bigwig", | |
|         "url": bigwig_path, | |
|         "height": 70, | |
|         "autoscale": True, | |
|         "displayMode": "EXPANDED", | |
|     } | |
|     tracks.append(bigwig_track) | |
| config = { | |
|     "genome": assembly, | |
|     "locus": f"{chrom}:{center_start}-{center_end}", | |
|     "tracks": tracks, | |
|     "theme": "dark", | |
| } | |
| browser = igv_notebook.Browser(config) | |
| browser # <- just return the object, no .show()</code></pre></div> | |
| <p style="margin-top: 15px; color: var(--muted); font-size: 13px;"> | |
| This creates an interactive IGV browser view, using a dark theme, that shows all the predicted functional tracks over the prediction region. The BigWig files can also be loaded into any other genome browser that supports the BigWig format. | |
| </p> | |
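| <p style="margin-top: 15px;">If some tracks were skipped when saving (for example because a track ID was not found in the config), the browser configuration would point at files that do not exist. One possible safeguard, sketched below, is to build the track list only from files present on disk. <code>build_igv_tracks</code> is a hypothetical helper; the dictionary keys are the ones used in the cell above.</p> | |

```python
import os


def build_igv_tracks(display_names, output_dir="bigwig_outputs", height=70):
    """Return one track dict per display name whose BigWig file exists on disk."""
    tracks = []
    for name in display_names:
        clean = name.replace(" ", "_").replace("/", "_").replace("-", "_")
        path = os.path.join(output_dir, f"{clean}.bw")
        if not os.path.exists(path):
            # Skip tracks that were never written instead of failing in the browser
            print(f"Skipping '{name}': {path} not found")
            continue
        tracks.append(
            {
                "name": name,
                "format": "bigwig",
                "url": path,
                "height": height,
                "autoscale": True,
            }
        )
    return tracks


# With no files on disk, every track is skipped and the list is empty.
print(build_igv_tracks(["K562 CTCF"], output_dir="missing_dir_for_demo"))
```

| <p style="margin-top: 15px;">The returned list can be passed as the <code>"tracks"</code> entry of the browser configuration.</p> | |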
| </div> | |
| <div class="card" style="grid-column: span 12;"> | |
| <h2>📓 Full Notebook</h2> | |
| <p>To view and run the complete notebook interactively:</p> | |
| <ul> | |
| <li><a href="https://huggingface.co/spaces/InstaDeepAI/ntv3/blob/main/notebooks_pipelines/01_functional_track_prediction.ipynb" target="_blank" rel="noopener noreferrer">View notebook on Hugging Face</a></li> | |
| <li>Download and run in Jupyter, Google Colab, or any notebook environment</li> | |
| </ul> | |
| </div> | |
| </div> | |