<div class="summary">
  <h2>🧬 NTv3 Post-Trained Functional Track Prediction</h2>
  <p>This notebook demonstrates how to use the NTv3 post-trained model to predict functional tracks and genome annotations directly from a DNA sequence.</p>
  <p>The pipeline abstracts away the underlying steps: pre-processing the input, running inference with the model, and plotting the predictions for each track.</p>
  <p>If you're interested in exploring the intermediate probabilities, please refer to the <a href="https://huggingface.co/spaces/InstaDeepAI/ntv3/blob/main/notebooks_tutorials/01_tracks_prediction.ipynb" target="_blank" rel="noopener noreferrer">track-prediction notebook</a>.</p>
  <p>
    <strong>🔗 Quick links:</strong><br><a href="https://huggingface.co/spaces/InstaDeepAI/ntv3/blob/main/notebooks_pipelines/01_functional_track_prediction.ipynb" target="_blank" rel="noopener noreferrer">View notebook on Hugging Face</a><br><a href="https://colab.research.google.com/github/InstaDeepAI/ntv3/blob/main/notebooks_pipelines/01_functional_track_prediction.ipynb" target="_blank" rel="noopener noreferrer">Open directly in Google Colab</a>
  </p>
</div>

<div class="grid">
  <div class="card" style="grid-column: span 12;">
    <h2>0) 📦 Imports + setup</h2>
    <p>Install dependencies:</p>
    <div class="code"><pre><code class="language-bash">pip -q install "transformers>=4.55" "huggingface_hub>=0.23" safetensors torch pyfaidx requests seaborn matplotlib igv_notebook pyBigWig</code></pre></div>
    
    <p style="margin-top: 20px;">Import required libraries:</p>
    <div class="code"><pre><code class="language-python">import re
import time
import os
import torch
import requests
import numpy as np
import pyBigWig
from transformers import pipeline, AutoConfig</code></pre></div>
  </div>

  <div class="card" style="grid-column: span 12;">
    <h2>1) 📦 Configuration</h2>
    <p>Set your NTv3 model and genomic window here:</p>
    <div class="code"><pre><code class="language-python"># Define the model and genomic window
model_name = "InstaDeepAI/NTv3_650M_pos"

species = "human"  # used to condition the model on the species
assembly = "hg38"  # used to fetch the chromosome sequence

chrom = "chr19"
start = 6_700_000
end = 6_831_072</code></pre></div>
  </div>

  <div class="card" style="grid-column: span 12;">
    <h2>2) 📥 Fetch chromosome sequence for the chosen window</h2>
    <div class="code"><pre><code class="language-python"># Get the sequence from the UCSC API
url = f"https://api.genome.ucsc.edu/getData/sequence?genome={assembly};chrom={chrom};start={start};end={end}"
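response = requests.get(url, timeout=60)
response.raise_for_status()  # Fail early on HTTP errors instead of a confusing KeyError
seq = response.json()["dna"].upper()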
print(f"Original sequence length: {len(seq)}")

# Crop to a multiple of 128 (the pipeline crops again, but that is a no-op once the length is divisible)
seq = seq[:(len(seq) // 128) * 128]
print(f"Cropped sequence length: {len(seq)}, {len(seq) / 128} transformer tokens")</code></pre></div>
    <div style="margin-top: 15px; padding: 12px 16px; background: rgba(0, 0, 0, 0.4); border: 1px solid var(--border); border-radius: 8px; font-family: var(--mono); font-size: 12px; color: rgba(255, 255, 255, 0.85); line-height: 1.6;">
      <strong style="color: var(--muted);">Output:</strong><br>
      Original sequence length: 131072<br>
      Cropped sequence length: 131072, 1024.0 transformer tokens
    </div>
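    <p style="margin-top: 15px;">The cropping step can be isolated into a small helper. The sketch below is illustrative only: <code>crop_to_multiple</code> is a hypothetical name (it is not part of the NTv3 pipeline), and the toy sequence stands in for the DNA fetched above.</p>

```python
def crop_to_multiple(seq: str, multiple: int = 128) -> str:
    """Trim the end of `seq` so that its length is divisible by `multiple`."""
    usable = (len(seq) // multiple) * multiple
    return seq[:usable]


# Toy 160 bp sequence standing in for the fetched window (assumption for illustration)
window = "ACGT" * 40
cropped = crop_to_multiple(window)

# Integer division gives the number of 128 bp tokens without a trailing ".0"
print(f"Cropped length: {len(cropped)}, tokens: {len(cropped) // 128}")
```

    <p>For the 131,072 bp window used in this notebook the crop changes nothing, since 131,072 = 1,024 × 128.</p>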
  </div>

  <div class="card" style="grid-column: span 12;">
    <h2>3) ⚡ Functional track prediction pipeline (pre-processing, inference, plotting)</h2>
    <div class="code"><pre><code class="language-python"># Build NTv3 tracks pipeline
ntv3_tracks = pipeline(
    "ntv3-tracks",
    model=model_name,
    trust_remote_code=True,
    device=0 if torch.cuda.is_available() else -1,
)

# Select tracks to plot
tracks_to_plot = {
    "K562 RNA-seq": "ENCSR056HPM",
    "K562 DNAse": "ENCSR921NMD",
    "K562 H3k4me3": "ENCSR000DWD",
    "K562 CTCF": "ENCSR000AKO",
    "HepG2 RNA-seq": "ENCSR561FEE_P",
    "HepG2 DNAse": "ENCSR000EJV",
    "HepG2 H3k4me3": "ENCSR000AMP",
    "HepG2 CTCF": "ENCSR000BIE",
}
elements_to_plot = ["protein_coding_gene", "exon", "intron", "splice_donor", "splice_acceptor"]

# Run pipeline: DNA -> NTv3 -> Tracks -> plot
start_time = time.time()

ntv3_predictions = ntv3_tracks(
    {"chrom": chrom, "start": start, "end": end, "species": species},
    plot=True,
    tracks_to_plot=tracks_to_plot,
    elements_to_plot=elements_to_plot,
)

end_time = time.time()

print(f"Inference + decoding time: {end_time - start_time:.2f} seconds")</code></pre></div>
    <div style="margin-top: 15px; padding: 12px 16px; background: rgba(0, 0, 0, 0.4); border: 1px solid var(--border); border-radius: 8px; font-family: var(--mono); font-size: 12px; color: rgba(255, 255, 255, 0.85); line-height: 1.6;">
      <strong style="color: var(--muted);">Output:</strong><br>
      Device set to use cpu<br>
      Running on device: cpu<br>
      Inference + decoding time: 38.32 seconds
    </div>
    <div style="margin-top: 20px;">
      <img src="assets/output_tracks.png" alt="Output tracks plot" style="width: 100%; height: auto; border-radius: 12px; border: 1px solid var(--border);" />
    </div>
    <p style="margin-top: 15px; color: var(--muted); font-size: 13px;">
      The pipeline performs all the necessary steps: running inference with the model and plotting the predictions for the specified tracks and genomic elements.
    </p>
  </div>

  <div class="card" style="grid-column: span 12;">
    <h2>4) 📁 Save as BigWig file</h2>
    <div class="code"><pre><code class="language-python"># Load config to get track names and find indices for tracks_to_plot
cfg = AutoConfig.from_pretrained(model_name, trust_remote_code=True)
all_bigwig_names = cfg.bigwigs_per_file_assembly[assembly]

# Find indices of tracks we want to save
# Use display names (keys) for filenames, but track IDs (values) to find indices
track_data_list = []  # List of (display_name, track_id, index) tuples
for display_name, track_id in tracks_to_plot.items():
    try:
        idx = all_bigwig_names.index(track_id)
        track_data_list.append((display_name, track_id, idx))
    except ValueError:
        print(f"Warning: Track '{track_id}' ({display_name}) not found in config. Skipping...")

print(f"Found {len(track_data_list)} tracks to save from tracks_to_plot")

# Get predictions (shape: (49152, 7362))
bigwig_logits = ntv3_predictions.bigwig_tracks_logits
if isinstance(bigwig_logits, torch.Tensor):
    bigwig_logits = bigwig_logits.detach().cpu().numpy()

# Calculate genomic coordinates for the center 37.5% region
# The predictions cover the center 37.5% of the input sequence
input_length = end - start
center_start_offset = int(input_length * 0.3125)  # (1 - 0.375) / 2 = 0.3125
center_length = int(input_length * 0.375)
center_start = start + center_start_offset
center_end = center_start + center_length

print(f"Input region: {chrom}:{start}-{end} (length: {input_length:,} bp)")
print(f"Prediction region: {chrom}:{center_start}-{center_end} (length: {center_length:,} bp)")
print(f"Number of positions: {bigwig_logits.shape[0]}")

# Create output directory
output_dir = "bigwig_outputs"
os.makedirs(output_dir, exist_ok=True)

# Save each track as a separate BigWig file
print(f"\nSaving BigWig files to '{output_dir}/' directory...")
for i, (display_name, track_id, track_idx) in enumerate(track_data_list):
    # Get track data (logits for this track)
    track_data = bigwig_logits[:, track_idx].astype(np.float32)
    
    # Create BigWig file using display name (key) for filename
    # Clean the display name for use as filename (replace spaces, special chars)
    track_clean_name = display_name.replace(" ", "_").replace("/", "_").replace("-", "_")
    bw_filename = os.path.join(output_dir, f"{track_clean_name}.bw")
    bw = pyBigWig.open(bw_filename, "w")
    
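    # Add header (chromosome name and size).
    # Note: `end` is used here as a stand-in for the chromosome length, not the true size of the chromosome.
    bw.addHeader([(chrom, end)])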
    
    # Add entries (intervals with values)
    # Each position in track_data corresponds to one base pair
    starts = np.arange(center_start, center_start + len(track_data), dtype=np.int64)
    ends = starts + 1
    values = track_data.tolist()
    
    bw.addEntries(
        chroms=[chrom] * len(starts),
        starts=starts.tolist(),
        ends=ends.tolist(),
        values=values
    )
    
    bw.close()
    
    print(f"  Saved {i + 1}/{len(track_data_list)}: {display_name} ({track_clean_name}.bw)")

print(f"\n✅ Successfully saved {len(track_data_list)} BigWig files to '{output_dir}/'")
print(f"   Files: {', '.join([name.replace(' ', '_').replace('/', '_').replace('-', '_') for name, _, _ in track_data_list])}")</code></pre></div>
    <div style="margin-top: 15px; padding: 12px 16px; background: rgba(0, 0, 0, 0.4); border: 1px solid var(--border); border-radius: 8px; font-family: var(--mono); font-size: 12px; color: rgba(255, 255, 255, 0.85); line-height: 1.6; white-space: pre-wrap;">
      <strong style="color: var(--muted);">Output:</strong><br>Found 8 tracks to save from tracks_to_plot
Input region: chr19:6700000-6831072 (length: 131,072 bp)
Prediction region: chr19:6740960-6790112 (length: 49,152 bp)
Number of positions: 49152

Saving BigWig files to 'bigwig_outputs/' directory...
  Saved 1/8: K562 RNA-seq (K562_RNA_seq.bw)
  Saved 2/8: K562 DNAse (K562_DNAse.bw)
  Saved 3/8: K562 H3k4me3 (K562_H3k4me3.bw)
  Saved 4/8: K562 CTCF (K562_CTCF.bw)
  Saved 5/8: HepG2 RNA-seq (HepG2_RNA_seq.bw)
  Saved 6/8: HepG2 DNAse (HepG2_DNAse.bw)
  Saved 7/8: HepG2 H3k4me3 (HepG2_H3k4me3.bw)
  Saved 8/8: HepG2 CTCF (HepG2_CTCF.bw)

✅ Successfully saved 8 BigWig files to 'bigwig_outputs/'
   Files: K562_RNA_seq, K562_DNAse, K562_H3k4me3, K562_CTCF, HepG2_RNA_seq, HepG2_DNAse, HepG2_H3k4me3, HepG2_CTCF
    </div>
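    <p style="margin-top: 15px;">The coordinate arithmetic for the prediction region can be checked in isolation. This is a minimal sketch: <code>center_window</code> is a hypothetical helper name, and it derives the offset from the remaining length rather than from the hard-coded 0.3125 factor, which gives the same result for this window.</p>

```python
def center_window(start: int, end: int, fraction: float = 0.375) -> tuple[int, int]:
    """Return genomic (start, end) of the centered sub-window covering `fraction` of the input."""
    input_length = end - start
    center_length = int(input_length * fraction)
    # Split the remaining flank evenly on both sides
    offset = (input_length - center_length) // 2
    center_start = start + offset
    return center_start, center_start + center_length


center_start, center_end = center_window(6_700_000, 6_831_072)
print(center_start, center_end, center_end - center_start)
```

    <p>With the window from the configuration step this yields 6740960, 6790112, and a length of 49152, matching the output shown above.</p>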
    <p style="margin-top: 15px; color: var(--muted); font-size: 13px;">
      This saves each selected functional track as a separate BigWig file that can be visualized in genome browsers. The files are saved with user-friendly display names (e.g., "K562_RNA_seq.bw").
    </p>
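    <p style="margin-top: 15px;">The display-name clean-up is repeated in several places in this notebook, so it can be factored into one function. A minimal sketch, assuming the same three replacements as above; <code>clean_track_name</code> is a hypothetical helper, not part of the pipeline.</p>

```python
def clean_track_name(display_name: str) -> str:
    """Turn a display name into a filename stem by replacing spaces, slashes and dashes."""
    for ch in (" ", "/", "-"):
        display_name = display_name.replace(ch, "_")
    return display_name


for name in ["K562 RNA-seq", "HepG2 H3k4me3"]:
    print(f"{name} -> {clean_track_name(name)}.bw")
```

    <p>Using a single helper keeps the filenames written in this step and the paths read back in the IGV step in sync.</p>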
  </div>

  <div class="card" style="grid-column: span 12;">
    <h2>5) 🌐 Create an IGV Browser</h2>
    <div class="code"><pre><code class="language-python">import igv_notebook

igv_notebook.init()

# Build tracks array from the BigWig files that were actually saved
# (tracks skipped in the previous step have no file on disk)
tracks = []
for track_display_name, _, _ in track_data_list:
    # Clean the display name to match the filename we saved
    track_clean_name = track_display_name.replace(" ", "_").replace("/", "_").replace("-", "_")
    bigwig_path = os.path.join(output_dir, f"{track_clean_name}.bw")
    bigwig_track = {
        "name": track_display_name,
        "format": "bigwig",
        "url": bigwig_path,
        "height": 70,
        "autoscale": True,
        "displayMode": "EXPANDED",
    }
    tracks.append(bigwig_track)

config = {
    "genome": assembly,
    "locus": f"{chrom}:{center_start}-{center_end}",
    "tracks": tracks,
    "theme": "dark",
}

browser = igv_notebook.Browser(config)
browser  # <- just return the object, no .show()</code></pre></div>
    <p style="margin-top: 15px; color: var(--muted); font-size: 13px;">
      This creates an interactive IGV browser visualization with a dark theme showing all the predicted functional tracks. The BigWig files can also be visualized in any genome browser.
    </p>
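    <p style="margin-top: 15px;">The configuration dictionary can also be assembled by a small function that skips tracks whose BigWig file is missing. This is a sketch under assumptions: <code>build_igv_config</code> and its <code>only_existing</code> flag are hypothetical names, and the dictionary keys simply mirror the ones used in the cell above.</p>

```python
import os


def build_igv_config(display_names, output_dir, genome, locus, only_existing=True):
    """Assemble a browser config dict for a set of saved BigWig tracks."""
    tracks = []
    for name in display_names:
        clean = name.replace(" ", "_").replace("/", "_").replace("-", "_")
        path = os.path.join(output_dir, f"{clean}.bw")
        if only_existing and not os.path.exists(path):
            continue  # skip tracks that were never written to disk
        tracks.append({
            "name": name,
            "format": "bigwig",
            "url": path,
            "height": 70,
            "autoscale": True,
            "displayMode": "EXPANDED",
        })
    return {"genome": genome, "locus": locus, "tracks": tracks, "theme": "dark"}


# only_existing=False so the example runs without any files on disk
config = build_igv_config(
    ["K562 RNA-seq", "K562 CTCF"],
    "bigwig_outputs",
    "hg38",
    "chr19:6740960-6790112",
    only_existing=False,
)
print([track["url"] for track in config["tracks"]])
```

    <p>The resulting dictionary has the same shape as <code>config</code> above and could be passed to <code>igv_notebook.Browser</code> in the same way.</p>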
  </div>

  <div class="card" style="grid-column: span 12;">
    <h2>📓 Full Notebook</h2>
    <p>To view and run the complete notebook interactively:</p>
    <ul>
      <li><a href="https://huggingface.co/spaces/InstaDeepAI/ntv3/blob/main/notebooks_pipelines/01_functional_track_prediction.ipynb" target="_blank" rel="noopener noreferrer">View notebook on Hugging Face</a></li>
      <li>Download and run in Jupyter, Google Colab, or any notebook environment</li>
    </ul>
  </div>
</div>