Commit 80d0a14 by bernardo-de-almeida · 1 parent: 7b1a017

fix: remove functional_tracks

Files changed (1)
  1. tabs/functional_tracks.html +0 -253
tabs/functional_tracks.html DELETED
@@ -1,253 +0,0 @@
- <div class="summary">
- <h2>🧬 NTv3 Post-Trained Functional Track Prediction</h2>
- <p>This notebook demonstrates how to use the NTv3 post-trained model to predict functional tracks and genome annotation directly from a DNA sequence.</p>
- <p>The pipeline handles the underlying steps: running inference with the model and plotting the predictions for each track.</p>
- <p>If you're interested in exploring the intermediate probabilities, please refer to the <a href="https://huggingface.co/spaces/InstaDeepAI/ntv3/blob/main/notebooks_tutorials/01_tracks_prediction.ipynb" target="_blank" rel="noopener noreferrer">track-prediction notebook</a>.</p>
- <p>
- <strong>🔗 Quick links:</strong><br>
- • <a href="https://huggingface.co/spaces/InstaDeepAI/ntv3/blob/main/notebooks_pipelines/01_functional_track_prediction.ipynb" target="_blank" rel="noopener noreferrer">View notebook on Hugging Face</a><br>
- • <a href="https://colab.research.google.com/github/InstaDeepAI/ntv3/blob/main/notebooks_pipelines/01_functional_track_prediction.ipynb" target="_blank" rel="noopener noreferrer">Open directly in Google Colab</a>
- </p>
- </div>
-
- <div class="grid">
- <div class="card" style="grid-column: span 12;">
- <h2>0) 📦 Imports + setup</h2>
- <p>Install dependencies:</p>
- <div class="code"><pre><code class="language-bash">pip -q install "transformers>=4.55" "huggingface_hub>=0.23" safetensors torch pyfaidx requests seaborn matplotlib igv_notebook pyBigWig</code></pre></div>
-
- <p style="margin-top: 20px;">Import required libraries:</p>
- <div class="code"><pre><code class="language-python">import re
- import time
- import os
- import torch
- import requests
- import numpy as np
- import pyBigWig
- from transformers import pipeline, AutoConfig</code></pre></div>
- </div>
-
- <div class="card" style="grid-column: span 12;">
- <h2>1) 📦 Configuration</h2>
- <p>Set your NTv3 model and genomic window here:</p>
- <div class="code"><pre><code class="language-python"># Define the model and genomic window
- model_name = "InstaDeepAI/NTv3_650M_pos"
-
- species = "human" # used to condition the model on species
- assembly = "hg38" # used to fetch the chromosome sequence
-
- chrom = "chr19"
- start = 6_700_000
- end = 6_831_072</code></pre></div>
- </div>
-
- <div class="card" style="grid-column: span 12;">
- <h2>2) 📥 Fetch chromosome sequence for the chosen window</h2>
- <div class="code"><pre><code class="language-python"># Get the sequence from the UCSC API
- url = f"https://api.genome.ucsc.edu/getData/sequence?genome={assembly};chrom={chrom};start={start};end={end}"
- seq = requests.get(url).json()["dna"].upper()
- print(f"Original sequence length: {len(seq)}")
-
- # Crop to a multiple of 128 (the pipeline crops again, but that is a no-op once the length is divisible)
- seq = seq[:(len(seq) // 128) * 128]
- print(f"Cropped sequence length: {len(seq)}, {len(seq) / 128} transformer tokens")</code></pre></div>
- <div style="margin-top: 15px; padding: 12px 16px; background: rgba(0, 0, 0, 0.4); border: 1px solid var(--border); border-radius: 8px; font-family: var(--mono); font-size: 12px; color: rgba(255, 255, 255, 0.85); line-height: 1.6;">
- <strong style="color: var(--muted);">Output:</strong><br>
- Original sequence length: 131072<br>
- Cropped sequence length: 131072, 1024.0 transformer tokens
- </div>
- </div>
-
- <div class="card" style="grid-column: span 12;">
- <h2>3) ⚡ Functional track prediction pipeline (pre-processing, inference, plotting)</h2>
- <div class="code"><pre><code class="language-python"># Build NTv3 tracks pipeline
- ntv3_tracks = pipeline(
-     "ntv3-tracks",
-     model=model_name,
-     trust_remote_code=True,
-     device=0 if torch.cuda.is_available() else -1,
- )
-
- # Select tracks to plot
- tracks_to_plot = {
-     "K562 RNA-seq": "ENCSR056HPM",
-     "K562 DNAse": "ENCSR921NMD",
-     "K562 H3k4me3": "ENCSR000DWD",
-     "K562 CTCF": "ENCSR000AKO",
-     "HepG2 RNA-seq": "ENCSR561FEE_P",
-     "HepG2 DNAse": "ENCSR000EJV",
-     "HepG2 H3k4me3": "ENCSR000AMP",
-     "HepG2 CTCF": "ENCSR000BIE",
- }
- elements_to_plot = ["protein_coding_gene", "exon", "intron", "splice_donor", "splice_acceptor"]
-
- # Run pipeline: DNA -> NTv3 -> Tracks -> plot
- start_time = time.time()
-
- ntv3_predictions = ntv3_tracks(
-     {"chrom": chrom, "start": start, "end": end, "species": species},
-     plot=True,
-     tracks_to_plot=tracks_to_plot,
-     elements_to_plot=elements_to_plot,
- )
-
- end_time = time.time()
-
- print(f"Inference + decoding time: {end_time - start_time:.2f} seconds")</code></pre></div>
- <div style="margin-top: 15px; padding: 12px 16px; background: rgba(0, 0, 0, 0.4); border: 1px solid var(--border); border-radius: 8px; font-family: var(--mono); font-size: 12px; color: rgba(255, 255, 255, 0.85); line-height: 1.6;">
- <strong style="color: var(--muted);">Output:</strong><br>
- Device set to use cpu<br>
- Running on device: cpu<br>
- Inference + decoding time: 38.32 seconds
- </div>
- <div style="margin-top: 20px;">
- <img src="assets/output_tracks.png" alt="Output tracks plot" style="width: 100%; height: auto; border-radius: 12px; border: 1px solid var(--border);" />
- </div>
- <p style="margin-top: 15px; color: var(--muted); font-size: 13px;">
- The pipeline performs all the necessary steps: running inference with the model and plotting the predictions for the specified tracks and genomic elements.
- </p>
- </div>
-
- <div class="card" style="grid-column: span 12;">
- <h2>4) 📁 Save as BigWig file</h2>
- <div class="code"><pre><code class="language-python"># Load config to get track names and find indices for tracks_to_plot
- cfg = AutoConfig.from_pretrained(model_name, trust_remote_code=True)
- all_bigwig_names = cfg.bigwigs_per_file_assembly[assembly]
-
- # Find indices of tracks we want to save
- # Use display names (keys) for filenames, but track IDs (values) to find indices
- track_data_list = [] # List of (display_name, track_id, index) tuples
- for display_name, track_id in tracks_to_plot.items():
-     try:
-         idx = all_bigwig_names.index(track_id)
-         track_data_list.append((display_name, track_id, idx))
-     except ValueError:
-         print(f"Warning: Track '{track_id}' ({display_name}) not found in config. Skipping...")
-
- print(f"Found {len(track_data_list)} tracks to save from tracks_to_plot")
-
- # Get predictions (shape: (49152, 7362))
- bigwig_logits = ntv3_predictions.bigwig_tracks_logits
- if isinstance(bigwig_logits, torch.Tensor):
-     bigwig_logits = bigwig_logits.detach().cpu().numpy()
-
- # Calculate genomic coordinates for the center 37.5% region
- # The predictions cover the center 37.5% of the input sequence
- input_length = end - start
- center_start_offset = int(input_length * 0.3125) # (1 - 0.375) / 2 = 0.3125
- center_length = int(input_length * 0.375)
- center_start = start + center_start_offset
- center_end = center_start + center_length
-
- print(f"Input region: {chrom}:{start}-{end} (length: {input_length:,} bp)")
- print(f"Prediction region: {chrom}:{center_start}-{center_end} (length: {center_length:,} bp)")
- print(f"Number of positions: {bigwig_logits.shape[0]}")
-
- # Create output directory
- output_dir = "bigwig_outputs"
- os.makedirs(output_dir, exist_ok=True)
-
- # Save each track as a separate BigWig file
- print(f"\nSaving BigWig files to '{output_dir}/' directory...")
- for i, (display_name, track_id, track_idx) in enumerate(track_data_list):
-     # Get track data (logits for this track)
-     track_data = bigwig_logits[:, track_idx].astype(np.float32)
-
-     # Create BigWig file using display name (key) for filename
-     # Clean the display name for use as filename (replace spaces, special chars)
-     track_clean_name = display_name.replace(" ", "_").replace("/", "_").replace("-", "_")
-     bw_filename = os.path.join(output_dir, f"{track_clean_name}.bw")
-     bw = pyBigWig.open(bw_filename, "w")
-
-     # Add header (chromosome name and size; the window end stands in for the true chromosome length)
-     bw.addHeader([(chrom, end)])
-
-     # Add entries (intervals with values)
-     # Each position in track_data corresponds to one base pair
-     starts = np.arange(center_start, center_start + len(track_data), dtype=np.int64)
-     ends = starts + 1
-     values = track_data.tolist()
-
-     bw.addEntries(
-         chroms=[chrom] * len(starts),
-         starts=starts.tolist(),
-         ends=ends.tolist(),
-         values=values
-     )
-
-     bw.close()
-
-     print(f" Saved {i + 1}/{len(track_data_list)}: {display_name} ({track_clean_name}.bw)")
-
- print(f"\n✅ Successfully saved {len(track_data_list)} BigWig files to '{output_dir}/'")
- print(f" Files: {', '.join([name.replace(' ', '_').replace('/', '_').replace('-', '_') for name, _, _ in track_data_list])}")</code></pre></div>
- <div style="margin-top: 15px; padding: 12px 16px; background: rgba(0, 0, 0, 0.4); border: 1px solid var(--border); border-radius: 8px; font-family: var(--mono); font-size: 12px; color: rgba(255, 255, 255, 0.85); line-height: 1.6; white-space: pre-wrap;">
- <strong style="color: var(--muted);">Output:</strong><br>Found 8 tracks to save from tracks_to_plot
- Input region: chr19:6700000-6831072 (length: 131,072 bp)
- Prediction region: chr19:6740960-6790112 (length: 49,152 bp)
- Number of positions: 49152
-
- Saving BigWig files to 'bigwig_outputs/' directory...
- Saved 1/8: K562 RNA-seq (K562_RNA_seq.bw)
- Saved 2/8: K562 DNAse (K562_DNAse.bw)
- Saved 3/8: K562 H3k4me3 (K562_H3k4me3.bw)
- Saved 4/8: K562 CTCF (K562_CTCF.bw)
- Saved 5/8: HepG2 RNA-seq (HepG2_RNA_seq.bw)
- Saved 6/8: HepG2 DNAse (HepG2_DNAse.bw)
- Saved 7/8: HepG2 H3k4me3 (HepG2_H3k4me3.bw)
- Saved 8/8: HepG2 CTCF (HepG2_CTCF.bw)
-
- ✅ Successfully saved 8 BigWig files to 'bigwig_outputs/'
- Files: K562_RNA_seq, K562_DNAse, K562_H3k4me3, K562_CTCF, HepG2_RNA_seq, HepG2_DNAse, HepG2_H3k4me3, HepG2_CTCF
- </div>
- <p style="margin-top: 15px; color: var(--muted); font-size: 13px;">
- This saves each selected functional track as a separate BigWig file that can be visualized in genome browsers. The files are saved with user-friendly display names (e.g., "K562_RNA_seq.bw").
- </p>
- </div>
-
- <div class="card" style="grid-column: span 12;">
- <h2>5) 🌐 Create an IGV Browser</h2>
- <div class="code"><pre><code class="language-python">import igv_notebook
-
- igv_notebook.init()
-
- # Build tracks array from the BigWig files that were actually saved
- tracks = []
- for track_display_name, _, _ in track_data_list:
-     # Clean the display name to match the filename we saved
-     track_clean_name = track_display_name.replace(" ", "_").replace("/", "_").replace("-", "_")
-     bigwig_path = os.path.join(output_dir, f"{track_clean_name}.bw")
-     bigwig_track = {
-         "name": track_display_name,
-         "format": "bigwig",
-         "url": bigwig_path,
-         "height": 70,
-         "autoscale": True,
-         "displayMode": "EXPANDED",
-     }
-     tracks.append(bigwig_track)
-
- config = {
-     "genome": assembly,
-     "locus": f"{chrom}:{center_start}-{center_end}",
-     "tracks": tracks,
-     "theme": "dark",
- }
-
- browser = igv_notebook.Browser(config)
- browser # <- just return the object, no .show()</code></pre></div>
- <p style="margin-top: 15px; color: var(--muted); font-size: 13px;">
- This creates an interactive IGV browser visualization with a dark theme showing all the predicted functional tracks. The BigWig files can also be visualized in other genome browsers that support the BigWig format.
- </p>
- </div>
-
- <div class="card" style="grid-column: span 12;">
- <h2>📓 Full Notebook</h2>
- <p>To view and run the complete notebook interactively:</p>
- <ul>
- <li><a href="https://huggingface.co/spaces/InstaDeepAI/ntv3/blob/main/notebooks_pipelines/01_functional_track_prediction.ipynb" target="_blank" rel="noopener noreferrer">View notebook on Hugging Face</a></li>
- <li>Download and run in Jupyter, Google Colab, or any notebook environment</li>
- </ul>
- </div>
- </div>
-
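
The deleted tutorial repeats three small calculations inline: cropping the window to a multiple of 128, locating the central 37.5% of the input that the predictions cover, and turning a display name into a filename. A minimal sketch of how they could be factored into helpers is shown here. The function names `crop_to_multiple`, `center_window`, and `clean_track_name` are hypothetical and are not part of the NTv3 code; the constants 128 and 0.375 are taken from the deleted file.

```python
from __future__ import annotations


def crop_to_multiple(length: int, multiple: int = 128) -> int:
    """Return the largest value <= length that is divisible by `multiple`."""
    return (length // multiple) * multiple


def center_window(start: int, end: int, keep_fraction: float = 0.375) -> tuple[int, int]:
    """Return the genomic (start, end) of the central `keep_fraction` of [start, end).

    Mirrors the arithmetic in the deleted file: the offset is half of the
    fraction that is not kept, and the window length is the kept fraction.
    """
    input_length = end - start
    offset = int(input_length * (1 - keep_fraction) / 2)
    center_start = start + offset
    center_end = center_start + int(input_length * keep_fraction)
    return center_start, center_end


def clean_track_name(display_name: str) -> str:
    """Replace spaces, slashes, and hyphens with underscores for use in a filename."""
    for ch in (" ", "/", "-"):
        display_name = display_name.replace(ch, "_")
    return display_name


if __name__ == "__main__":
    # Values from the configuration cell of the deleted tutorial.
    print(crop_to_multiple(6_831_072 - 6_700_000))  # 131072
    print(center_window(6_700_000, 6_831_072))      # (6740960, 6790112)
    print(clean_track_name("K562 RNA-seq"))         # K562_RNA_seq
```

With the tutorial's window (chr19:6700000-6831072), `center_window` reproduces the prediction region printed in the saved output, chr19:6740960-6790112.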