bernardo-de-almeida committed on
Commit de62c11 · 1 Parent(s): 774a1b2

Remove HMM genome annotation pipeline

index.html CHANGED
@@ -275,7 +275,6 @@
  <button class="tab-button active" data-tab="home">🏠 Home</button>
  <button class="tab-button" data-tab="demo">🚀 Live Demo</button>
  <button class="tab-button" data-tab="functional_tracks">💻 Functional Tracks</button>
- <button class="tab-button" data-tab="annotation">🧬 Genome Annotation</button>
  </div>
 
  <!-- Home Tab (Content loaded from tabs/home.html) -->
notebooks_pipelines/02_genome_annotation.ipynb DELETED
The diff for this file is too large to render. See raw diff
 
tabs/annotation.html DELETED
@@ -1,147 +0,0 @@
- <div class="summary">
- <h2>🧬 NTv3 Post-Trained Genome Annotation</h2>
- <p>This notebook demonstrates how to use the NTv3 post-trained model to perform genome annotation directly from a DNA sequence. It relies on a pipeline that applies a Hidden Markov Model (HMM) to the per-base probabilities returned by NTv3, converting them into a coherent gene model that respects biological constraints and valid transitions between genomic elements.</p>
- <p>The pipeline abstracts away all the underlying steps: running inference with the model, retrieving and processing the predicted probabilities, and applying the HMM to generate a consistent annotation. It returns a ready-to-use GFF file that can be visualized in any genome browser for the sequence of interest.</p>
- <p>If you're interested in exploring the intermediate probabilities, please refer to the <a href="https://huggingface.co/spaces/InstaDeepAI/ntv3/blob/main/notebooks_tutorials/01_tracks_prediction.ipynb" target="_blank" rel="noopener">track-prediction notebook</a>. These probabilities can be useful for assessing model confidence and identifying potentially interesting biological regions. This notebook focuses on the higher-level task of producing gene annotations directly from raw DNA.</p>
- <p>
- <strong>🔗 Quick links:</strong><br>
- • <a href="https://huggingface.co/spaces/InstaDeepAI/ntv3/blob/main/notebooks_pipelines/02_genome_annotation.ipynb" target="_blank" rel="noopener">View notebook on Hugging Face</a><br>
- • <a href="https://colab.research.google.com/github/InstaDeepAI/ntv3/blob/main/notebooks_pipelines/02_genome_annotation.ipynb" target="_blank" rel="noopener">Open directly in Google Colab</a>
- </p>
- </div>
-
- <div class="grid">
- <div class="card" style="grid-column: span 12;">
- <h2>0) 📦 Imports + setup</h2>
- <p>Install dependencies:</p>
- <div class="code"><pre><code class="language-bash">pip -q install "transformers>=4.55" "huggingface_hub>=0.23" safetensors torch pyfaidx requests seaborn matplotlib igv_notebook</code></pre></div>
-
- <p style="margin-top: 20px;">Import required libraries:</p>
- <div class="code"><pre><code class="language-python">import re
- import time
- import torch
- import requests
- from transformers import pipeline</code></pre></div>
- </div>
-
- <div class="card" style="grid-column: span 12;">
- <h2>1) 📦 Configuration</h2>
- <p>Set your NTv3 model and genomic window here:</p>
- <div class="code"><pre><code class="language-python"># Define the model and genomic window
- model_name = "InstaDeepAI/NTv3_650M_pos"
- assembly = "hg38"
- chrom = "chr19"
- start = 6_700_000
- end = 6_831_072</code></pre></div>
- </div>
-
- <div class="card" style="grid-column: span 12;">
- <h2>2) 📥 Fetch chromosome sequence for the chosen window</h2>
- <div class="code"><pre><code class="language-python"># Get the sequence from the UCSC API
- url = f"https://api.genome.ucsc.edu/getData/sequence?genome={assembly};chrom={chrom};start={start};end={end}"
- seq = requests.get(url).json()["dna"].upper()
- print(f"Original sequence length: {len(seq)}")
-
- # Crop to multiple of 128 (the pipeline will crop again, but this is a no-op once divisible)
- seq = seq[:int(len(seq) // 128) * 128]
- print(f"Cropped sequence length: {len(seq)}, {len(seq) / 128} transformer tokens")</code></pre></div>
- <div style="margin-top: 15px; padding: 12px 16px; background: rgba(0, 0, 0, 0.4); border: 1px solid var(--border); border-radius: 8px; font-family: var(--mono); font-size: 12px; color: rgba(255, 255, 255, 0.85); line-height: 1.6;">
- <strong style="color: var(--muted);">Output:</strong><br>
- Original sequence length: 131072<br>
- Cropped sequence length: 131072, 1024.0 transformer tokens
- </div>
- </div>
-
- <div class="card" style="grid-column: span 12;">
- <h2>3) ⚡ Genome annotation pipeline (pre-processing, inference, post-processing)</h2>
- <div class="code"><pre><code class="language-python"># Build NTv3 GFF pipeline
- ntv3_gff = pipeline(
-     "ntv3-gff",
-     model=model_name,
-     trust_remote_code=True,
-     device=0 if torch.cuda.is_available() else -1,
- )
-
- # Run pipeline: DNA -> NTv3 -> HMM -> GFF3
- inputs = {
-     "sequence": seq,
-     "chrom": chrom,
-     "start": start,
-     "end": end,
-     "assembly": assembly,
- }
-
- # Run the pipeline
- start_time = time.time()
- gff_text = ntv3_gff(inputs)
- end_time = time.time()
- print(f"Inference + decoding time: {end_time - start_time:.2f} seconds")</code></pre></div>
- <div style="margin-top: 15px; padding: 12px 16px; background: rgba(0, 0, 0, 0.4); border: 1px solid var(--border); border-radius: 8px; font-family: var(--mono); font-size: 12px; color: rgba(255, 255, 255, 0.85); line-height: 1.6;">
- <strong style="color: var(--muted);">Output:</strong><br>
- A new version of the following files was downloaded from https://huggingface.co/InstaDeepAI/NTv3_650M_pos:<br>
- - ntv3_gff_pipeline.py<br>
- . Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.<br>
- Device set to use cpu<br>
- Inference + decoding time: 53.09 seconds
- </div>
- <p style="margin-top: 15px; color: var(--muted); font-size: 13px;">
- The pipeline performs all the necessary steps: running inference with the model, retrieving and processing the predicted probabilities, and applying the HMM to generate a consistent annotation.
- </p>
- </div>
-
- <div class="card" style="grid-column: span 12;">
- <h2>4) 📁 Save a GFF file</h2>
- <div class="code"><pre><code class="language-python"># Save GFF3 file
- short_model_name_match = re.search(r"[^/]+$", model_name)
- short_model_name = short_model_name_match.group() if short_model_name_match else model_name
-
- output_filename = f"{short_model_name}_{assembly}_{chrom}_{start}_{end}.gff3"
- with open(output_filename, "w") as output_file:
-     output_file.write(gff_text)
-
- print(f"Saved GFF file to {output_filename}")</code></pre></div>
- <div style="margin-top: 15px; padding: 12px 16px; background: rgba(0, 0, 0, 0.4); border: 1px solid var(--border); border-radius: 8px; font-family: var(--mono); font-size: 12px; color: rgba(255, 255, 255, 0.85); line-height: 1.6;">
- <strong style="color: var(--muted);">Output:</strong><br>
- Saved GFF file to NTv3_650M_pos_hg38_chr19_6700000_6831072.gff3
- </div>
- </div>
-
- <div class="card" style="grid-column: span 12;">
- <h2>5) 🌐 Create an IGV Browser</h2>
- <div class="code"><pre><code class="language-python">import igv_notebook
-
- igv_notebook.init()
-
- config = {
-     "genome": "hg38", # built-in hg38
-     "locus": f"{chrom}:{start}-{end}",
- }
-
- gff_track = {
-     "name": "NTv3 annotations",
-     "format": "gff3",
-     "type": "annotation",
-     "url": output_filename, # just the filename
- }
-
- browser = igv_notebook.Browser(config)
- browser.load_track(gff_track)
-
- # Re-center on the region, just to be sure
- browser.search(f"{chrom}:{start}-{end}")
- browser # <- just return the object, no .show()</code></pre></div>
- <p style="margin-top: 15px; color: var(--muted); font-size: 13px;">
- This creates an interactive IGV browser visualization of the annotations. The GFF file can also be visualized in any genome browser.
- </p>
- </div>
-
- <div class="card" style="grid-column: span 12;">
- <h2>📓 Full Notebook</h2>
- <p>To view and run the complete notebook interactively:</p>
- <ul>
- <li><a href="https://huggingface.co/spaces/InstaDeepAI/ntv3/blob/main/notebooks_pipelines/02_genome_annotation.ipynb" target="_blank" rel="noopener">View notebook on Hugging Face</a></li>
- <li>Download and run in Jupyter, Google Colab, or any notebook environment</li>
- </ul>
- </div>
- </div>
-
tabs/home.html CHANGED
@@ -93,10 +93,9 @@
  <h2>📓 Pipeline notebooks (browse <a href="https://huggingface.co/spaces/InstaDeepAI/ntv3/tree/main/notebooks_pipelines" target="_blank" rel="noopener">folder</a>)</h2>
  <ul>
  <li><a href="https://huggingface.co/spaces/InstaDeepAI/ntv3/blob/main/notebooks_pipelines/01_functional_track_prediction.ipynb" target="_blank" rel="noopener">🎯 01 — Generate bigwig predictions for certain tracks</a></li>
- <li><a href="https://huggingface.co/spaces/InstaDeepAI/ntv3/blob/main/notebooks_pipelines/02_genome_annotation.ipynb" target="_blank" rel="noopener">🏷️ 02 — Genome annotation / segmentation</a></li>
- <li>🎯 03 — Fine-tune on bigwig tracks</li>
- <li>🔍 04 — Interpret a given genomic region</li>
- <li>🧪 05 — Sequence generation</li>
+ <li>🎯 02 — Fine-tune on bigwig tracks</li>
+ <li>🔍 03 — Interpret a given genomic region</li>
+ <li>🧪 04 — Sequence generation</li>
  </ul>
  </div>
  <div class="card">