bernardo-de-almeida committed on
Commit
1fb2a3c
·
1 Parent(s): df00fce

feat: improve main page

Files changed (2)
  1. README.md +57 -29
  2. index.html +88 -34
README.md CHANGED
@@ -7,63 +7,91 @@ sdk: static
7
  pinned: false
8
  ---
9
 
10
- # NTv3 — Foundation Models for Long-Range Genomics
11
 
12
  This Space is the companion hub for NTv3 checkpoints on the Hugging Face Hub. It provides PyTorch notebooks and minimal examples for inference, sequence-to-function prediction (functional tracks), genome annotation, fine-tuning, model interpretation and sequence generation.
13
 
14
- ## Notebooks
 
 
 
 
 
 
15
 
16
  Notebooks live in `./notebooks/`:
17
 
18
- - `00_quickstart_inference.ipynb` — load a checkpoint + run inference
19
- - `01_tracks_prediction.ipynb` — sequence → functional tracks (+ plotting)
20
- - `02_genome_annotation_segmentation.ipynb` — sequence → annotation
21
- - `03_finetune_head.ipynb` — fine-tune on bigwig tracks
22
- - `04_model_interpretation.ipynb` — interpretation of post-trained model
23
- - `05_sequence_generation.ipynb` — fine-tune NTv3 to generate enhancer sequences
24
 
25
- ## Install
26
 
27
  ```bash
28
  pip install torch transformers accelerate safetensors huggingface_hub numpy
29
  ```
30
 
31
- ## Load a model (To DO)
32
 
33
  ```python
 
34
 
 
 
 
35
 
 
 
 
 
 
 
36
  ```
37
 
38
- ## Pipelines (To DO)
 
 
39
 
40
  ```python
41
- from transformers import pipeline
42
- import torch
43
-
44
- pipe = pipeline(
45
- task="ntv3-tracks",
46
- model="InstaDeepAI/NTv3_650M",
47
- trust_remote_code=True,
48
- device="cuda",
49
- torch_dtype=torch.bfloat16,
 
 
 
 
 
 
 
50
  )
51
 
52
- out = pipe("ACGT...")
 
 
53
  ```
54
 
55
- ## Checkpoints
56
 
57
- **Pre-trained:** `InstaDeepAI/NTv3_8M_pre`, `InstaDeepAI/NTv3_100M_pre`, `InstaDeepAI/NTv3_650M_pre`
58
 
59
- **Post-trained:** `InstaDeepAI/NTv3_100M`, `InstaDeepAI/NTv3_650M`
60
 
61
- ## Links
62
 
63
- - **Paper:** (add link)
64
- - **JAX research code (GitHub):** [https://github.com/instadeepai/nucleotide-transformer](https://github.com/instadeepai/nucleotide-transformer)
 
65
 
66
- ## Citation
67
 
68
  ```bibtex
69
  @article{ntv3,
@@ -74,7 +102,7 @@ out = pipe("ACGT...")
74
  }
75
  ```
76
 
77
- ## License
78
 
79
  **Code & notebooks in this Space:** (choose and add, e.g., Apache-2.0)
80
 
 
7
  pinned: false
8
  ---
9
 
10
+ # 🧬 NTv3 — Foundation Models for Long-Range Genomics
11
 
12
  This Space is the companion hub for NTv3 checkpoints on the Hugging Face Hub. It provides PyTorch notebooks and minimal examples for inference, sequence-to-function prediction (functional tracks), genome annotation, fine-tuning, model interpretation and sequence generation.
13
 
14
+ ## 📖 About NTv3
15
+
16
+ NTv3 is a multi-species genomic foundation model family that unifies representation learning, functional-track prediction, genome annotation, and controllable sequence generation within a single U-Net-style backbone. It models up to 1 Mb of DNA at single-base resolution, using a conv–Transformer–deconv architecture that efficiently captures both local motifs and long-range regulatory dependencies. NTv3 is first pre-trained on ~9T base pairs from the OpenGenome2 corpus spanning >128k species using masked language modeling, and then post-trained with a joint objective on ~16k functional tracks and annotation labels across 24 animal and plant species, enabling state-of-the-art cross-species functional prediction and base-resolution genome annotation.
17
+
18
+ Beyond prediction, NTv3 can be fine-tuned into a controllable generative model via masked-diffusion language modeling, allowing targeted design of regulatory sequences (for example, enhancers with specified activity and promoter selectivity) that have been validated experimentally.
19
+
20
+ ## 📓 Notebooks
21
 
22
  Notebooks live in `./notebooks/`:
23
 
24
+ - 🚀 `00_quickstart_inference.ipynb` — load a checkpoint + run inference
25
+ - 📊 `01_tracks_prediction.ipynb` — sequence → functional tracks (+ plotting)
26
+ - 🏷️ `02_genome_annotation_segmentation.ipynb` — sequence → annotation
27
+ - 🎯 `03_finetune_head.ipynb` — fine-tune on bigwig tracks
28
+ - 🔍 `04_model_interpretation.ipynb` — interpretation of post-trained model
29
+ - 🧪 `05_sequence_generation.ipynb` — fine-tune NTv3 to generate enhancer sequences
30
 
31
+ ## 📦 Install
32
 
33
  ```bash
34
  pip install torch transformers accelerate safetensors huggingface_hub numpy
35
  ```
36
 
37
+ ## 🤖 Load a pre-trained model
38
 
39
  ```python
40
+ from transformers import AutoTokenizer, AutoModelForMaskedLM
41
 
42
+ repo = "InstaDeepAI/NTv3_650M_pre"
43
+ tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
44
+ model = AutoModelForMaskedLM.from_pretrained(repo, trust_remote_code=True)
45
 
46
+ batch = tok(["ATCGNATCG", "ACGT"], add_special_tokens=False, padding=True, pad_to_multiple_of=128, return_tensors="pt")
47
+ out = model(**batch, output_hidden_states=True, output_attentions=True)
48
+
49
+ print(out.logits.shape) # (B, L, V = 11)
50
+ print(len(out.hidden_states)) # convs + transformers + deconvs
51
+ print(len(out.attentions)) # equals transformer layers = 12
52
  ```
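The `pad_to_multiple_of=128` argument in the tokenizer call above rounds the batch length up to the next multiple of 128. The sketch below reproduces that arithmetic in plain Python. It assumes one token per nucleotide (consistent with the single-base resolution described in the About section, but not confirmed by this commit) and no special tokens. `padded_length` is a hypothetical helper for illustration, not part of `transformers` or the NTv3 code.

```python
import math

def padded_length(seq_lengths, multiple=128):
    """Length of a batch padded to its longest sequence, rounded up to `multiple`."""
    longest = max(seq_lengths)
    return math.ceil(longest / multiple) * multiple

# Sequences from the example above, assuming one token per base.
lengths = [len("ATCGNATCG"), len("ACGT")]
print(padded_length(lengths))  # 128
```

Under these assumptions, the `L` dimension of `out.logits` for the two short sequences would be 128.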
53
 
54
+ ## 💻 Pipelines
55
+
56
+ Here is a quick example of how to use the post-trained NTv3 650M model on a human genomic window.
57
 
58
  ```python
59
+ from transformers import AutoConfig
60
+
61
+ model_name = "InstaDeepAI/NTv3_650M"
62
+
63
+ # Load track prediction pipeline
64
+ cfg = AutoConfig.from_pretrained(model_name, trust_remote_code=True, force_download=True)
65
+ pipe = cfg.load_tracks_pipeline(model_name, device="auto") # or "cpu"/"cuda"/"mps"
66
+
67
+ # Run track prediction
68
+ out = pipe(
69
+ {
70
+ "chrom": "chr19",
71
+ "start": 6_700_000,
72
+ "end": 6_831_072,
73
+ "species": "human"
74
+ }
75
  )
76
 
77
+ print(out.bigwig_tracks_logits.shape) # functional track predictions
78
+ print(out.bed_tracks_logits.shape) # genome annotation predictions
79
+ print(out.mlm_logits.shape) # MLM logits: (B, L, V = 11)
80
  ```
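The window passed to the pipeline above spans `end - start` bases. A quick check with a hypothetical `window_length` helper, assuming half-open `[start, end)` coordinates (the commit does not state the convention), suggests the window is 131,072 bp, which is a multiple of the 128-token padding unit used in the tokenizer example.

```python
def window_length(start: int, end: int) -> int:
    """Number of bases in a window, assuming half-open [start, end) coordinates."""
    if start < 0 or end <= start:
        raise ValueError(f"invalid window: start={start}, end={end}")
    return end - start

length = window_length(6_700_000, 6_831_072)
print(length)             # 131072
print(length % 128 == 0)  # True
```

Whether the pipeline requires window lengths that are multiples of 128 is not stated here, so treat the divisibility check as a sanity test rather than a documented constraint.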
81
 
82
+ ## 🤖 Checkpoints
83
 
84
+ **📦 Pre-trained:** `InstaDeepAI/NTv3_8M_pre`, `InstaDeepAI/NTv3_100M_pre`, `InstaDeepAI/NTv3_650M_pre`
85
 
86
+ **🎯 Post-trained:** `InstaDeepAI/NTv3_100M`, `InstaDeepAI/NTv3_650M`
87
 
88
+ ## 🔗 Links
89
 
90
+ - **📄 Paper:** (add link)
91
+ - **💻 JAX research code (GitHub):** [https://github.com/instadeepai/nucleotide-transformer](https://github.com/instadeepai/nucleotide-transformer)
92
+ - **🏆 NTv3 benchmark leaderboard:** (add link)
93
 
94
+ ## 📝 Citation
95
 
96
  ```bibtex
97
  @article{ntv3,
 
102
  }
103
  ```
104
 
105
+ ## 📜 License
106
 
107
  **Code & notebooks in this Space:** (choose and add, e.g., Apache-2.0)
108
 
index.html CHANGED
@@ -5,6 +5,7 @@
5
  <meta name="viewport" content="width=device-width,initial-scale=1" />
6
  <title>NTv3 — Foundation Models for Long-Range Genomics</title>
7
  <meta name="description" content="NTv3 companion hub: PyTorch notebooks for inference, fine-tuning, interpretation, and sequence generation on NTv3 models hosted on Hugging Face." />
 
8
  <style>
9
  :root {
10
  --bg: #0b1020;
@@ -85,6 +86,36 @@
85
  font-size: inherit;
86
  color: inherit;
87
  }
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
88
  .paper-summary {
89
  margin-top: 12px;
90
  padding: 24px;
@@ -114,79 +145,100 @@
114
  <body>
115
  <div class="wrap">
116
  <div class="hero">
117
- <h1>NTv3 — Foundation Models for Long-Range Genomics</h1>
118
  <p>
119
  This Space is the companion hub for <strong>NTv3</strong> models: runnable notebooks for inference, fine-tuning, interpretation, and sequence generation.
120
  </p>
121
 
122
  <div class="pillrow">
123
- <span class="pill">Foundation Models</span>
124
- <span class="pill">Long-context genomics</span>
125
- <span class="pill">Multi-species</span>
126
- <span class="pill">Inference • Fine-tune • Interpret • Generate</span>
127
- <span class="pill">Torch notebooks</span>
128
  </div>
129
  </div>
130
 
 
 
 
 
 
 
 
 
 
 
131
  <div class="grid">
132
  <div class="card">
133
- <h2>Models</h2>
134
  <ul>
135
- <li>Pretrained checkpoints (see <a href="https://huggingface.co/collections/InstaDeepAI/nucleotide-transformer-v3" target="_blank" rel="noopener">collection</a>):
136
  <div style="margin-top: 8px; margin-left: 0;">
137
  <div><a href="https://huggingface.co/InstaDeepAI/NTv3_8M_pre"><code>InstaDeepAI/NTv3_8M_pre</code></a></div>
138
  <div><a href="https://huggingface.co/InstaDeepAI/NTv3_100M_pre"><code>InstaDeepAI/NTv3_100M_pre</code></a></div>
139
  <div><a href="https://huggingface.co/InstaDeepAI/NTv3_650M_pre"><code>InstaDeepAI/NTv3_650M_pre</code></a></div>
140
  </div>
141
  </li>
142
- <li>Post-trained checkpoints:
143
  <div style="margin-top: 8px; margin-left: 0;">
144
- <div><a href="https://huggingface.co/InstaDeepAI/ntv3_650M_7downsample_post_trained_1mb"><code>InstaDeepAI/ntv3_650M_7downsample_post_trained_1mb</code></a></div>
145
- <div><a href="https://huggingface.co/InstaDeepAI/ntv3_106M_7downsample_post_trained_1mb"><code>InstaDeepAI/ntv3_106M_7downsample_post_trained_1mb</code></a></div>
146
  </div>
147
  </li>
148
  </ul>
149
  </div>
150
 
151
  <div class="card">
152
- <h2>Notebooks</h2>
153
  <ul>
154
- <li><a href="https://huggingface.co/spaces/InstaDeepAI/ntv3/tree/main/notebooks" target="_blank" rel="noopener">Browse notebooks folder</a></li>
155
- <li><a href="https://huggingface.co/spaces/InstaDeepAI/ntv3/blob/main/notebooks/00_quickstart_inference.ipynb" target="_blank" rel="noopener">00 Quickstart inference</a></li>
156
- <li><a href="https://huggingface.co/spaces/InstaDeepAI/ntv3/blob/main/notebooks/01_tracks_prediction.ipynb" target="_blank" rel="noopener">01 Tracks prediction</a></li>
157
- <li>02 Genome annotation / segmentation</li>
158
- <li>03 Fine-tune on bigwig tracks</li>
159
- <li>04 Model interpretation</li>
160
- <li>05 — Sequence generation</li>
161
  </ul>
162
  </div>
163
 
164
  <div class="card">
165
- <h2>Model usage (to update)</h2>
166
- <p>Here is a quick example of how to use NTv3 models.</p>
167
- <div class="code"><code>from transformers import pipeline
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
168
 
169
- pipe = pipeline(
170
- task="ntv3-tracks",
171
- model="InstaDeepAI/ntv3_106M_7downsample_post_trained_1mb",
172
- trust_remote_code=True,
173
- device="cuda",
174
- torch_dtype=torch.bfloat16,
175
- )</code></div>
176
  </div>
177
 
178
  <div class="card">
179
- <h2>Links</h2>
180
  <ul>
181
- <li>Paper: (add link)</li>
182
- <li><a href="https://github.com/instadeepai/nucleotide-transformer">JAX model code (GitHub)</a></li>
183
- <li>NTv3 benchmark leaderboard: (add link)</li>
184
  </ul>
185
  </div>
186
  </div>
187
 
188
  <div class="paper-summary">
189
- <h2>A foundational model for joint sequence-function multi-species modeling at scale for long-range genomic prediction</h2>
190
  <img src="assets/paper_summary.png" alt="NTv3 Paper Summary" />
191
  </div>
192
 
@@ -194,5 +246,7 @@ pipe = pipeline(
194
  © instadeep-ai — NTv3 companion Space.
195
  </p>
196
  </div>
 
 
197
  </body>
198
  </html>
 
5
  <meta name="viewport" content="width=device-width,initial-scale=1" />
6
  <title>NTv3 — Foundation Models for Long-Range Genomics</title>
7
  <meta name="description" content="NTv3 companion hub: PyTorch notebooks for inference, fine-tuning, interpretation, and sequence generation on NTv3 models hosted on Hugging Face." />
8
+ <link href="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/themes/prism-tomorrow.min.css" rel="stylesheet" />
9
  <style>
10
  :root {
11
  --bg: #0b1020;
 
86
  font-size: inherit;
87
  color: inherit;
88
  }
89
+ /* Prism.js theme overrides to match dark theme */
90
+ .code pre[class*="language-"] {
91
+ background: transparent;
92
+ margin: 0;
93
+ padding: 0;
94
+ }
95
+ .code code[class*="language-"] {
96
+ background: transparent;
97
+ }
98
+ .summary {
99
+ margin-top: 18px;
100
+ padding: 24px;
101
+ border: 1px solid var(--border);
102
+ background: var(--card);
103
+ border-radius: var(--radius);
104
+ box-shadow: var(--shadow);
105
+ }
106
+ .summary h2 {
107
+ margin: 0 0 16px 0;
108
+ font-size: 18px;
109
+ letter-spacing: 0.01em;
110
+ }
111
+ .summary p {
112
+ margin: 0 0 14px 0;
113
+ color: var(--muted);
114
+ line-height: 1.7;
115
+ }
116
+ .summary p:last-child {
117
+ margin-bottom: 0;
118
+ }
119
  .paper-summary {
120
  margin-top: 12px;
121
  padding: 24px;
 
145
  <body>
146
  <div class="wrap">
147
  <div class="hero">
148
+ <h1>🧬 NTv3 — Foundation Models for Long-Range Genomics</h1>
149
  <p>
150
  This Space is the companion hub for <strong>NTv3</strong> models: runnable notebooks for inference, fine-tuning, interpretation, and sequence generation.
151
  </p>
152
 
153
  <div class="pillrow">
154
+ <span class="pill">🤖 Foundation Models</span>
155
+ <span class="pill">🧬 Long-context genomics</span>
156
+ <span class="pill">🌍 Multi-species</span>
157
+ <span class="pill">⚡ Inference • Fine-tune • Interpret • Generate</span>
158
+ <span class="pill">📓 Torch notebooks</span>
159
  </div>
160
  </div>
161
 
162
+ <div class="summary">
163
+ <h2>📖 About NTv3</h2>
164
+ <p>
165
+ NTv3 is a multi-species genomic foundation model family that unifies representation learning, functional-track prediction, genome annotation, and controllable sequence generation within a single U-Net-style backbone. It models up to 1 Mb of DNA at single-base resolution, using a conv–Transformer–deconv architecture that efficiently captures both local motifs and long-range regulatory dependencies. NTv3 is first pretrained on ~9T base pairs from the OpenGenome2 corpus spanning >128k species using masked language modeling, and then post-trained with a joint objective on ~16k functional tracks and annotation labels across 24 animal and plant species, enabling state-of-the-art cross-species functional prediction and base-resolution genome annotation.
166
+ </p>
167
+ <p>
168
+ Beyond prediction, NTv3 can be fine-tuned into a controllable generative model via masked-diffusion language modeling, allowing targeted design of regulatory sequences (for example, enhancers with specified activity and promoter selectivity) that have been validated experimentally.
169
+ </p>
170
+ </div>
171
+
172
  <div class="grid">
173
  <div class="card">
174
+ <h2>🤖 Models (see <a href="https://huggingface.co/collections/InstaDeepAI/nucleotide-transformer-v3" target="_blank" rel="noopener">collection</a>)</h2>
175
  <ul>
176
+ <li>📦 Pretrained checkpoints:
177
  <div style="margin-top: 8px; margin-left: 0;">
178
  <div><a href="https://huggingface.co/InstaDeepAI/NTv3_8M_pre"><code>InstaDeepAI/NTv3_8M_pre</code></a></div>
179
  <div><a href="https://huggingface.co/InstaDeepAI/NTv3_100M_pre"><code>InstaDeepAI/NTv3_100M_pre</code></a></div>
180
  <div><a href="https://huggingface.co/InstaDeepAI/NTv3_650M_pre"><code>InstaDeepAI/NTv3_650M_pre</code></a></div>
181
  </div>
182
  </li>
183
+ <li>🎯 Post-trained checkpoints:
184
  <div style="margin-top: 8px; margin-left: 0;">
185
+ <div><a href="https://huggingface.co/InstaDeepAI/NTv3_100M"><code>InstaDeepAI/NTv3_100M</code></a></div>
186
+ <div><a href="https://huggingface.co/InstaDeepAI/NTv3_650M"><code>InstaDeepAI/NTv3_650M</code></a></div>
187
  </div>
188
  </li>
189
  </ul>
190
  </div>
191
 
192
  <div class="card">
193
+ <h2>📓 Notebooks (browse <a href="https://huggingface.co/spaces/InstaDeepAI/ntv3/tree/main/notebooks" target="_blank" rel="noopener">folder</a>)</h2>
194
  <ul>
195
+ <li><a href="https://huggingface.co/spaces/InstaDeepAI/ntv3/blob/main/notebooks/00_quickstart_inference.ipynb" target="_blank" rel="noopener">🚀 00 Quickstart inference</a></li>
196
+ <li><a href="https://huggingface.co/spaces/InstaDeepAI/ntv3/blob/main/notebooks/01_tracks_prediction.ipynb" target="_blank" rel="noopener">📊 01 Tracks prediction</a></li>
197
+ <li>🏷️ 02 Genome annotation / segmentation</li>
198
+ <li>🎯 03 Fine-tune on bigwig tracks</li>
199
+ <li>🔍 04 Model interpretation</li>
200
+ <li>🧪 05 Sequence generation</li>
 
201
  </ul>
202
  </div>
203
 
204
  <div class="card">
205
+ <h2>💻 Model usage</h2>
206
+ <p>Here is a quick example of how to use the post-trained NTv3 650M model on a human genomic window.</p>
207
+ <div class="code"><pre><code class="language-python">from transformers import AutoConfig
208
+
209
+ model_name = "InstaDeepAI/NTv3_650M"
210
+
211
+ # Load track prediction pipeline
212
+ cfg = AutoConfig.from_pretrained(model_name, trust_remote_code=True, force_download=True)
213
+ pipe = cfg.load_tracks_pipeline(model_name, device="auto") # or "cpu"/"cuda"/"mps"
214
+
215
+ # Run track prediction
216
+ out = pipe(
217
+ {
218
+ "chrom": "chr19",
219
+ "start": 6_700_000,
220
+ "end": 6_831_072,
221
+ "species": "human"
222
+ }
223
+ )
224
 
225
+ print(out.bigwig_tracks_logits.shape) # functional track predictions
226
+ print(out.bed_tracks_logits.shape) # genome annotation predictions
227
+ print(out.mlm_logits.shape) # MLM logits: (B, L, V = 11)</code></pre></div>
 
 
 
 
228
  </div>
229
 
230
  <div class="card">
231
+ <h2>🔗 Links</h2>
232
  <ul>
233
+ <li>📄 Paper: (add link)</li>
234
+ <li><a href="https://github.com/instadeepai/nucleotide-transformer">💻 JAX model code (GitHub)</a></li>
235
+ <li>🏆 NTv3 benchmark leaderboard: (add link)</li>
236
  </ul>
237
  </div>
238
  </div>
239
 
240
  <div class="paper-summary">
241
+ <h2>📄 A foundational model for joint sequence-function multi-species modeling at scale for long-range genomic prediction</h2>
242
  <img src="assets/paper_summary.png" alt="NTv3 Paper Summary" />
243
  </div>
244
 
 
246
  © instadeep-ai — NTv3 companion Space.
247
  </p>
248
  </div>
249
+ <script src="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/components/prism-core.min.js"></script>
250
+ <script src="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/plugins/autoloader/prism-autoloader.min.js"></script>
251
  </body>
252
  </html>