<div class="summary">
  <h2>📖 About NTv3</h2>
  <p>
    NTv3 is a multi-species genomic foundation model family that unifies representation learning, functional-track prediction, genome annotation, and controllable sequence generation within a single U-Net-style backbone. It models up to 1 Mb of DNA at single-base resolution, using a conv–Transformer–deconv architecture that efficiently captures both local motifs and long-range regulatory dependencies. NTv3 is first pretrained on ~9T base pairs from the OpenGenome2 corpus spanning >128k species using masked language modeling, and then post-trained with a joint objective on ~16k functional tracks and annotation labels across 24 animal and plant species, enabling state-of-the-art cross-species functional prediction and base-resolution genome annotation.
  </p>
  <p>
    NTv3 also acts as a controllable generative model via masked-diffusion language modeling, allowing targeted design of regulatory sequences (for example, enhancers with specified activity and promoter selectivity) that have been validated experimentally.
  </p>
</div>

<div class="paper-summary">
  <!-- <h2>📄 A foundational model for joint sequence-function multi-species modeling at scale for long-range genomic prediction</h2> -->
  <img src="assets/paper_summary.png" alt="NTv3 Paper Summary" />
</div>

<div class="why-ntv3">
  <h2>✨ Why NTv3?</h2>
  <ul>
    <li>📏 <strong>1 Mb long context at nucleotide resolution</strong> — ~100× longer than typical genomics models.</li>
    <li>🏗️ <strong>Unified architecture</strong> for: masked language modeling, functional-track prediction, genome annotation, and sequence generation.</li>
    <li>🌍 <strong>Cross-species generalization</strong> across 24 animals + plants with a shared conditioned representation space.</li>
    <li>🧬 <strong>U-Net–style architecture</strong> improves stability and GPU efficiency on very long sequences.</li>
    <li>🎯 <strong>Controllable generative modeling</strong>, enabling targeted enhancer/promoter engineering validated by experimental assays.</li>
  </ul>
</div>

<div class="grid">
  <div class="card">
    <h2>🤖 Models (see <a href="https://huggingface.co/collections/InstaDeepAI/nucleotide-transformer-v3" target="_blank" rel="noopener noreferrer">collection</a>)</h2>
    <ul>
      <li>📦 Pretrained checkpoints:
        <div style="margin-top: 8px; margin-left: 0;">
          <div><a href="https://huggingface.co/InstaDeepAI/NTv3_8M_pre" target="_blank" rel="noopener noreferrer"><code>InstaDeepAI/NTv3_8M_pre</code></a></div>
          <div><a href="https://huggingface.co/InstaDeepAI/NTv3_100M_pre" target="_blank" rel="noopener noreferrer"><code>InstaDeepAI/NTv3_100M_pre</code></a></div>
          <div><a href="https://huggingface.co/InstaDeepAI/NTv3_650M_pre" target="_blank" rel="noopener noreferrer"><code>InstaDeepAI/NTv3_650M_pre</code></a></div>
        </div>
      </li>
      <li>🎯 Post-trained checkpoints:
        <div style="margin-top: 8px; margin-left: 0;">
          <div><a href="https://huggingface.co/InstaDeepAI/NTv3_100M_pos" target="_blank" rel="noopener noreferrer"><code>InstaDeepAI/NTv3_100M_pos</code></a></div>
          <div><a href="https://huggingface.co/InstaDeepAI/NTv3_650M_pos" target="_blank" rel="noopener noreferrer"><code>InstaDeepAI/NTv3_650M_pos</code></a></div>
        </div>
      </li>
    </ul>
    <table>
      <thead>
        <tr>
          <th>Model</th>
          <th>Size</th>
          <th>Pre-training</th>
          <th>Post-training</th>
          <th>Usage</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <td><strong>NTv3-8M</strong></td>
          <td>8M params</td>
          <td><span class="checkmark"></span></td>
          <td></td>
          <td>Embeddings, light inference</td>
        </tr>
        <tr>
          <td><strong>NTv3-100M</strong></td>
          <td>100M params</td>
          <td><span class="checkmark"></span></td>
          <td><span class="checkmark"></span></td>
          <td>Embeddings, tracks, annotation</td>
        </tr>
        <tr>
          <td><strong>NTv3-650M</strong></td>
          <td>650M params</td>
          <td><span class="checkmark"></span></td>
          <td><span class="checkmark"></span></td>
          <td>Embeddings, tracks, annotation, best accuracy</td>
        </tr>
      </tbody>
    </table>
  </div>

  <div class="card-stack">
    <div class="card">
      <h2>📓 Tutorial notebooks (browse <a href="https://huggingface.co/spaces/InstaDeepAI/ntv3/tree/main/notebooks_tutorials" target="_blank" rel="noopener noreferrer">folder</a>)</h2>
      <ul>
        <li><a href="https://huggingface.co/spaces/InstaDeepAI/ntv3/blob/main/notebooks_tutorials/00_quickstart_inference.ipynb" target="_blank" rel="noopener noreferrer">🚀 00 — Quickstart inference</a></li>
        <li><a href="https://huggingface.co/spaces/InstaDeepAI/ntv3/blob/main/notebooks_tutorials/01_tracks_prediction.ipynb" target="_blank" rel="noopener noreferrer">📊 01 — Tracks prediction</a></li>
        <li><a href="https://huggingface.co/spaces/InstaDeepAI/ntv3/blob/main/notebooks_tutorials/02_fine_tuning_pretrained_model_biwig.ipynb" target="_blank" rel="noopener noreferrer">🎯 02 — Fine-tune a pre-trained model on bigwig tracks</a></li>
        <li><a href="https://huggingface.co/spaces/InstaDeepAI/ntv3/blob/main/notebooks_tutorials/03_fine_tuning_posttrained_model_biwig.ipynb" target="_blank" rel="noopener noreferrer">🎯 03 — Fine-tune a post-trained model on bigwig tracks</a></li>
        <li><a href="https://huggingface.co/spaces/InstaDeepAI/ntv3/blob/main/notebooks_tutorials/04_fine_tuning_pretrained_model_annotation.ipynb" target="_blank" rel="noopener noreferrer">🏷️ 04 — Fine-tune a pre-trained model on annotations</a></li>
        <li><a href="https://huggingface.co/spaces/InstaDeepAI/ntv3/blob/main/notebooks_tutorials/05_model_interpretation.ipynb" target="_blank" rel="noopener noreferrer">🔍 05 — Model interpretation</a></li>
        <li><a href="https://huggingface.co/spaces/InstaDeepAI/ntv3/blob/main/notebooks_tutorials/06_NTv3_generative_training.ipynb" target="_blank" rel="noopener noreferrer">🧪 06 — Fine-tuning NTv3 into a diffusion model</a></li>
        <li><a href="https://huggingface.co/spaces/InstaDeepAI/ntv3/blob/main/notebooks_tutorials/07_enhancer_generation.ipynb" target="_blank" rel="noopener noreferrer">🪰 07 — Generating enhancer sequences</a></li>
      </ul>
    </div>
    <div class="card">
      <h2>📓 Pipeline notebooks (browse <a href="https://huggingface.co/spaces/InstaDeepAI/ntv3/tree/main/notebooks_pipelines" target="_blank" rel="noopener noreferrer">folder</a>)</h2>
      <ul>
        <li><a href="https://huggingface.co/spaces/InstaDeepAI/ntv3/blob/main/notebooks_pipelines/01_functional_track_prediction.ipynb" target="_blank" rel="noopener noreferrer">🎯 01 — Generate bigwig predictions for certain tracks</a></li>
        <li><a href="https://huggingface.co/spaces/InstaDeepAI/ntv3/blob/main/notebooks_pipelines/02_model_interpretation.ipynb" target="_blank" rel="noopener noreferrer">🔍 02 — Interpret a given genomic region</a></li>
      </ul>
    </div>
    <div class="card">
      <h2>🔗 Links</h2>
      <ul>
        <li><a href="https://www.biorxiv.org/content/10.64898/2025.12.22.695963v1" target="_blank" rel="noopener noreferrer">📄 Paper</a></li>
        <li><a href="https://github.com/instadeepai/nucleotide-transformer" target="_blank" rel="noopener noreferrer">💻 JAX model code (GitHub)</a></li>
        <li><a href="https://huggingface.co/collections/InstaDeepAI/nucleotide-transformer-v3" target="_blank" rel="noopener noreferrer">🎯 HF Model Collection (all NTv3 models)</a></li>
        <li><a href="https://huggingface.co/spaces/InstaDeepAI/ntv3/tree/main/notebooks_tutorials" target="_blank" rel="noopener noreferrer">📚 Tutorial</a> and <a href="https://huggingface.co/spaces/InstaDeepAI/ntv3/tree/main/notebooks_pipelines" target="_blank" rel="noopener noreferrer">🔧 Pipeline</a> notebooks</li>
        <li><a href="https://huggingface.co/spaces/InstaDeepAI/ntv3_benchmark" target="_blank" rel="noopener noreferrer">🏆 NTv3 benchmark leaderboard</a></li>
      </ul>
    </div>
  </div>

  <div class="card-stack">
    <div class="card">
      <h2>🤖 Load a pre-trained model</h2>
      <p>Here is an example of how to load and use a pre-trained NTv3 model.</p>
      <div class="code"><pre><code class="language-python">from transformers import AutoTokenizer, AutoModelForMaskedLM

model_name = "InstaDeepAI/NTv3_650M_pre"

# Load model and tokenizer
model = AutoModelForMaskedLM.from_pretrained(model_name, trust_remote_code=True)
tok = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# Tokenize input sequences
batch = tok(["ATCGNATCG", "ACGT"], add_special_tokens=False, padding=True, pad_to_multiple_of=128, return_tensors="pt")

# Run model
out = model(**batch)

# Print output shapes
print(out.logits.shape)       # (B, L, V = 11)
</code></pre></div>
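<p>The tokenizer call above pads the batch to a common length that is a multiple of 128, which is why both sequences come out at length 128 and the logits have shape (2, 128, 11). A minimal sketch of that rounding (the helper name is ours, for illustration only):</p>

```python
def padded_length(seq_len: int, multiple: int = 128) -> int:
    """Round seq_len up to the nearest multiple, as pad_to_multiple_of does."""
    return -(-seq_len // multiple) * multiple

# "ATCGNATCG" (9 nt) and "ACGT" (4 nt) are both padded to 128 tokens
print(padded_length(9))    # 128
print(padded_length(200))  # 256
```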
      <p>Model embeddings can be used for fine-tuning on downstream tasks.</p>
    </div>
    <div class="card">
      <h2>🔍 Model interpretation</h2>
      <p>Here is an example of how to use the interpretation pipeline on the NTv3 post-trained model for multi-scale analysis of DNA sequences:</p>
      <div class="code"><pre><code class="language-python">from transformers import pipeline
import torch
import matplotlib.pyplot as plt

model_name = "InstaDeepAI/NTv3_650M_pos"

# Build interpretation pipeline
ntv3_interpret = pipeline(
    "ntv3-interpret",
    model=model_name,
    trust_remote_code=True,
    device=0 if torch.cuda.is_available() else -1,
)

# Run interpretation on a given genomic region with tracks, annotations, attention, and saliency
result = ntv3_interpret(
    {"chrom": "chr11", "start": 5_253_561, "end": 5_286_329, "species": "human"},
    output_attention=True,
    output_saliency=True,
    saliency_track_id="ENCSR000EFT",  # K562 GATA1 ChIP-seq
    plot=True,  # plot predictions on tracks and annotations
    tracks_to_plot={"K562 RNA-seq": "ENCSR056HPM", "K562 GATA1": "ENCSR000EFT"},
    elements_to_plot=["exon", "promoter_Tissue_specific"],
)

# Access attention map results
result.plot_attention()  # attention map (last layer)
plt.show()

# Access saliency scores results
result.plot_saliency(window_size=128)
plt.show()
</code></pre></div>
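<p>The <code>window_size</code> argument to <code>plot_saliency</code> controls the resolution at which per-base saliency is displayed. As an illustration only (not the pipeline's actual implementation), aggregating a per-base score track with a sliding-window average looks like this:</p>

```python
def moving_average(scores, window_size=128):
    """Average a per-base score track over sliding windows of window_size bases."""
    if window_size > len(scores):
        raise ValueError("window_size larger than the score track")
    return [
        sum(scores[i : i + window_size]) / window_size
        for i in range(len(scores) - window_size + 1)
    ]

print(moving_average([1.0, 2.0, 3.0, 4.0], window_size=2))  # [1.5, 2.5, 3.5]
```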
      <img src="assets/saliency_example.png" alt="Output tracks visualization" style="max-width: 100%; margin-top: 20px;" />
    </div>
  </div>
  
  <div class="card">
    <h2>💻 Use a post-trained model</h2>
    <p>Here is a quick example of how to use the post-trained NTv3 650M model to predict tracks for a human genomic window.</p>
    <div class="code"><pre><code class="language-python">from transformers import pipeline
import torch

model_name = "InstaDeepAI/NTv3_650M_pos"

ntv3_tracks = pipeline(
    "ntv3-tracks",
    model=model_name,
    trust_remote_code=True,
    device=0 if torch.cuda.is_available() else -1,
)

# Run track prediction
out = ntv3_tracks(
  {
    "chrom": "chr19",
    "start": 6_700_000,
    "end": 6_831_072,
    "species": "human"
  }
)

# Print output shapes
# ~7k human tracks over the central 37.5% of the input sequence
print("bigwig_tracks_logits:", tuple(out.bigwig_tracks_logits.shape))
# Locations of 21 genomic elements over the central 37.5% of the input sequence
print("bed_tracks_logits:", tuple(out.bed_tracks_logits.shape))
# Language model logits for whole sequence over vocabulary
print("language model logits:", tuple(out.mlm_logits.shape))</code></pre></div>
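<p>Because the track and element predictions cover only the central 37.5% of the input window, it is often useful to map them back to genomic coordinates. A small helper for that arithmetic (our own naming; it assumes the predicted region is exactly centered, as the comments above state):</p>

```python
def central_window(start: int, end: int, fraction: float = 0.375):
    """Genomic coordinates of the central `fraction` of an input window."""
    length = end - start
    flank = int(length * (1 - fraction)) // 2
    return start + flank, end - flank

# For the chr19 window above (131,072 bp of input), predictions
# span the central 49,152 bp:
print(central_window(6_700_000, 6_831_072))  # (6740960, 6790112)
```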
    <p>Predictions can also be plotted for a subset of functional tracks and genomic elements:</p>
    <div class="code"><pre><code class="language-python">tracks_to_plot = {
    "K562 RNA-seq": "ENCSR056HPM",
    "K562 DNase": "ENCSR921NMD",
    "K562 H3K4me3": "ENCSR000DWD",
    "K562 CTCF": "ENCSR000AKO",
    "HepG2 RNA-seq": "ENCSR561FEE_P",
    "HepG2 DNase": "ENCSR000EJV",
    "HepG2 H3K4me3": "ENCSR000AMP",
    "HepG2 CTCF": "ENCSR000BIE",
}
elements_to_plot = ["protein_coding_gene", "exon", "intron", "splice_donor", "splice_acceptor"]

out = ntv3_tracks(
    {"chrom": "chr19", "start": 6_700_000, "end": 6_831_072, "species": "human"},
    plot=True,
    tracks_to_plot=tracks_to_plot,
    elements_to_plot=elements_to_plot,
)</code></pre></div>
    <img src="assets/output_tracks.png" alt="Output tracks visualization" style="max-width: 100%; margin-top: 20px;" />
  </div>
</div>