<div class="summary">
  <h2>📖 About NTv3</h2>
  <p>
    NTv3 is a multi-species genomic foundation model family that unifies representation learning, functional-track prediction, genome annotation, and controllable sequence generation within a single U-Net-style backbone. It models up to 1 Mb of DNA at single-base resolution, using a conv–Transformer–deconv architecture that efficiently captures both local motifs and long-range regulatory dependencies. NTv3 is first pretrained on ~9T base pairs from the OpenGenome2 corpus spanning >128k species using masked language modeling, and then post-trained with a joint objective on ~16k functional tracks and annotation labels across 24 animal and plant species, enabling state-of-the-art cross-species functional prediction and base-resolution genome annotation.
  </p>
  <p>
    Beyond prediction, NTv3 can be fine-tuned into a controllable generative model via masked-diffusion language modeling, allowing targeted design of regulatory sequences (for example, enhancers with specified activity and promoter selectivity) that have been validated experimentally.
  </p>
</div>
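<p>For intuition, the masked language modeling objective used during pretraining can be sketched with toy tensors (a schematic illustration only, not NTv3's training code; the vocabulary size of 11 matches the quickstart example below):</p>

```python
import torch
import torch.nn.functional as F

# Toy setup: batch of 2 sequences, 16 nucleotide tokens, vocabulary of 11.
B, L, V = 2, 16, 11
tokens = torch.randint(0, 5, (B, L))   # stand-in for tokenized DNA
logits = torch.randn(B, L, V)          # stand-in for model output logits

# Mask ~15% of positions; the loss is computed only on masked tokens.
mask = torch.rand(B, L) < 0.15
mask[0, 0] = True                      # ensure at least one masked position
loss = F.cross_entropy(logits[mask], tokens[mask])
print(float(loss))
```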

<div class="paper-summary">
  <!-- <h2>📄 A foundational model for joint sequence-function multi-species modeling at scale for long-range genomic prediction</h2> -->
  <img src="assets/paper_summary.png" alt="NTv3 Paper Summary" />
</div>

<div class="why-ntv3">
  <h2>✨ Why NTv3?</h2>
  <ul>
    <li>📏 <strong>1 Mb long context at nucleotide resolution</strong> — ~100× longer than typical genomics models.</li>
    <li>🏗️ <strong>Unified architecture</strong> for: masked language modeling, functional-track prediction, genome annotation, and sequence generation.</li>
    <li>🌍 <strong>Cross-species generalization</strong> across 24 animals + plants with a shared conditioned representation space.</li>
    <li>⚡ <strong>U-Net–style architecture</strong> improves stability and GPU efficiency on very long sequences.</li>
    <li>🎯 <strong>Controllable generative modeling</strong>, enabling targeted enhancer/promoter engineering validated by experimental assays.</li>
  </ul>
</div>

<div class="grid">
  <div class="card">
    <h2>🤖 Models (see <a href="https://huggingface.co/collections/InstaDeepAI/nucleotide-transformer-v3" target="_blank" rel="noopener noreferrer">collection</a>)</h2>
    <ul>
      <li>📦 Pretrained checkpoints:
        <div style="margin-top: 8px; margin-left: 0;">
          <div><a href="https://huggingface.co/InstaDeepAI/NTv3_8M_pre" target="_blank" rel="noopener noreferrer"><code>InstaDeepAI/NTv3_8M_pre</code></a></div>
          <div><a href="https://huggingface.co/InstaDeepAI/NTv3_100M_pre" target="_blank" rel="noopener noreferrer"><code>InstaDeepAI/NTv3_100M_pre</code></a></div>
          <div><a href="https://huggingface.co/InstaDeepAI/NTv3_650M_pre" target="_blank" rel="noopener noreferrer"><code>InstaDeepAI/NTv3_650M_pre</code></a></div>
        </div>
      </li>
      <li>🎯 Post-trained checkpoints:
        <div style="margin-top: 8px; margin-left: 0;">
          <div><a href="https://huggingface.co/InstaDeepAI/NTv3_100M_pos" target="_blank" rel="noopener noreferrer"><code>InstaDeepAI/NTv3_100M_pos</code></a></div>
          <div><a href="https://huggingface.co/InstaDeepAI/NTv3_650M_pos" target="_blank" rel="noopener noreferrer"><code>InstaDeepAI/NTv3_650M_pos</code></a></div>
        </div>
      </li>
    </ul>
    <table>
      <thead>
        <tr>
          <th>Model</th>
          <th>Size</th>
          <th>Pre-training</th>
          <th>Post-training</th>
          <th>Tasks</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <td><strong>NTv3-8M</strong></td>
          <td>8M params</td>
          <td>MLM</td>
          <td></td>
          <td>Embeddings, light inference</td>
        </tr>
        <tr>
          <td><strong>NTv3-100M</strong></td>
          <td>100M params</td>
          <td>MLM</td>
          <td><span class="checkmark"></span></td>
          <td>Tracks, annotation</td>
        </tr>
        <tr>
          <td><strong>NTv3-650M</strong></td>
          <td>650M params</td>
          <td>MLM</td>
          <td><span class="checkmark"></span></td>
          <td>Tracks, annotation, best accuracy</td>
        </tr>
      </tbody>
    </table>
  </div>

  <div class="card-stack">
    <div class="card">
      <h2>📓 Tutorial notebooks (browse <a href="https://huggingface.co/spaces/InstaDeepAI/ntv3/tree/main/notebooks_tutorials" target="_blank" rel="noopener noreferrer">folder</a>)</h2>
      <ul>
        <li><a href="https://huggingface.co/spaces/InstaDeepAI/ntv3/blob/main/notebooks_tutorials/00_quickstart_inference.ipynb" target="_blank" rel="noopener noreferrer">🚀 00 — Quickstart inference</a></li>
        <li><a href="https://huggingface.co/spaces/InstaDeepAI/ntv3/blob/main/notebooks_tutorials/01_tracks_prediction.ipynb" target="_blank" rel="noopener noreferrer">📊 01 — Tracks prediction</a></li>
        <li>🎯 02 — Fine-tune on bigwig tracks</li>
        <li>🔍 03 — Model interpretation</li>
        <li>🧪 04 — Training the NTv3 generative model</li>
      </ul>
    </div>
    <div class="card">
      <h2>📓 Pipeline notebooks (browse <a href="https://huggingface.co/spaces/InstaDeepAI/ntv3/tree/main/notebooks_pipelines" target="_blank" rel="noopener noreferrer">folder</a>)</h2>
      <ul>
        <li><a href="https://huggingface.co/spaces/InstaDeepAI/ntv3/blob/main/notebooks_pipelines/01_functional_track_prediction.ipynb" target="_blank" rel="noopener noreferrer">🎯 01 — Generate bigwig predictions for selected tracks</a></li>
        <li>🎯 02 — Fine-tune on bigwig tracks</li>
        <li>🔍 03 — Interpret a given genomic region</li>
        <li>🧪 04 — Sequence generation</li>
      </ul>
    </div>
    <div class="card">
      <h2>🔗 Links</h2>
      <ul>
        <li>📄 Paper: (add link)</li>
        <li><a href="https://github.com/instadeepai/nucleotide-transformer" target="_blank" rel="noopener noreferrer">💻 JAX model code (GitHub)</a></li>
        <li><a href="https://huggingface.co/collections/InstaDeepAI/nucleotide-transformer-v3" target="_blank" rel="noopener noreferrer">🎯 HF Model Collection (all NTv3 models)</a></li>
        <li><a href="https://huggingface.co/spaces/InstaDeepAI/ntv3/tree/main/notebooks_tutorials" target="_blank" rel="noopener noreferrer">📚 Tutorial</a> and <a href="https://huggingface.co/spaces/InstaDeepAI/ntv3/tree/main/notebooks_pipelines" target="_blank" rel="noopener noreferrer">🔧 Pipeline</a> notebooks</li>
        <li><a href="https://huggingface.co/spaces/InstaDeepAI/ntv3_benchmark" target="_blank" rel="noopener noreferrer">🏆 NTv3 benchmark leaderboard</a></li>
      </ul>
    </div>
  </div>

  <div class="card">
    <h2>🤖 Load a pre-trained model</h2>
    <p>Here is an example of how to load and use a pre-trained NTv3 model.</p>
    <div class="code"><pre><code class="language-python">from transformers import AutoTokenizer, AutoModelForMaskedLM

model_name = "InstaDeepAI/NTv3_650M_pre"

# Load model and tokenizer
model = AutoModelForMaskedLM.from_pretrained(model_name, trust_remote_code=True)
tok = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# Tokenize input sequences
batch = tok(["ATCGNATCG", "ACGT"], add_special_tokens=False, padding=True, pad_to_multiple_of=128, return_tensors="pt")

# Run model
out = model(
  **batch,
  output_hidden_states=True,
  output_attentions=True
)

# Print output shapes
print(out.logits.shape)       # (B, L, V = 11)
print(len(out.hidden_states)) # convs + transformers + deconvs
print(len(out.attentions))    # equals transformer layers = 12
</code></pre></div>
    <p>Model embeddings can be used for fine-tuning on downstream tasks.</p>
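    <p>For example, a fixed-length embedding per sequence can be built by mean-pooling the final hidden states over non-padding positions. The sketch below uses toy tensors in place of <code>out.hidden_states[-1]</code> and the tokenizer's padding mask (an assumption: a 1/0 attention mask is available alongside the tokens):</p>

```python
import torch

# Toy stand-ins for out.hidden_states[-1] and the tokenizer's padding mask.
hidden = torch.randn(2, 128, 512)      # (B, L, D) final hidden states
attention_mask = torch.ones(2, 128)    # 1 = real token, 0 = padding
attention_mask[1, 64:] = 0             # second sequence is shorter

# Mean-pool over real tokens only.
mask = attention_mask.unsqueeze(-1)    # (B, L, 1)
emb = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(emb.shape)                       # (B, D) fixed-length embeddings
```

The resulting <code>(B, D)</code> embeddings can feed a lightweight downstream classifier or regressor.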

    <p style="margin-top: 40px;">TO DO: add pipeline for fine-tuning on functional tracks or genome annotation.</p>
  </div>
  
  <div class="card">
    <h2>💻 Use a post-trained model</h2>
    <p>Here is a quick example of how to use the post-trained NTv3 650M model to predict tracks for a human genomic window.</p>
    <div class="code"><pre><code class="language-python">from transformers import pipeline
import torch

model_name = "InstaDeepAI/NTv3_650M_pos"

ntv3_tracks = pipeline(
    "ntv3-tracks",
    model=model_name,
    trust_remote_code=True,
    device=0 if torch.cuda.is_available() else -1,
)

# Run track prediction
out = ntv3_tracks(
  {
    "chrom": "chr19",
    "start": 6_700_000,
    "end": 6_831_072,
    "species": "human"
  }
)

# Print output shapes
# ~7k human tracks over the central 37.5% of the input sequence
print("bigwig_tracks_logits:", tuple(out.bigwig_tracks_logits.shape))
# Locations of 21 genomic elements over the central 37.5% of the input sequence
print("bed_tracks_logits:", tuple(out.bed_tracks_logits.shape))
# Language-model logits over the vocabulary for the whole sequence
print("language model logits:", tuple(out.mlm_logits.shape))</code></pre></div>
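    <p>Since track and element predictions cover only the central 37.5% of the input window, the genomic coordinates they correspond to can be recovered with simple arithmetic (a small helper sketched here for convenience, not part of the pipeline API):</p>

```python
def center_window(start: int, end: int, frac: float = 0.375) -> tuple[int, int]:
    """Genomic coordinates of the central `frac` portion of [start, end)."""
    length = end - start
    keep = int(length * frac)
    mid = start + length // 2
    return mid - keep // 2, mid - keep // 2 + keep

# For the chr19 example above: 131,072 bp input, 49,152 bp predicted region.
print(center_window(6_700_000, 6_831_072))  # → (6740960, 6790112)
```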
    <p>Predictions can also be plotted for a subset of functional tracks and genomic elements:</p>
    <div class="code"><pre><code class="language-python">tracks_to_plot = {
    "K562 RNA-seq": "ENCSR056HPM",
    "K562 DNase": "ENCSR921NMD",
    "K562 H3K4me3": "ENCSR000DWD",
    "K562 CTCF": "ENCSR000AKO",
    "HepG2 RNA-seq": "ENCSR561FEE_P",
    "HepG2 DNase": "ENCSR000EJV",
    "HepG2 H3K4me3": "ENCSR000AMP",
    "HepG2 CTCF": "ENCSR000BIE",
}
elements_to_plot = ["protein_coding_gene", "exon", "intron", "splice_donor", "splice_acceptor"]

out = ntv3_tracks(
    {"chrom": "chr19", "start": 6_700_000, "end": 6_831_072, "species": "human"},
    plot=True,
    tracks_to_plot=tracks_to_plot,
    elements_to_plot=elements_to_plot,
)</code></pre></div>
    <img src="assets/output_tracks.png" alt="Output tracks visualization" style="max-width: 100%; margin-top: 20px;" />
  </div>
</div>