<div class="summary">
<h2>📖 About NTv3</h2>
<p>
NTv3 is a multi-species genomic foundation model family that unifies representation learning, functional-track prediction, genome annotation, and controllable sequence generation within a single U-Net-style backbone. It models up to 1 Mb of DNA at single-base resolution, using a conv–Transformer–deconv architecture that efficiently captures both local motifs and long-range regulatory dependencies. NTv3 is first pretrained on ~9T base pairs from the OpenGenome2 corpus spanning >128k species using masked language modeling, and then post-trained with a joint objective on ~16k functional tracks and annotation labels across 24 animal and plant species, enabling state-of-the-art cross-species functional prediction and base-resolution genome annotation.
</p>
<p>
NTv3 also acts as a controllable generative model via masked-diffusion language modeling, allowing targeted design of regulatory sequences (for example, enhancers with specified activity and promoter selectivity) that have been validated experimentally.
</p>
</div>
<div class="paper-summary">
<!-- <h2>📄 A foundational model for joint sequence-function multi-species modeling at scale for long-range genomic prediction</h2> -->
<img src="assets/paper_summary.png" alt="NTv3 Paper Summary" />
</div>
<div class="why-ntv3">
<h2>✨ Why NTv3?</h2>
<ul>
<li>📏 <strong>1 Mb long context at nucleotide resolution</strong> — ~100× longer than typical genomics models.</li>
<li>🏗️ <strong>Unified architecture</strong> for: masked language modeling, functional-track prediction, genome annotation, and sequence generation.</li>
<li>🌍 <strong>Cross-species generalization</strong> across 24 animals + plants with a shared conditioned representation space.</li>
<li>⚡ <strong>U-Net–style architecture</strong> improves stability and GPU efficiency on very long sequences.</li>
<li>🎯 <strong>Controllable generative modeling</strong>, enabling targeted enhancer/promoter engineering validated by experimental assays.</li>
</ul>
</div>
<div class="grid">
<div class="card">
<h2>🤖 Models (see <a href="https://huggingface.co/collections/InstaDeepAI/nucleotide-transformer-v3" target="_blank" rel="noopener noreferrer">collection</a>)</h2>
<ul>
<li>📦 Pretrained checkpoints:
<div style="margin-top: 8px; margin-left: 0;">
<div><a href="https://huggingface.co/InstaDeepAI/NTv3_8M_pre" target="_blank" rel="noopener noreferrer"><code>InstaDeepAI/NTv3_8M_pre</code></a></div>
<div><a href="https://huggingface.co/InstaDeepAI/NTv3_100M_pre" target="_blank" rel="noopener noreferrer"><code>InstaDeepAI/NTv3_100M_pre</code></a></div>
<div><a href="https://huggingface.co/InstaDeepAI/NTv3_650M_pre" target="_blank" rel="noopener noreferrer"><code>InstaDeepAI/NTv3_650M_pre</code></a></div>
</div>
</li>
<li>🎯 Post-trained checkpoints:
<div style="margin-top: 8px; margin-left: 0;">
<div><a href="https://huggingface.co/InstaDeepAI/NTv3_100M_pos" target="_blank" rel="noopener noreferrer"><code>InstaDeepAI/NTv3_100M_pos</code></a></div>
<div><a href="https://huggingface.co/InstaDeepAI/NTv3_650M_pos" target="_blank" rel="noopener noreferrer"><code>InstaDeepAI/NTv3_650M_pos</code></a></div>
</div>
</li>
</ul>
<table>
<thead>
<tr>
<th>Model</th>
<th>Size</th>
<th>Pre-training</th>
<th>Post-training</th>
<th>Usage</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>NTv3-8M</strong></td>
<td>8M params</td>
<td><span class="checkmark">✅</span></td>
<td>❌</td>
<td>Embeddings, light inference</td>
</tr>
<tr>
<td><strong>NTv3-100M</strong></td>
<td>100M params</td>
<td><span class="checkmark">✅</span></td>
<td><span class="checkmark">✅</span></td>
<td>Embeddings, tracks, annotation</td>
</tr>
<tr>
<td><strong>NTv3-650M</strong></td>
<td>650M params</td>
<td><span class="checkmark">✅</span></td>
<td><span class="checkmark">✅</span></td>
<td>Embeddings, tracks, annotation, best accuracy</td>
</tr>
</tbody>
</table>
</div>
<div class="card-stack">
<div class="card">
<h2>📓 Tutorial notebooks (browse <a href="https://huggingface.co/spaces/InstaDeepAI/ntv3/tree/main/notebooks_tutorials" target="_blank" rel="noopener noreferrer">folder</a>)</h2>
<ul>
<li><a href="https://huggingface.co/spaces/InstaDeepAI/ntv3/blob/main/notebooks_tutorials/00_quickstart_inference.ipynb" target="_blank" rel="noopener noreferrer">🚀 00 — Quickstart inference</a></li>
<li><a href="https://huggingface.co/spaces/InstaDeepAI/ntv3/blob/main/notebooks_tutorials/01_tracks_prediction.ipynb" target="_blank" rel="noopener noreferrer">📊 01 — Tracks prediction</a></li>
<li><a href="https://huggingface.co/spaces/InstaDeepAI/ntv3/blob/main/notebooks_tutorials/02_fine_tuning_pretrained_model_biwig.ipynb" target="_blank" rel="noopener noreferrer">🎯 02 — Fine-tune a pre-trained model on bigwig tracks</a></li>
<li><a href="https://huggingface.co/spaces/InstaDeepAI/ntv3/blob/main/notebooks_tutorials/03_fine_tuning_posttrained_model_biwig.ipynb" target="_blank" rel="noopener noreferrer">🎯 03 — Fine-tune a post-trained model on bigwig tracks</a></li>
<li><a href="https://huggingface.co/spaces/InstaDeepAI/ntv3/blob/main/notebooks_tutorials/04_fine_tuning_pretrained_model_annotation.ipynb" target="_blank" rel="noopener noreferrer">🏷️ 04 — Fine-tune a pre-trained model on annotations</a></li>
<li><a href="https://huggingface.co/spaces/InstaDeepAI/ntv3/blob/main/notebooks_tutorials/05_model_interpretation.ipynb" target="_blank" rel="noopener noreferrer">🔍 05 — Model interpretation</a></li>
<li><a href="https://huggingface.co/spaces/InstaDeepAI/ntv3/blob/main/notebooks_tutorials/06_NTv3_generative_training.ipynb" target="_blank" rel="noopener noreferrer">🧪 06 — Fine-tuning NTv3 into a diffusion model</a></li>
<li><a href="https://huggingface.co/spaces/InstaDeepAI/ntv3/blob/main/notebooks_tutorials/07_enhancer_generation.ipynb" target="_blank" rel="noopener noreferrer">🪰 07 — Generating enhancer sequences</a></li>
</ul>
</div>
<div class="card">
<h2>📓 Pipeline notebooks (browse <a href="https://huggingface.co/spaces/InstaDeepAI/ntv3/tree/main/notebooks_pipelines" target="_blank" rel="noopener noreferrer">folder</a>)</h2>
<ul>
<li><a href="https://huggingface.co/spaces/InstaDeepAI/ntv3/blob/main/notebooks_pipelines/01_functional_track_prediction.ipynb" target="_blank" rel="noopener noreferrer">🎯 01 — Generate bigwig predictions for certain tracks</a></li>
<li><a href="https://huggingface.co/spaces/InstaDeepAI/ntv3/blob/main/notebooks_pipelines/02_model_interpretation.ipynb" target="_blank" rel="noopener noreferrer">🔍 02 — Interpret a given genomic region</a></li>
</ul>
</div>
<div class="card">
<h2>🔗 Links</h2>
<ul>
<li><a href="https://www.biorxiv.org/content/10.64898/2025.12.22.695963v1" target="_blank" rel="noopener noreferrer">📄 Paper</a></li>
<li><a href="https://github.com/instadeepai/nucleotide-transformer" target="_blank" rel="noopener noreferrer">💻 JAX model code (GitHub)</a></li>
<li><a href="https://huggingface.co/collections/InstaDeepAI/nucleotide-transformer-v3" target="_blank" rel="noopener noreferrer">🎯 HF Model Collection (all NTv3 models)</a></li>
<li><a href="https://huggingface.co/spaces/InstaDeepAI/ntv3/tree/main/notebooks_tutorials" target="_blank" rel="noopener noreferrer">📚 Tutorial</a> and <a href="https://huggingface.co/spaces/InstaDeepAI/ntv3/tree/main/notebooks_pipelines" target="_blank" rel="noopener noreferrer">🔧 Pipeline</a> notebooks</li>
<li><a href="https://huggingface.co/spaces/InstaDeepAI/ntv3_benchmark" target="_blank" rel="noopener noreferrer">🏆 NTv3 benchmark leaderboard</a></li>
</ul>
</div>
</div>
<div class="card-stack">
<div class="card">
<h2>🤖 Load a pre-trained model</h2>
<p>Here is an example of how to load and use a pre-trained NTv3 model.</p>
<div class="code"><pre><code class="language-python">from transformers import AutoTokenizer, AutoModelForMaskedLM
model_name = "InstaDeepAI/NTv3_650M_pre"
# Load model and tokenizer
model = AutoModelForMaskedLM.from_pretrained(model_name, trust_remote_code=True)
tok = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
# Tokenize input sequences
batch = tok(["ATCGNATCG", "ACGT"], add_special_tokens=False, padding=True, pad_to_multiple_of=128, return_tensors="pt")
# Run model
out = model(**batch)
# Print output shapes
print(out.logits.shape) # (B, L, V = 11)
</code></pre></div>
<p>Model embeddings can be used for fine-tuning on downstream tasks.</p>
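<p>As a sketch of how such embeddings might be pooled into fixed-size vectors for a downstream classifier — using dummy NumPy arrays in place of real model hidden states, since the exact hidden-state output field of the remote-code model may differ — one can mask out padding tokens and average over the sequence dimension:</p>

```python
import numpy as np

# Dummy stand-in for per-token hidden states: (batch, seq_len, hidden_dim).
# In practice these would come from the model's hidden-state outputs.
rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(2, 8, 4))

# Attention mask from the tokenizer: 1 = real token, 0 = padding.
attention_mask = np.array([[1, 1, 1, 1, 1, 1, 1, 1],
                           [1, 1, 1, 1, 0, 0, 0, 0]])

# Masked mean pooling: zero out padded positions, then divide by true lengths.
mask = attention_mask[:, :, None]                                # (B, L, 1)
embeddings = (hidden_states * mask).sum(axis=1) / mask.sum(axis=1)
print(embeddings.shape)  # (2, 4)
```

<p>The resulting per-sequence vectors can then feed a lightweight head (e.g. logistic regression or a small MLP) for sequence-level tasks.</p>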
</div>
<div class="card">
<h2>🔍 Model interpretation</h2>
<p>Here is an example of how to use the interpretation pipeline on the NTv3 post-trained model for multi-scale analysis of DNA sequences:</p>
<div class="code"><pre><code class="language-python">from transformers import pipeline
import torch
import matplotlib.pyplot as plt
model_name = "InstaDeepAI/NTv3_650M_pos"
# Build interpretation pipeline
ntv3_interpret = pipeline(
    "ntv3-interpret",
    model=model_name,
    trust_remote_code=True,
    device=0 if torch.cuda.is_available() else -1,
)
# Run interpretation on a given genomic region with tracks, annotations, attention, and saliency
result = ntv3_interpret(
    {"chrom": "chr11", "start": 5_253_561, "end": 5_286_329, "species": "human"},
    output_attention=True,
    output_saliency=True,
    saliency_track_id="ENCSR000EFT",  # K562 GATA1 ChIP-seq
    plot=True,  # plot predictions on tracks and annotations
    tracks_to_plot={"K562 RNA-seq": "ENCSR056HPM", "K562 GATA1": "ENCSR000EFT"},
    elements_to_plot=["exon", "promoter_Tissue_specific"],
)
# Access attention map results
result.plot_attention() # attention map (last layer)
plt.show()
# Access saliency scores results
result.plot_saliency(window_size=128)
plt.show()
</code></pre></div>
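<p>The <code>window_size=128</code> argument above aggregates per-base saliency into windows before plotting. A minimal sketch of such windowed averaging — on dummy scores, and assuming non-overlapping windows, since the actual smoothing inside <code>plot_saliency</code> is not documented here — looks like this:</p>

```python
import numpy as np

# Dummy per-base saliency scores for a short region (real regions span kilobases).
rng = np.random.default_rng(1)
saliency = rng.random(1024)

window_size = 128
# Non-overlapping window averaging: one smoothed value per 128-bp window.
smoothed = saliency.reshape(-1, window_size).mean(axis=1)
print(smoothed.shape)  # (8,)
```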
<img src="assets/saliency_example.png" alt="Output tracks visualization" style="max-width: 100%; margin-top: 20px;" />
</div>
</div>
<div class="card">
<h2>💻 Use a post-trained model</h2>
<p>Here is a quick example of how to use the post-trained NTv3 650M model to predict tracks for a human genomic window.</p>
<div class="code"><pre><code class="language-python">from transformers import pipeline
import torch
model_name = "InstaDeepAI/NTv3_650M_pos"
ntv3_tracks = pipeline(
    "ntv3-tracks",
    model=model_name,
    trust_remote_code=True,
    device=0 if torch.cuda.is_available() else -1,
)
# Run track prediction
out = ntv3_tracks(
    {
        "chrom": "chr19",
        "start": 6_700_000,
        "end": 6_831_072,
        "species": "human",
    }
)
# Print output shapes
# ~7k human track predictions over the central 37.5% of the input sequence
print("bigwig_tracks_logits:", tuple(out.bigwig_tracks_logits.shape))
# Logits for 21 genomic element classes over the central 37.5% of the input sequence
print("bed_tracks_logits:", tuple(out.bed_tracks_logits.shape))
# Language model logits for whole sequence over vocabulary
print("language model logits:", tuple(out.mlm_logits.shape))</code></pre></div>
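<p>The genomic coordinates covered by the track outputs can be checked with a little arithmetic. The sketch below assumes the central 37.5% crop is symmetric (an assumption about the implementation, not stated above):</p>

```python
# The example window chr19:6,700,000-6,831,072 is 131,072 bp (= 2**17).
seq_len = 6_831_072 - 6_700_000

# Predictions cover the central 37.5% of the window; assuming a symmetric
# crop, an equal margin is trimmed on each side.
center_len = int(seq_len * 0.375)     # 49,152 bp
margin = (seq_len - center_len) // 2  # 40,960 bp per side
center_start = 6_700_000 + margin
center_end = center_start + center_len
print(center_len, center_start, center_end)  # 49152 6740960 6790112
```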
<p>Predictions can also be plotted for a subset of functional tracks and genomic elements:</p>
<div class="code"><pre><code class="language-python">tracks_to_plot = {
    "K562 RNA-seq": "ENCSR056HPM",
    "K562 DNase": "ENCSR921NMD",
    "K562 H3K4me3": "ENCSR000DWD",
    "K562 CTCF": "ENCSR000AKO",
    "HepG2 RNA-seq": "ENCSR561FEE_P",
    "HepG2 DNase": "ENCSR000EJV",
    "HepG2 H3K4me3": "ENCSR000AMP",
    "HepG2 CTCF": "ENCSR000BIE",
}
elements_to_plot = ["protein_coding_gene", "exon", "intron", "splice_donor", "splice_acceptor"]
out = ntv3_tracks(
    {"chrom": "chr19", "start": 6_700_000, "end": 6_831_072, "species": "human"},
    plot=True,
    tracks_to_plot=tracks_to_plot,
    elements_to_plot=elements_to_plot,
)</code></pre></div>
<img src="assets/output_tracks.png" alt="Output tracks visualization" style="max-width: 100%; margin-top: 20px;" />
</div>
</div>