<div class="summary">
<h2>📖 About NTv3</h2>
<p>
NTv3 is a multi-species genomic foundation model family that unifies representation learning, functional-track prediction, genome annotation, and controllable sequence generation within a single U-Net-style backbone. It models up to 1 Mb of DNA at single-base resolution, using a conv–Transformer–deconv architecture that efficiently captures both local motifs and long-range regulatory dependencies. NTv3 is first pretrained on ~9T base pairs from the OpenGenome2 corpus spanning >128k species using masked language modeling, and then post-trained with a joint objective on ~16k functional tracks and annotation labels across 24 animal and plant species, enabling state-of-the-art cross-species functional prediction and base-resolution genome annotation.
</p>
<p>
Beyond prediction, NTv3 can be fine-tuned into a controllable generative model via masked-diffusion language modeling, allowing targeted design of regulatory sequences (for example, enhancers with specified activity and promoter selectivity) that have been validated experimentally.
</p>
</div>
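<p>The masked-language-modeling pretraining objective described above can be illustrated with a toy sketch. Everything here is hypothetical (token ids, mask rate, vocabulary) and is not NTv3's actual tokenizer or masking pipeline; it only shows the BERT-style convention of predicting original bases at masked positions while ignoring the rest in the loss.</p>

```python
import random

# Toy illustration of masked language modeling on DNA tokens.
# Hypothetical token ids; the real NTv3 tokenizer differs.
VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3, "N": 4, "<mask>": 5}

def mask_sequence(seq, mask_rate=0.15, rng=None):
    """Replace ~mask_rate of positions with <mask>; return inputs and labels.

    Labels are -100 (ignored by the loss) everywhere except masked positions,
    mirroring the usual BERT-style convention.
    """
    rng = rng or random.Random(0)
    tokens = [VOCAB[base] for base in seq]
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_rate:
            inputs.append(VOCAB["<mask>"])
            labels.append(tok)       # predict the original base here
        else:
            inputs.append(tok)
            labels.append(-100)      # no loss at unmasked positions
    return inputs, labels

inputs, labels = mask_sequence("ACGTACGTACGTACGTACGT")
```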
<div class="paper-summary">
<!-- <h2>📄 A foundational model for joint sequence-function multi-species modeling at scale for long-range genomic prediction</h2> -->
<img src="assets/paper_summary.png" alt="NTv3 Paper Summary" />
</div>
<div class="why-ntv3">
<h2>✨ Why NTv3?</h2>
<ul>
<li>📏 <strong>1 Mb long context at nucleotide resolution</strong> — ~100× longer than typical genomics models.</li>
<li>🏗️ <strong>Unified architecture</strong> for: masked language modeling, functional-track prediction, genome annotation, and sequence generation.</li>
<li>🌍 <strong>Cross-species generalization</strong> across 24 animal and plant species with a shared, species-conditioned representation space.</li>
<li>⚡ <strong>U-Net–style architecture</strong> improves stability and GPU efficiency on very long sequences.</li>
<li>🎯 <strong>Controllable generative modeling</strong>, enabling targeted enhancer/promoter engineering validated by experimental assays.</li>
</ul>
</div>
<div class="grid">
<div class="card">
<h2>🤖 Models (see <a href="https://huggingface.co/collections/InstaDeepAI/nucleotide-transformer-v3" target="_blank" rel="noopener noreferrer">collection</a>)</h2>
<ul>
<li>📦 Pretrained checkpoints:
<div style="margin-top: 8px; margin-left: 0;">
<div><a href="https://huggingface.co/InstaDeepAI/NTv3_8M_pre" target="_blank" rel="noopener noreferrer"><code>InstaDeepAI/NTv3_8M_pre</code></a></div>
<div><a href="https://huggingface.co/InstaDeepAI/NTv3_100M_pre" target="_blank" rel="noopener noreferrer"><code>InstaDeepAI/NTv3_100M_pre</code></a></div>
<div><a href="https://huggingface.co/InstaDeepAI/NTv3_650M_pre" target="_blank" rel="noopener noreferrer"><code>InstaDeepAI/NTv3_650M_pre</code></a></div>
</div>
</li>
<li>🎯 Post-trained checkpoints:
<div style="margin-top: 8px; margin-left: 0;">
<div><a href="https://huggingface.co/InstaDeepAI/NTv3_100M_pos" target="_blank" rel="noopener noreferrer"><code>InstaDeepAI/NTv3_100M_pos</code></a></div>
<div><a href="https://huggingface.co/InstaDeepAI/NTv3_650M_pos" target="_blank" rel="noopener noreferrer"><code>InstaDeepAI/NTv3_650M_pos</code></a></div>
</div>
</li>
</ul>
<table>
<thead>
<tr>
<th>Model</th>
<th>Size</th>
<th>Pre-training</th>
<th>Post-training</th>
<th>Tasks</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>NTv3-8M</strong></td>
<td>8M params</td>
<td>MLM</td>
<td>❌</td>
<td>Embeddings, light inference</td>
</tr>
<tr>
<td><strong>NTv3-100M</strong></td>
<td>100M params</td>
<td>MLM</td>
<td><span class="checkmark">✅</span></td>
<td>Tracks, annotation</td>
</tr>
<tr>
<td><strong>NTv3-650M</strong></td>
<td>650M params</td>
<td>MLM</td>
<td><span class="checkmark">✅</span></td>
<td>Tracks, annotation, best accuracy</td>
</tr>
</tbody>
</table>
</div>
<div class="card-stack">
<div class="card">
<h2>📓 Tutorial notebooks (browse <a href="https://huggingface.co/spaces/InstaDeepAI/ntv3/tree/main/notebooks_tutorials" target="_blank" rel="noopener noreferrer">folder</a>)</h2>
<ul>
<li><a href="https://huggingface.co/spaces/InstaDeepAI/ntv3/blob/main/notebooks_tutorials/00_quickstart_inference.ipynb" target="_blank" rel="noopener noreferrer">🚀 00 — Quickstart inference</a></li>
<li><a href="https://huggingface.co/spaces/InstaDeepAI/ntv3/blob/main/notebooks_tutorials/01_tracks_prediction.ipynb" target="_blank" rel="noopener noreferrer">📊 01 — Tracks prediction</a></li>
<li>🎯 02 — Fine-tune on bigwig tracks</li>
<li>🔍 03 — Model interpretation</li>
<li>🧪 04 — Training the NTv3 generative model</li>
</ul>
</div>
<div class="card">
<h2>📓 Pipeline notebooks (browse <a href="https://huggingface.co/spaces/InstaDeepAI/ntv3/tree/main/notebooks_pipelines" target="_blank" rel="noopener noreferrer">folder</a>)</h2>
<ul>
<li><a href="https://huggingface.co/spaces/InstaDeepAI/ntv3/blob/main/notebooks_pipelines/01_functional_track_prediction.ipynb" target="_blank" rel="noopener noreferrer">🎯 01 — Generate bigwig predictions for selected tracks</a></li>
<li>🎯 02 — Fine-tune on bigwig tracks</li>
<li>🔍 03 — Interpret a given genomic region</li>
<li>🧪 04 — Sequence generation</li>
</ul>
</div>
<div class="card">
<h2>🔗 Links</h2>
<ul>
<li>📄 Paper: (add link)</li>
<li><a href="https://github.com/instadeepai/nucleotide-transformer" target="_blank" rel="noopener noreferrer">💻 JAX model code (GitHub)</a></li>
<li><a href="https://huggingface.co/collections/InstaDeepAI/nucleotide-transformer-v3" target="_blank" rel="noopener noreferrer">🎯 HF Model Collection (all NTv3 models)</a></li>
<li><a href="https://huggingface.co/spaces/InstaDeepAI/ntv3/tree/main/notebooks_tutorials" target="_blank" rel="noopener noreferrer">📚 Tutorial </a> and <a href="https://huggingface.co/spaces/InstaDeepAI/ntv3/tree/main/notebooks_pipelines" target="_blank" rel="noopener noreferrer">🔧 Pipeline</a> notebooks</li>
<li><a href="https://huggingface.co/spaces/InstaDeepAI/ntv3_benchmark" target="_blank" rel="noopener noreferrer">🏆 NTv3 benchmark leaderboard</a></li>
</ul>
</div>
</div>
<div class="card">
<h2>🤖 Load a pre-trained model</h2>
<p>Here is an example of how to load and use a pre-trained NTv3 model.</p>
<div class="code"><pre><code class="language-python">from transformers import AutoTokenizer, AutoModelForMaskedLM

model_name = "InstaDeepAI/NTv3_650M_pre"

# Load model and tokenizer
model = AutoModelForMaskedLM.from_pretrained(model_name, trust_remote_code=True)
tok = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# Tokenize input sequences
batch = tok(["ATCGNATCG", "ACGT"], add_special_tokens=False, padding=True, pad_to_multiple_of=128, return_tensors="pt")

# Run model
out = model(
    **batch,
    output_hidden_states=True,
    output_attentions=True,
)

# Print output shapes
print(out.logits.shape)        # (B, L, V = 11)
print(len(out.hidden_states))  # convs + transformers + deconvs
print(len(out.attentions))     # equals transformer layers = 12
</code></pre></div>
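<p>The <code>pad_to_multiple_of=128</code> argument above presumably keeps the input length compatible with the U-Net-style down- and up-sampling path (an assumption here; the exact stride is not stated on this page). A quick sketch of the resulting padded length, with the factor 128 taken from the call above:</p>

```python
import math

def padded_length(n_tokens: int, multiple: int = 128) -> int:
    """Smallest multiple of `multiple` that fits n_tokens."""
    return math.ceil(n_tokens / multiple) * multiple

# The two sequences in the example above ("ATCGNATCG" = 9 tokens,
# "ACGT" = 4 tokens) are both padded to a batch of length 128.
print(padded_length(9))          # 128
print(padded_length(4))          # 128
print(padded_length(1_000_000))  # 1000064
```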
<p>Model embeddings can be used for fine-tuning on downstream tasks.</p>
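<p>One common recipe is attention-mask-aware mean pooling of the final hidden states into a single fixed-size embedding per sequence. The sketch below uses dummy tensors so it is self-contained; shapes (B=2, L=128, D=16) are illustrative, and with a real model you would pass <code>out.hidden_states[-1]</code> and <code>batch["attention_mask"]</code> instead.</p>

```python
import torch

def mean_pool(hidden: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Mean-pool per-token hidden states, ignoring padded positions."""
    mask = mask.unsqueeze(-1).to(hidden.dtype)  # (B, L, 1)
    summed = (hidden * mask).sum(dim=1)         # sum over real tokens only
    counts = mask.sum(dim=1).clamp(min=1)       # avoid divide-by-zero
    return summed / counts                      # (B, D)

# Dummy stand-ins for out.hidden_states[-1] and batch["attention_mask"]
hidden = torch.randn(2, 128, 16)
mask = torch.zeros(2, 128, dtype=torch.long)
mask[0, :9] = 1   # "ATCGNATCG" contributes 9 real tokens
mask[1, :4] = 1   # "ACGT" contributes 4
emb = mean_pool(hidden, mask)
print(emb.shape)  # torch.Size([2, 16])
```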
<p style="margin-top: 40px;">TO DO: add pipeline for fine-tuning on functional tracks or genome annotation.</p>
</div>
<div class="card">
<h2>💻 Use a post-trained model</h2>
<p>Here is a quick example of how to use the post-trained NTv3 650M model to predict tracks for a human genomic window.</p>
<div class="code"><pre><code class="language-python">from transformers import pipeline
import torch

model_name = "InstaDeepAI/NTv3_650M_pos"
ntv3_tracks = pipeline(
    "ntv3-tracks",
    model=model_name,
    trust_remote_code=True,
    device=0 if torch.cuda.is_available() else -1,
)

# Run track prediction
out = ntv3_tracks(
    {
        "chrom": "chr19",
        "start": 6_700_000,
        "end": 6_831_072,
        "species": "human",
    }
)

# Print output shapes
# 7k human tracks over the central 37.5% of the input sequence
print("bigwig_tracks_logits:", tuple(out.bigwig_tracks_logits.shape))
# Locations of 21 genomic element types over the central 37.5% of the input sequence
print("bed_tracks_logits:", tuple(out.bed_tracks_logits.shape))
# Language-model logits over the vocabulary for the whole sequence
print("language model logits:", tuple(out.mlm_logits.shape))</code></pre></div>
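<p>Since track and element predictions cover only the central 37.5% of the input window, it can be handy to map that region back to genomic coordinates. The helper below assumes the crop is symmetric around the window center (an assumption of this sketch, not something stated by the pipeline API):</p>

```python
def center_region(start: int, end: int, fraction: float = 0.375):
    """Genomic coordinates of the central `fraction` of [start, end)."""
    length = end - start
    kept = int(length * fraction)
    offset = (length - kept) // 2
    return start + offset, start + offset + kept

# The example window above: chr19:6,700,000-6,831,072 (131,072 bp)
print(center_region(6_700_000, 6_831_072))  # (6740960, 6790112)
```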
<p>Predictions can also be plotted for a subset of functional tracks and genomic elements:</p>
<div class="code"><pre><code class="language-python">tracks_to_plot = {
    "K562 RNA-seq": "ENCSR056HPM",
    "K562 DNase": "ENCSR921NMD",
    "K562 H3K4me3": "ENCSR000DWD",
    "K562 CTCF": "ENCSR000AKO",
    "HepG2 RNA-seq": "ENCSR561FEE_P",
    "HepG2 DNase": "ENCSR000EJV",
    "HepG2 H3K4me3": "ENCSR000AMP",
    "HepG2 CTCF": "ENCSR000BIE",
}
elements_to_plot = ["protein_coding_gene", "exon", "intron", "splice_donor", "splice_acceptor"]

out = ntv3_tracks(
    {"chrom": "chr19", "start": 6_700_000, "end": 6_831_072, "species": "human"},
    plot=True,
    tracks_to_plot=tracks_to_plot,
    elements_to_plot=elements_to_plot,
)</code></pre></div>
<img src="assets/output_tracks.png" alt="Output tracks visualization" style="max-width: 100%; margin-top: 20px;" />
</div>
</div>