<div class="summary">
<h2>📖 About NTv3</h2>
<p>
NTv3 is a multi-species genomic foundation model family that unifies representation learning, functional-track prediction, genome annotation, and controllable sequence generation within a single U-Net-style backbone. It models up to 1 Mb of DNA at single-base resolution, using a conv–Transformer–deconv architecture that efficiently captures both local motifs and long-range regulatory dependencies. NTv3 is first pretrained on ~9T base pairs from the OpenGenome2 corpus spanning >128k species using masked language modeling, and then post-trained with a joint objective on ~16k functional tracks and annotation labels across 24 animal and plant species, enabling state-of-the-art cross-species functional prediction and base-resolution genome annotation.
</p>
<p>
NTv3 also acts as a controllable generative model via masked-diffusion language modeling, allowing targeted design of regulatory sequences (for example, enhancers with specified activity and promoter selectivity) that have been validated experimentally.
</p>
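<p>As a rough illustration of that mechanism, masked-diffusion decoding starts from a fully masked sequence and iteratively commits the most confident predictions. The sketch below is generic and assumes only that the tokenizer defines a mask token; tutorial notebook 06 covers the actual NTv3 generative recipe.</p>
<div class="code"><pre><code class="language-python">import torch

# Generic confidence-based iterative unmasking (illustrative sketch only; see
# notebook 06_NTv3_generative_training.ipynb for the actual NTv3 recipe).
# Assumes the tokenizer defines a mask token and the model returns
# per-position logits, as in the loading example further down this page.
@torch.no_grad()
def masked_diffusion_sample(model, tok, length=200, steps=10):
    ids = torch.full((1, length), tok.mask_token_id, dtype=torch.long)
    for step in range(steps):
        logits = model(input_ids=ids).logits        # (1, length, vocab)
        conf, pred = logits.softmax(-1).max(-1)     # per-position confidence
        masked = ids.eq(tok.mask_token_id)
        if not masked.any():
            break
        # Commit a share of the remaining masked positions this step
        k = max(1, int(masked.sum()) // (steps - step))
        conf = conf.masked_fill(~masked, float("-inf"))
        top = conf[0].topk(k).indices
        ids[0, top] = pred[0, top]
    return tok.decode(ids[0])
</code></pre></div>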
</div>
<div class="paper-summary">
<!-- <h2>📄 A foundational model for joint sequence-function multi-species modeling at scale for long-range genomic prediction</h2> -->
<img src="assets/paper_summary.png" alt="NTv3 Paper Summary" />
</div>
<div class="why-ntv3">
<h2>✨ Why NTv3?</h2>
<ul>
<li>📏 <strong>1 Mb long context at nucleotide resolution</strong> — ~100× longer than typical genomics models.</li>
<li>🏗️ <strong>Unified architecture</strong> for: masked language modeling, functional-track prediction, genome annotation, and sequence generation.</li>
<li>🌍 <strong>Cross-species generalization</strong> across 24 animal and plant species with a shared, species-conditioned representation space.</li>
<li>⚙️ <strong>U-Net–style architecture</strong> improves stability and GPU efficiency on very long sequences.</li>
<li>🎯 <strong>Controllable generative modeling</strong>, enabling targeted enhancer/promoter engineering validated by experimental assays.</li>
</ul>
</div>
<div class="grid">
<div class="card">
<h2>🤖 Models (see <a href="https://huggingface.co/collections/InstaDeepAI/nucleotide-transformer-v3" target="_blank" rel="noopener noreferrer">collection</a>)</h2>
<ul>
<li>📦 Pretrained checkpoints:
<div style="margin-top: 8px; margin-left: 0;">
<div><a href="https://huggingface.co/InstaDeepAI/NTv3_8M_pre" target="_blank" rel="noopener noreferrer"><code>InstaDeepAI/NTv3_8M_pre</code></a></div>
<div><a href="https://huggingface.co/InstaDeepAI/NTv3_100M_pre" target="_blank" rel="noopener noreferrer"><code>InstaDeepAI/NTv3_100M_pre</code></a></div>
<div><a href="https://huggingface.co/InstaDeepAI/NTv3_650M_pre" target="_blank" rel="noopener noreferrer"><code>InstaDeepAI/NTv3_650M_pre</code></a></div>
</div>
</li>
<li>🎯 Post-trained checkpoints:
<div style="margin-top: 8px; margin-left: 0;">
<div><a href="https://huggingface.co/InstaDeepAI/NTv3_100M_pos" target="_blank" rel="noopener noreferrer"><code>InstaDeepAI/NTv3_100M_pos</code></a></div>
<div><a href="https://huggingface.co/InstaDeepAI/NTv3_650M_pos" target="_blank" rel="noopener noreferrer"><code>InstaDeepAI/NTv3_650M_pos</code></a></div>
</div>
</li>
</ul>
<table>
<thead>
<tr>
<th>Model</th>
<th>Size</th>
<th>Pre-training</th>
<th>Post-training</th>
<th>Usage</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>NTv3-8M</strong></td>
<td>8M params</td>
<td><span class="checkmark"></span></td>
<td></td>
<td>Embeddings, light inference</td>
</tr>
<tr>
<td><strong>NTv3-100M</strong></td>
<td>100M params</td>
<td><span class="checkmark"></span></td>
<td><span class="checkmark"></span></td>
<td>Embeddings, tracks, annotation</td>
</tr>
<tr>
<td><strong>NTv3-650M</strong></td>
<td>650M params</td>
<td><span class="checkmark"></span></td>
<td><span class="checkmark"></span></td>
<td>Embeddings, tracks, annotation, best accuracy</td>
</tr>
</tbody>
</table>
</div>
<div class="card-stack">
<div class="card">
<h2>📓 Tutorial notebooks (browse <a href="https://huggingface.co/spaces/InstaDeepAI/ntv3/tree/main/notebooks_tutorials" target="_blank" rel="noopener noreferrer">folder</a>)</h2>
<ul>
<li><a href="https://huggingface.co/spaces/InstaDeepAI/ntv3/blob/main/notebooks_tutorials/00_quickstart_inference.ipynb" target="_blank" rel="noopener noreferrer">🚀 00 — Quickstart inference</a></li>
<li><a href="https://huggingface.co/spaces/InstaDeepAI/ntv3/blob/main/notebooks_tutorials/01_tracks_prediction.ipynb" target="_blank" rel="noopener noreferrer">📊 01 — Tracks prediction</a></li>
<li><a href="https://huggingface.co/spaces/InstaDeepAI/ntv3/blob/main/notebooks_tutorials/02_fine_tuning_pretrained_model_biwig.ipynb" target="_blank" rel="noopener noreferrer">🎯 02 — Fine-tune a pre-trained model on bigWig tracks</a></li>
<li><a href="https://huggingface.co/spaces/InstaDeepAI/ntv3/blob/main/notebooks_tutorials/03_fine_tuning_posttrained_model_biwig.ipynb" target="_blank" rel="noopener noreferrer">🎯 03 — Fine-tune a post-trained model on bigWig tracks</a></li>
<li><a href="https://huggingface.co/spaces/InstaDeepAI/ntv3/blob/main/notebooks_tutorials/04_fine_tuning_pretrained_model_annotation.ipynb" target="_blank" rel="noopener noreferrer">🏷️ 04 — Fine-tune a pre-trained model on annotations</a></li>
<li><a href="https://huggingface.co/spaces/InstaDeepAI/ntv3/blob/main/notebooks_tutorials/05_model_interpretation.ipynb" target="_blank" rel="noopener noreferrer">🔍 05 — Model interpretation</a></li>
<li><a href="https://huggingface.co/spaces/InstaDeepAI/ntv3/blob/main/notebooks_tutorials/06_NTv3_generative_training.ipynb" target="_blank" rel="noopener noreferrer">🧪 06 — Fine-tuning NTv3 into a diffusion model</a></li>
<li><a href="https://huggingface.co/spaces/InstaDeepAI/ntv3/blob/main/notebooks_tutorials/07_enhancer_generation.ipynb" target="_blank" rel="noopener noreferrer">🪰 07 — Generating enhancer sequences</a></li>
</ul>
</div>
<div class="card">
<h2>📓 Pipeline notebooks (browse <a href="https://huggingface.co/spaces/InstaDeepAI/ntv3/tree/main/notebooks_pipelines" target="_blank" rel="noopener noreferrer">folder</a>)</h2>
<ul>
<li><a href="https://huggingface.co/spaces/InstaDeepAI/ntv3/blob/main/notebooks_pipelines/01_functional_track_prediction.ipynb" target="_blank" rel="noopener noreferrer">🎯 01 — Generate bigWig predictions for selected tracks</a></li>
<li><a href="https://huggingface.co/spaces/InstaDeepAI/ntv3/blob/main/notebooks_pipelines/02_model_interpretation.ipynb" target="_blank" rel="noopener noreferrer">🔍 02 — Interpret a given genomic region</a></li>
</ul>
</div>
<div class="card">
<h2>🔗 Links</h2>
<ul>
<li><a href="https://www.biorxiv.org/content/10.64898/2025.12.22.695963v1" target="_blank" rel="noopener noreferrer">📄 Paper</a></li>
<li><a href="https://github.com/instadeepai/nucleotide-transformer" target="_blank" rel="noopener noreferrer">💻 JAX model code (GitHub)</a></li>
<li><a href="https://huggingface.co/collections/InstaDeepAI/nucleotide-transformer-v3" target="_blank" rel="noopener noreferrer">🎯 HF Model Collection (all NTv3 models)</a></li>
<li><a href="https://huggingface.co/spaces/InstaDeepAI/ntv3/tree/main/notebooks_tutorials" target="_blank" rel="noopener noreferrer">📚 Tutorial</a> and <a href="https://huggingface.co/spaces/InstaDeepAI/ntv3/tree/main/notebooks_pipelines" target="_blank" rel="noopener noreferrer">🔧 Pipeline</a> notebooks</li>
<li><a href="https://huggingface.co/spaces/InstaDeepAI/ntv3_benchmark" target="_blank" rel="noopener noreferrer">🏆 NTv3 benchmark leaderboard</a></li>
</ul>
</div>
</div>
<div class="card-stack">
<div class="card">
<h2>🤖 Load a pre-trained model</h2>
<p>Here is an example of how to load and use a pre-trained NTv3 model.</p>
<div class="code"><pre><code class="language-python">from transformers import AutoTokenizer, AutoModelForMaskedLM
model_name = "InstaDeepAI/NTv3_650M_pre"
# Load model and tokenizer
model = AutoModelForMaskedLM.from_pretrained(model_name, trust_remote_code=True)
tok = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
# Tokenize input sequences
batch = tok(
    ["ATCGNATCG", "ACGT"],
    add_special_tokens=False,
    padding=True,
    pad_to_multiple_of=128,
    return_tensors="pt",
)
# Run model
out = model(**batch)
# Print output shapes
print(out.logits.shape) # (B, L, V = 11)
</code></pre></div>
<p>Model embeddings can be used for fine-tuning on downstream tasks.</p>
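<p>For example, per-base embeddings can be pooled into sequence-level features for a downstream classifier. This is a minimal sketch that assumes the remote-code model follows the standard Transformers convention of returning hidden states when <code>output_hidden_states=True</code> is passed:</p>
<div class="code"><pre><code class="language-python"># Minimal embedding-extraction sketch (assumes standard hidden-states output)
out = model(**batch, output_hidden_states=True)
per_base = out.hidden_states[-1]                      # (B, L, hidden_dim)
# Mask padding positions before averaging over the sequence axis
mask = batch["attention_mask"].unsqueeze(-1).to(per_base.dtype)
seq_emb = (per_base * mask).sum(dim=1) / mask.sum(dim=1)
print(seq_emb.shape)                                  # (B, hidden_dim)
</code></pre></div>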
</div>
<div class="card">
<h2>🔍 Model interpretation</h2>
<p>Here is an example of how to use the interpretation pipeline on the NTv3 post-trained model for multi-scale analysis of DNA sequences:</p>
<div class="code"><pre><code class="language-python">from transformers import pipeline
import torch
import matplotlib.pyplot as plt
model_name = "InstaDeepAI/NTv3_650M_pos"
# Build interpretation pipeline
ntv3_interpret = pipeline(
    "ntv3-interpret",
    model=model_name,
    trust_remote_code=True,
    device=0 if torch.cuda.is_available() else -1,
)
# Run interpretation on a given genomic region with tracks, annotations, attention, and saliency
result = ntv3_interpret(
    {"chrom": "chr11", "start": 5_253_561, "end": 5_286_329, "species": "human"},
    output_attention=True,
    output_saliency=True,
    saliency_track_id="ENCSR000EFT",  # K562 GATA1 ChIP-seq
    plot=True,  # plot predictions on tracks and annotations
    tracks_to_plot={"K562 RNA-seq": "ENCSR056HPM", "K562 GATA1": "ENCSR000EFT"},
    elements_to_plot=["exon", "promoter_Tissue_specific"],
)
# Access attention map results
result.plot_attention() # attention map (last layer)
plt.show()
# Access saliency scores results
result.plot_saliency(window_size=128)
plt.show()
</code></pre></div>
<img src="assets/saliency_example.png" alt="Saliency and attention visualization example" style="max-width: 100%; margin-top: 20px;" />
</div>
</div>
<div class="card">
<h2>💻 Use a post-trained model</h2>
<p>Here is a quick example of how to use the post-trained NTv3 650M model to predict tracks for a human genomic window.</p>
<div class="code"><pre><code class="language-python">from transformers import pipeline
import torch
model_name = "InstaDeepAI/NTv3_650M_pos"
ntv3_tracks = pipeline(
    "ntv3-tracks",
    model=model_name,
    trust_remote_code=True,
    device=0 if torch.cuda.is_available() else -1,
)
# Run track prediction
out = ntv3_tracks(
    {
        "chrom": "chr19",
        "start": 6_700_000,
        "end": 6_831_072,
        "species": "human",
    }
)
# Print output shapes
# 7k human functional tracks over the central 37.5% of the input sequence
print("bigwig_tracks_logits:", tuple(out.bigwig_tracks_logits.shape))
# Locations of 21 genomic element types over the central 37.5% of the input sequence
print("bed_tracks_logits:", tuple(out.bed_tracks_logits.shape))
# Language-model logits over the vocabulary for the whole sequence
print("language model logits:", tuple(out.mlm_logits.shape))</code></pre></div>
<p>Predictions can also be plotted for a subset of functional tracks and genomic elements:</p>
<div class="code"><pre><code class="language-python">tracks_to_plot = {
"K562 RNA-seq": "ENCSR056HPM",
"K562 DNAse": "ENCSR921NMD",
"K562 H3k4me3": "ENCSR000DWD",
"K562 CTCF": "ENCSR000AKO",
"HepG2 RNA-seq": "ENCSR561FEE_P",
"HepG2 DNAse": "ENCSR000EJV",
"HepG2 H3k4me3": "ENCSR000AMP",
"HepG2 CTCF": "ENCSR000BIE",
}
elements_to_plot = ["protein_coding_gene", "exon", "intron", "splice_donor", "splice_acceptor"]
out = ntv3_tracks(
{"chrom": "chr19", "start": 6_700_000, "end": 6_831_072, "species": "human"},
plot=True,
tracks_to_plot=tracks_to_plot,
elements_to_plot=elements_to_plot,
)</code></pre></div>
<img src="assets/output_tracks.png" alt="Output tracks visualization" style="max-width: 100%; margin-top: 20px;" />
</div>
</div>