NTv3 is a multi-species genomic foundation model family that unifies representation learning, functional-track prediction, genome annotation, and controllable sequence generation within a single U-Net-style backbone. It models up to 1 Mb of DNA at single-base resolution, using a conv→Transformer→deconv architecture that efficiently captures both local motifs and long-range regulatory dependencies. NTv3 is first pretrained on ~9T base pairs from the OpenGenome2 corpus spanning >128k species using masked language modeling, and then post-trained with a joint objective on ~16k functional tracks and annotation labels across 24 animal and plant species, enabling state-of-the-art cross-species functional prediction and base-resolution genome annotation.
Beyond prediction, NTv3 can be fine-tuned into a controllable generative model via masked-diffusion language modeling, allowing targeted design of regulatory sequences (for example, enhancers with specified activity and promoter selectivity) that have been validated experimentally.
| Model | Size | Pre-training | Post-training | Tasks |
|---|---|---|---|---|
| NTv3-8M | 8M params | MLM | ✗ | Embeddings, light inference |
| NTv3-100M | 100M params | MLM | ✗ | Tracks, annotation |
| NTv3-650M | 650M params | MLM | ✗ | Tracks, annotation, best accuracy |
Here is an example of how to load and use a pre-trained NTv3 model.
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_name = "InstaDeepAI/NTv3_650M_pre"

# Load model and tokenizer
model = AutoModelForMaskedLM.from_pretrained(model_name, trust_remote_code=True)
tok = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# Tokenize input sequences
batch = tok(
    ["ATCGNATCG", "ACGT"],
    add_special_tokens=False,
    padding=True,
    pad_to_multiple_of=128,
    return_tensors="pt",
)

# Run model
out = model(
    **batch,
    output_hidden_states=True,
    output_attentions=True,
)

# Print output shapes
print(out.logits.shape)        # (B, L, V = 11)
print(len(out.hidden_states))  # convs + transformers + deconvs
print(len(out.attentions))     # equals transformer layers = 12
```
Model embeddings can be used for fine-tuning on downstream tasks.
TODO: add a pipeline for fine-tuning on functional tracks or genome annotation.
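As a minimal sketch of how a pooled sequence embedding could be derived from the per-base hidden states above, the snippet below mean-pools `out.hidden_states[-1]` under the attention mask. A random tensor stands in for real NTv3 hidden states (the hidden size 1280 here is illustrative, not taken from the model card); the pooling logic itself is model-agnostic.

```python
import torch

def mean_pool(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average per-base embeddings, ignoring padded positions."""
    mask = attention_mask.unsqueeze(-1).float()  # (B, L, 1)
    summed = (hidden_states * mask).sum(dim=1)   # (B, D)
    counts = mask.sum(dim=1).clamp(min=1.0)      # (B, 1), avoid division by zero
    return summed / counts

# Stand-in for out.hidden_states[-1] on a padded batch of 2 sequences
B, L, D = 2, 128, 1280
hidden = torch.randn(B, L, D)
attn = torch.ones(B, L, dtype=torch.long)
attn[1, 4:] = 0  # second sequence is only 4 bases long; the rest is padding

emb = mean_pool(hidden, attn)
print(emb.shape)  # torch.Size([2, 1280])
```

The resulting fixed-size vectors can then feed a lightweight downstream classifier or regressor.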
Here is a quick example of how to use the post-trained NTv3 650M model to predict tracks for a human genomic window.
```python
from transformers import pipeline
import torch

model_name = "InstaDeepAI/NTv3_650M_pos"
ntv3_tracks = pipeline(
    "ntv3-tracks",
    model=model_name,
    trust_remote_code=True,
    device=0 if torch.cuda.is_available() else -1,
)

# Run track prediction
out = ntv3_tracks(
    {
        "chrom": "chr19",
        "start": 6_700_000,
        "end": 6_831_072,
        "species": "human",
    }
)

# Print output shapes
# ~7k human tracks over the central 37.5% of the input sequence
print("bigwig_tracks_logits:", tuple(out.bigwig_tracks_logits.shape))
# Locations of 21 genomic element classes over the central 37.5% of the input sequence
print("bed_tracks_logits:", tuple(out.bed_tracks_logits.shape))
# Language-model logits over the vocabulary for the whole sequence
print("language model logits:", tuple(out.mlm_logits.shape))
```

Predictions can also be plotted for a subset of functional tracks and genomic elements:
```python
tracks_to_plot = {
    "K562 RNA-seq": "ENCSR056HPM",
    "K562 DNase": "ENCSR921NMD",
    "K562 H3K4me3": "ENCSR000DWD",
    "K562 CTCF": "ENCSR000AKO",
    "HepG2 RNA-seq": "ENCSR561FEE_P",
    "HepG2 DNase": "ENCSR000EJV",
    "HepG2 H3K4me3": "ENCSR000AMP",
    "HepG2 CTCF": "ENCSR000BIE",
}
elements_to_plot = ["protein_coding_gene", "exon", "intron", "splice_donor", "splice_acceptor"]

out = ntv3_tracks(
    {"chrom": "chr19", "start": 6_700_000, "end": 6_831_072, "species": "human"},
    plot=True,
    tracks_to_plot=tracks_to_plot,
    elements_to_plot=elements_to_plot,
)
```
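Because the predicted tracks cover only the central 37.5% of the input window, plotted coordinates are offset from the query window. As a small self-contained sketch (the 37.5% fraction comes from the comments above; the helper name is ours), the center-region coordinates for a window can be computed like this:

```python
def center_region(start: int, end: int, fraction: float = 0.375) -> tuple[int, int]:
    """Return the genomic coordinates of the central `fraction` of [start, end)."""
    length = end - start
    covered = int(length * fraction)
    offset = (length - covered) // 2
    return start + offset, start + offset + covered

# For the chr19 window used above (131,072 bp input)
print(center_region(6_700_000, 6_831_072))  # (6740960, 6790112)
```

For the example window, predictions therefore span roughly chr19:6,740,960-6,790,112 (49,152 bp).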