## πŸ“– About NTv3

NTv3 is a multi-species genomic foundation model family that unifies representation learning, functional-track prediction, genome annotation, and controllable sequence generation within a single U-Net-style backbone. It models up to 1 Mb of DNA at single-base resolution, using a conv–Transformer–deconv architecture that efficiently captures both local motifs and long-range regulatory dependencies. NTv3 is first pretrained on ~9T base pairs from the OpenGenome2 corpus spanning >128k species using masked language modeling, and then post-trained with a joint objective on ~16k functional tracks and annotation labels across 24 animal and plant species, enabling state-of-the-art cross-species functional prediction and base-resolution genome annotation.

NTv3 also acts as a controllable generative model via masked-diffusion language modeling, allowing targeted design of regulatory sequences (for example, enhancers with specified activity and promoter selectivity) that have been validated experimentally.

NTv3 Paper Summary

## ✨ Why NTv3?

## πŸ€– Models (see collection)

| Model | Size | Pre-training | Post-training | Usage |
|---|---|---|---|---|
| NTv3-8M | 8M params | βœ… | ❌ | Embeddings, light inference |
| NTv3-100M | 100M params | βœ… | βœ… | Embeddings, tracks, annotation |
| NTv3-650M | 650M params | βœ… | βœ… | Embeddings, tracks, annotation, best accuracy |

## πŸ€– Load a pre-trained model

Here is an example of how to load and use a pre-trained NTv3 model.

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_name = "InstaDeepAI/NTv3_650M_pre"

# Load model and tokenizer
model = AutoModelForMaskedLM.from_pretrained(model_name, trust_remote_code=True)
tok = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# Tokenize input sequences
batch = tok(
    ["ATCGNATCG", "ACGT"],
    add_special_tokens=False,
    padding=True,
    pad_to_multiple_of=128,
    return_tensors="pt",
)

# Run model
out = model(**batch)

# Print output shapes
print(out.logits.shape)  # (B, L, V = 11)
```
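The `pad_to_multiple_of=128` argument rounds the padded batch length up to the next multiple of 128 tokens, which keeps sequence lengths compatible with the model's downsampling stages. A minimal sketch of that arithmetic, independent of the model:

```python
def padded_length(seq_len: int, multiple: int = 128) -> int:
    """Round seq_len up to the next multiple, as pad_to_multiple_of does."""
    return ((seq_len + multiple - 1) // multiple) * multiple

print(padded_length(9))    # 9-base sequence -> padded to 128
print(padded_length(129))  # just past one block -> padded to 256
```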

Model embeddings can be used for fine-tuning on downstream tasks.
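One common recipe for producing a fixed-size sequence embedding from token-level outputs is masked mean pooling over the last hidden states (available via `output_hidden_states=True`). A hedged numpy sketch using dummy arrays in place of the model's real torch tensors; the shapes and the pooling scheme are assumptions, not NTv3's prescribed method:

```python
import numpy as np

def mean_pool(hidden, attention_mask):
    """Average hidden states over real (non-padding) tokens.

    hidden: (batch, seq_len, dim) last-layer hidden states
    attention_mask: (batch, seq_len) with 1 for real tokens, 0 for padding
    """
    mask = attention_mask[:, :, None].astype(hidden.dtype)
    return (hidden * mask).sum(axis=1) / mask.sum(axis=1)

# Dummy stand-ins for model outputs (real shapes: B x L x hidden_dim)
hidden = np.ones((2, 4, 8))
mask = np.array([[1, 1, 1, 0], [1, 1, 0, 0]])
print(mean_pool(hidden, mask).shape)  # (2, 8)
```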

πŸ” Model interpretation

Here is an example of how to use the interpretation pipeline on the NTv3 post-trained model for multi-scale analysis of DNA sequences:

```python
from transformers import pipeline
import torch
import matplotlib.pyplot as plt

model_name = "InstaDeepAI/NTv3_650M_post"

# Build interpretation pipeline
ntv3_interpret = pipeline(
    "ntv3-interpret",
    model=model_name,
    trust_remote_code=True,
    device=0 if torch.cuda.is_available() else -1,
)

# Run interpretation on a given genomic region with tracks, annotations, attention, and saliency
result = ntv3_interpret(
    {"chrom": "chr11", "start": 5_253_561, "end": 5_286_329, "species": "human"},
    output_attention=True,
    output_saliency=True,
    saliency_track_id="ENCSR000EFT",  # K562 GATA1 ChIP-seq
    plot=True,  # plot predictions on tracks and annotations
    tracks_to_plot={"K562 RNA-seq": "ENCSR056HPM", "K562 GATA1": "ENCSR000EFT"},
    elements_to_plot=["exon", "promoter_Tissue_specific"],
)

# Access attention map results
result.plot_attention()  # attention map (last layer)
plt.show()

# Access saliency score results
result.plot_saliency(window_size=128)
plt.show()
```
*Output tracks visualization*
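The `window_size=128` argument to `plot_saliency` aggregates per-base saliency scores into windows before plotting. Assuming a simple non-overlapping window mean (an illustration only; the pipeline may use a different aggregation), the smoothing looks roughly like:

```python
def window_mean(scores, window_size=128):
    """Average per-base scores over non-overlapping windows (drops any ragged tail)."""
    n = len(scores) // window_size
    return [
        sum(scores[i * window_size:(i + 1) * window_size]) / window_size
        for i in range(n)
    ]

saliency = [0.5] * 256 + [1.0] * 128  # toy per-base saliency for a 384-bp region
print(window_mean(saliency))  # [0.5, 0.5, 1.0]
```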

## πŸ’» Use a post-trained model

Here is a quick example of how to use the post-trained NTv3 650M model to predict tracks for a human genomic window.

```python
from transformers import pipeline
import torch

model_name = "InstaDeepAI/NTv3_650M_post"

ntv3_tracks = pipeline(
    "ntv3-tracks",
    model=model_name,
    trust_remote_code=True,
    device=0 if torch.cuda.is_available() else -1,
)

# Run track prediction
out = ntv3_tracks(
    {
        "chrom": "chr19",
        "start": 6_700_000,
        "end": 6_831_072,
        "species": "human",
    }
)

# Print output shapes
# ~7k human tracks over the central 37.5% of the input sequence
print("bigwig_tracks_logits:", tuple(out.bigwig_tracks_logits.shape))
# Locations of 21 genomic element types over the central 37.5% of the input sequence
print("bed_tracks_logits:", tuple(out.bed_tracks_logits.shape))
# Language model logits over the vocabulary for the whole sequence
print("language model logits:", tuple(out.mlm_logits.shape))
```
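The 131,072-bp window above (6,831,072 βˆ’ 6,700,000) yields track and element predictions only over its central 37.5%. Assuming a symmetric center crop (a sketch of the arithmetic, not a pipeline API), the predicted coordinates can be computed directly:

```python
def center_crop(start, end, fraction=0.375):
    """Return genomic coordinates of the central `fraction` of a window."""
    length = end - start
    crop = int(length * fraction)
    offset = (length - crop) // 2
    return start + offset, start + offset + crop

print(center_crop(6_700_000, 6_831_072))  # (6740960, 6790112), a 49,152-bp span
```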

Predictions can also be plotted for a subset of functional tracks and genomic elements:

```python
tracks_to_plot = {
    "K562 RNA-seq": "ENCSR056HPM",
    "K562 DNase": "ENCSR921NMD",
    "K562 H3K4me3": "ENCSR000DWD",
    "K562 CTCF": "ENCSR000AKO",
    "HepG2 RNA-seq": "ENCSR561FEE_P",
    "HepG2 DNase": "ENCSR000EJV",
    "HepG2 H3K4me3": "ENCSR000AMP",
    "HepG2 CTCF": "ENCSR000BIE",
}
elements_to_plot = ["protein_coding_gene", "exon", "intron", "splice_donor", "splice_acceptor"]

out = ntv3_tracks(
    {"chrom": "chr19", "start": 6_700_000, "end": 6_831_072, "species": "human"},
    plot=True,
    tracks_to_plot=tracks_to_plot,
    elements_to_plot=elements_to_plot,
)
```
*Output tracks visualization*