---
title: NTv3 — Foundation Models for Long-Range Genomics
emoji: 🧬
colorFrom: indigo
colorTo: blue
sdk: static
pinned: false
---
# 🧬 NTv3 — Foundation Models for Long-Range Genomics
This Space is the companion hub for NTv3 checkpoints on the Hugging Face Hub. It provides PyTorch notebooks and minimal examples for inference, sequence-to-function prediction (functional tracks), genome annotation, fine-tuning, model interpretation and sequence generation.
## 📖 About NTv3
NTv3 is a multi-species genomic foundation model family that unifies representation learning, functional-track prediction, genome annotation, and controllable sequence generation within a single U-Net-style backbone. It models up to 1 Mb of DNA at single-base resolution, using a conv–Transformer–deconv architecture that efficiently captures both local motifs and long-range regulatory dependencies. NTv3 is first pretrained on ~9T base pairs from the OpenGenome2 corpus spanning >128k species using masked language modeling, and then post-trained with a joint objective on ~16k functional tracks and annotation labels across 24 animal and plant species, enabling state-of-the-art cross-species functional prediction and base-resolution genome annotation.
Beyond prediction, NTv3 can be fine-tuned into a controllable generative model via masked-diffusion language modeling, allowing targeted design of regulatory sequences (for example, enhancers with specified activity and promoter selectivity) that have been validated experimentally.
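As a purely conceptual illustration of that masked-diffusion decoding loop, here is a hypothetical sketch that starts from a fully masked window and iteratively commits the most confident base predictions. The checkpoint name, the step schedule, and the assumption that the tokenizer exposes a `mask_token_id` are all illustrative; the released generation-tuned checkpoints and their sampling utilities may differ.

```python
# Conceptual masked-diffusion-style sampler (NOT the official NTv3 generation
# API): unmask the most confident positions a few at a time.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

repo = "InstaDeepAI/NTv3_650M_pre"  # illustrative; a generation-tuned checkpoint would be used in practice
tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained(repo, trust_remote_code=True).eval()

seq_len, n_steps = 128, 16  # commits seq_len // n_steps positions per step
ids = torch.full((1, seq_len), tok.mask_token_id, dtype=torch.long)  # fully masked canvas

with torch.no_grad():
    for _ in range(n_steps):
        logits = model(input_ids=ids).logits     # (1, L, V)
        conf, pred = logits.softmax(-1).max(-1)  # confidence and argmax base per position
        conf[ids != tok.mask_token_id] = -1.0    # never rewrite already-committed positions
        top = conf.topk(seq_len // n_steps, dim=-1).indices
        ids[0, top[0]] = pred[0, top[0]]

print(tok.decode(ids[0]))
```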
## 📓 Notebooks

Notebooks live in `./notebooks/`:

- 🚀 `00_quickstart_inference.ipynb` — load a checkpoint and run inference
- 📊 `01_tracks_prediction.ipynb` — sequence → functional tracks (with plotting)
- 🏷️ `02_genome_annotation_segmentation.ipynb` — sequence → annotation
- 🎯 `03_finetune_head.ipynb` — fine-tune on bigwig tracks
- 🔍 `04_model_interpretation.ipynb` — interpretation of the post-trained model
- 🧪 `05_sequence_generation.ipynb` — fine-tune NTv3 to generate enhancer sequences
## 📦 Install

```bash
pip install torch transformers accelerate safetensors huggingface_hub numpy
```
## 🤖 Load a pre-trained model
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

repo = "InstaDeepAI/NTv3_650M_pre"
tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained(repo, trust_remote_code=True)

batch = tok(
    ["ATCGNATCG", "ACGT"],
    add_special_tokens=False,
    padding=True,
    pad_to_multiple_of=128,
    return_tensors="pt",
)
out = model(**batch, output_hidden_states=True, output_attentions=True)

print(out.logits.shape)        # (B, L, V = 11)
print(len(out.hidden_states))  # convs + transformers + deconvs
print(len(out.attentions))     # equals transformer layers = 12
```
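The returned hidden states can double as sequence embeddings for downstream tasks. Below is a minimal follow-on sketch, assuming `out.hidden_states[-1]` is the final base-resolution representation and that `attention_mask` marks real (non-padding) bases:

```python
import torch

hidden = out.hidden_states[-1]                # (B, L, D) per-base representations
mask = batch["attention_mask"].unsqueeze(-1)  # (B, L, 1): 1 for real bases, 0 for padding
emb = (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # mean-pool over real bases only
print(emb.shape)                              # (B, D) sequence-level embeddings
```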
## 💻 Pipelines
Here is a quick example of how to use the post-trained NTv3 100M model on a human genomic window.
```python
from transformers import AutoConfig

model_name = "InstaDeepAI/NTv3_100M"

# Load the track-prediction pipeline
cfg = AutoConfig.from_pretrained(model_name, trust_remote_code=True, force_download=True)
pipe = cfg.load_tracks_pipeline(model_name, device="auto")  # or "cpu"/"cuda"/"mps"

# Run track prediction
out = pipe(
    {
        "chrom": "chr19",
        "start": 6_700_000,
        "end": 6_831_072,
        "species": "human",
    }
)

print(out.bigwig_tracks_logits.shape)  # functional track predictions
print(out.bed_tracks_logits.shape)     # genome annotation predictions
print(out.mlm_logits.shape)            # MLM logits: (B, L, V = 11)
```
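The pipeline returns raw logits, which can be post-processed as usual. A small sketch under stated assumptions (the `(B, L, n_tracks)` layout and the sigmoid readout for annotation tracks are assumptions; check the model card for the authoritative shapes and track ordering):

```python
import torch

annot_probs = torch.sigmoid(out.bed_tracks_logits)  # assumed per-base annotation probabilities
track = out.bigwig_tracks_logits[0, :, 0]           # one functional track across the window
print(annot_probs.shape, track.shape)
```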
## 🤖 Checkpoints

- 📦 Pre-trained: `InstaDeepAI/NTv3_8M_pre`, `InstaDeepAI/NTv3_100M_pre`, `InstaDeepAI/NTv3_650M_pre`
- 🎯 Post-trained: `InstaDeepAI/NTv3_100M`, `InstaDeepAI/NTv3_650M`
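Checkpoints can be pre-fetched with `huggingface_hub` (standard Hub tooling, not NTv3-specific), which is convenient for cached or offline environments:

```python
from huggingface_hub import snapshot_download

# Download a checkpoint once; the returned path works with from_pretrained(...)
local_dir = snapshot_download("InstaDeepAI/NTv3_100M")
print(local_dir)
```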
## 🔗 Links
- 📄 Paper: (add link)
- 💻 JAX research code (GitHub): https://github.com/instadeepai/nucleotide-transformer
- 🏆 NTv3 benchmark leaderboard: (add link)
## 📝 Citation

```bibtex
@article{ntv3,
  title   = {A foundational model for joint sequence-function multi-species modeling at scale for long-range genomic prediction},
  author  = {…},
  journal = {…},
  year    = {…}
}
```
## 📜 License

- Code & notebooks in this Space: (choose and add, e.g., Apache-2.0)
- Model weights: see the license specified in each model repository