---
title: NTv3 — Foundation Models for Long-Range Genomics
emoji: 🧬
colorFrom: indigo
colorTo: blue
sdk: static
pinned: false
---

# 🧬 NTv3 — Foundation Models for Long-Range Genomics

This Space is the companion hub for the NTv3 checkpoints on the Hugging Face Hub. It provides PyTorch notebooks and minimal examples for inference, sequence-to-function prediction (functional tracks), genome annotation, fine-tuning, model interpretation, and sequence generation.

## 📖 About NTv3

NTv3 is a multi-species genomic foundation model family that unifies representation learning, functional-track prediction, genome annotation, and controllable sequence generation within a single U-Net-style backbone. It models up to 1 Mb of DNA at single-base resolution, using a conv–Transformer–deconv architecture that efficiently captures both local motifs and long-range regulatory dependencies.

NTv3 is first pretrained with masked language modeling on ~9T base pairs from the OpenGenome2 corpus, spanning >128k species. It is then post-trained with a joint objective on ~16k functional tracks and annotation labels across 24 animal and plant species, enabling state-of-the-art cross-species functional prediction and base-resolution genome annotation. Beyond prediction, NTv3 can be fine-tuned into a controllable generative model via masked-diffusion language modeling, allowing targeted design of regulatory sequences (for example, enhancers with specified activity and promoter selectivity) that have been validated experimentally.

## 📓 Notebooks

Notebooks live in `./notebooks/`:

- 🚀 `00_quickstart_inference.ipynb` — load a checkpoint and run inference
- 📊 `01_tracks_prediction.ipynb` — sequence → functional tracks (+ plotting)
- 🏷️ `02_genome_annotation_segmentation.ipynb` — sequence → annotation
- 🎯 `03_finetune_head.ipynb` — fine-tune on bigwig tracks
- 🔍 `04_model_interpretation.ipynb` — interpretation of the post-trained model
- 🧪 `05_sequence_generation.ipynb` — fine-tune NTv3 to generate enhancer sequences

## 📦 Install

```bash
pip install torch transformers accelerate safetensors huggingface_hub numpy
```

## 🤖 Load a pre-trained model

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

repo = "InstaDeepAI/NTv3_650M_pre"
tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained(repo, trust_remote_code=True)

batch = tok(
    ["ATCGNATCG", "ACGT"],
    add_special_tokens=False,
    padding=True,
    pad_to_multiple_of=128,
    return_tensors="pt",
)

out = model(**batch, output_hidden_states=True, output_attentions=True)

print(out.logits.shape)        # (B, L, V = 11)
print(len(out.hidden_states))  # convs + transformers + deconvs
print(len(out.attentions))     # equals transformer layers = 12
```
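If you want sequence embeddings rather than MLM logits, you can pool the hidden states returned above. The snippet below is a minimal sketch, not part of the official API: it reuses `model` and `batch` from the previous example and assumes the last entry of `out.hidden_states` is at base resolution with shape `(B, L, D)`, and that the tokenizer batch contains an `attention_mask`.

```python
import torch

# Sketch: mean-pool the final hidden state into one embedding per sequence.
# Assumptions: out.hidden_states[-1] has shape (B, L, D) at base resolution,
# and batch["attention_mask"] marks real (non-padding) positions with 1.
with torch.no_grad():
    out = model(**batch, output_hidden_states=True)

last_hidden = out.hidden_states[-1]                 # (B, L, D), assumed layout
mask = batch["attention_mask"].unsqueeze(-1)        # (B, L, 1), 1 = real base
summed = (last_hidden * mask).sum(dim=1)            # zero out padded positions
embeddings = summed / mask.sum(dim=1).clamp(min=1)  # mean over real positions
print(embeddings.shape)                             # (B, D)
```

For per-base embeddings (e.g., for variant-effect scoring), skip the pooling and use `last_hidden` directly.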
## 💻 Pipelines

Here is a quick example of how to use the post-trained NTv3 100M model on a human genomic window.

```python
from transformers import AutoConfig

model_name = "InstaDeepAI/NTv3_100M"

# Load the track prediction pipeline
cfg = AutoConfig.from_pretrained(model_name, trust_remote_code=True, force_download=True)
pipe = cfg.load_tracks_pipeline(model_name, device="auto")  # or "cpu"/"cuda"/"mps"

# Run track prediction
out = pipe(
    {
        "chrom": "chr19",
        "start": 6_700_000,
        "end": 6_831_072,
        "species": "human",
    }
)

print(out.bigwig_tracks_logits.shape)  # functional track predictions
print(out.bed_tracks_logits.shape)     # genome annotation predictions
print(out.mlm_logits.shape)            # MLM logits: (B, L, V = 11)
```

## 🤖 Checkpoints

**📦 Pre-trained:** `InstaDeepAI/NTv3_8M_pre`, `InstaDeepAI/NTv3_100M_pre`, `InstaDeepAI/NTv3_650M_pre`

**🎯 Post-trained:** `InstaDeepAI/NTv3_100M`, `InstaDeepAI/NTv3_650M`

## 🔗 Links

- **📄 Paper:** (add link)
- **💻 JAX research code (GitHub):** [https://github.com/instadeepai/nucleotide-transformer](https://github.com/instadeepai/nucleotide-transformer)
- **🏆 NTv3 benchmark leaderboard:** (add link)

## 📝 Citation

```bibtex
@article{ntv3,
  title   = {A foundational model for joint sequence-function multi-species modeling at scale for long-range genomic prediction},
  author  = {…},
  journal = {…},
  year    = {…}
}
```

## 📜 License

**Code & notebooks in this Space:** (choose and add, e.g., Apache-2.0)

**Model weights:** see the license specified in each model repository.