---
title: NTv3 — Foundation Models for Long-Range Genomics
emoji: 🧬
colorFrom: indigo
colorTo: blue
sdk: static
pinned: false
---
# 🧬 NTv3 — Foundation Models for Long-Range Genomics
This Space is the companion hub for NTv3 checkpoints on the Hugging Face Hub. It provides PyTorch notebooks and minimal examples for inference, sequence-to-function prediction (functional tracks), genome annotation, fine-tuning, model interpretation, and sequence generation.
## 📖 About NTv3
NTv3 is a multi-species genomic foundation model family that unifies representation learning, functional-track prediction, genome annotation, and controllable sequence generation within a single U-Net-style backbone. It models up to 1 Mb of DNA at single-base resolution, using a conv–Transformer–deconv architecture that efficiently captures both local motifs and long-range regulatory dependencies. NTv3 is first pretrained on ~9T base pairs from the OpenGenome2 corpus spanning >128k species using masked language modeling, and then post-trained with a joint objective on ~16k functional tracks and annotation labels across 24 animal and plant species, enabling state-of-the-art cross-species functional prediction and base-resolution genome annotation.
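As a rough illustration of the masked-language-modeling setup used in pretraining, the sketch below masks a fraction of bases in a DNA string and records the positions where the loss would be computed. The 15% mask rate and the `N` mask character are illustrative assumptions, not NTv3's exact recipe:

```python
import random

def mask_sequence(seq: str, mask_rate: float = 0.15, mask_char: str = "N", seed: int = 0):
    """Randomly mask a fraction of bases; the model is trained to recover them."""
    rng = random.Random(seed)
    n_masked = max(1, int(len(seq) * mask_rate))
    positions = sorted(rng.sample(range(len(seq)), k=n_masked))
    masked = list(seq)
    for p in positions:
        masked[p] = mask_char
    return "".join(masked), positions

masked, targets = mask_sequence("ATCGATTGAGCTCTAGCG")
print(masked)   # the sequence with ~15% of bases replaced by the mask character
print(targets)  # indices where the MLM loss would be evaluated
```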
Beyond prediction, NTv3 can be fine-tuned into a controllable generative model via masked-diffusion language modeling, allowing targeted design of regulatory sequences (for example, enhancers with specified activity and promoter selectivity) that have been validated experimentally.
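Conceptually, masked-diffusion generation starts from a fully masked sequence and unmasks a fraction of positions at each step. The toy `predict` below simply samples random bases and stands in for the fine-tuned model; step count, schedule, and length here are hypothetical:

```python
import random

BASES = "ACGT"

def predict(seq):
    """Stand-in for the model: propose a base for every masked position."""
    rng = random.Random(42)
    return {i: rng.choice(BASES) for i, c in enumerate(seq) if c == "?"}

def generate(length=16, steps=4):
    seq = ["?"] * length  # start fully masked
    rng = random.Random(0)
    for step in range(steps):
        proposals = predict(seq)
        # unmask an equal share of the remaining masked positions each step
        k = max(1, len(proposals) // (steps - step))
        for i in rng.sample(sorted(proposals), k):
            seq[i] = proposals[i]
    return "".join(seq)

print(generate())  # a fully unmasked 16-bp sequence after the final step
```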
## 📓 Notebooks
Notebooks live in `./notebooks/`:
- 🚀 `00_quickstart_inference.ipynb` — load a checkpoint + run inference
- 📊 `01_tracks_prediction.ipynb` — sequence → functional tracks (+ plotting)
- 🏷️ `02_genome_annotation_segmentation.ipynb` — sequence → annotation
- 🎯 `03_finetune_head.ipynb` — fine-tune on bigwig tracks
- 🔍 `04_model_interpretation.ipynb` — interpretation of post-trained model
- 🧪 `05_sequence_generation.ipynb` — fine-tune NTv3 to generate enhancer sequences
## 📦 Install
```bash
pip install torch transformers accelerate safetensors huggingface_hub numpy
```
## 🤖 Load a pre-trained model
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

repo = "InstaDeepAI/NTv3_650M_pre"
tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained(repo, trust_remote_code=True)

# Tokenize a batch of DNA sequences, padding up to a multiple of 128
batch = tok(
    ["ATCGNATCG", "ACGT"],
    add_special_tokens=False,
    padding=True,
    pad_to_multiple_of=128,
    return_tensors="pt",
)
out = model(**batch, output_hidden_states=True, output_attentions=True)
print(out.logits.shape)        # (B, L, V = 11)
print(len(out.hidden_states))  # convs + transformers + deconvs
print(len(out.attentions))     # equals transformer layers = 12
```
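The returned hidden states can be pooled into a fixed-size embedding per sequence. The sketch below shows masked mean pooling on dummy data; in practice you would apply the same logic to `out.hidden_states[-1]` and `batch["attention_mask"]` as torch tensors (the dummy values here are purely illustrative):

```python
def masked_mean_pool(hidden, mask):
    """Average per-token embeddings over real (non-padding) positions only.

    hidden: (L, D) list of per-token embedding vectors
    mask:   (L,) attention mask, 1 for real tokens, 0 for padding
    """
    dim = len(hidden[0])
    total = [0.0] * dim
    count = 0
    for vec, m in zip(hidden, mask):
        if m:
            total = [t + v for t, v in zip(total, vec)]
            count += 1
    return [t / count for t in total]

# Dummy last-layer hidden states for a length-4 sequence padded to 6 tokens
hidden = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [0.0, 0.0], [0.0, 0.0]]
mask = [1, 1, 1, 1, 0, 0]
print(masked_mean_pool(hidden, mask))  # [4.0, 5.0]
```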
## 💻 Pipelines
Here is a quick example of how to use the post-trained NTv3 100M model on a human genomic window.
```python
from transformers import AutoConfig
model_name = "InstaDeepAI/NTv3_100M"
# Load track prediction pipeline
cfg = AutoConfig.from_pretrained(model_name, trust_remote_code=True, force_download=True)
pipe = cfg.load_tracks_pipeline(model_name, device="auto") # or "cpu"/"cuda"/"mps"
# Run track prediction
out = pipe(
    {
        "chrom": "chr19",
        "start": 6_700_000,
        "end": 6_831_072,
        "species": "human",
    }
)
print(out.bigwig_tracks_logits.shape) # functional track predictions
print(out.bed_tracks_logits.shape) # genome annotation predictions
print(out.mlm_logits.shape) # MLM logits: (B, L, V = 11)
```
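Track outputs cover the requested window, so absolute genomic coordinates map to array positions by subtracting the window start. The helper below assumes base-resolution output (`bin_size=1`); a post-trained head may predict at a coarser bin size, in which case the divisor changes accordingly:

```python
def genome_to_index(pos, window_start, bin_size=1):
    """Convert an absolute genomic coordinate to an index into a track array."""
    if pos < window_start:
        raise ValueError("position lies before the requested window")
    return (pos - window_start) // bin_size

window_start, window_end = 6_700_000, 6_831_072  # the chr19 window above
i = genome_to_index(6_750_000, window_start)
print(i)  # 50000: offset of chr19:6,750,000 within the window
```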
## 🤖 Checkpoints
**📦 Pre-trained:** `InstaDeepAI/NTv3_8M_pre`, `InstaDeepAI/NTv3_100M_pre`, `InstaDeepAI/NTv3_650M_pre`
**🎯 Post-trained:** `InstaDeepAI/NTv3_100M`, `InstaDeepAI/NTv3_650M`
## 🔗 Links
- **📄 Paper:** (add link)
- **💻 JAX research code (GitHub):** [https://github.com/instadeepai/nucleotide-transformer](https://github.com/instadeepai/nucleotide-transformer)
- **🏆 NTv3 benchmark leaderboard:** (add link)
## 📝 Citation
```bibtex
@article{ntv3,
title = {A foundational model for joint sequence-function multi-species modeling at scale for long-range genomic prediction},
author = {…},
journal = {…},
year = {…}
}
```
## 📜 License
**Code & notebooks in this Space:** (choose and add, e.g., Apache-2.0)
**Model weights:** see the license specified in each model repository