---
title: NTv3 — Foundation Models for Long-Range Genomics
emoji: 🧬
colorFrom: indigo
colorTo: blue
sdk: static
pinned: false
---

# 🧬 NTv3 — Foundation Models for Long-Range Genomics

This Space is the companion hub for NTv3 checkpoints on the Hugging Face Hub. It provides PyTorch notebooks and minimal examples for inference, sequence-to-function prediction (functional tracks), genome annotation, fine-tuning, model interpretation and sequence generation.

## 📖 About NTv3

NTv3 is a multi-species genomic foundation model family that unifies representation learning, functional-track prediction, genome annotation, and controllable sequence generation within a single U-Net-style backbone. It models up to 1 Mb of DNA at single-base resolution, using a conv–Transformer–deconv architecture that efficiently captures both local motifs and long-range regulatory dependencies. NTv3 is first pretrained with masked language modeling on ~9T base pairs from the OpenGenome2 corpus, which spans >128k species, and then post-trained with a joint objective on ~16k functional tracks and annotation labels across 24 animal and plant species, enabling state-of-the-art cross-species functional prediction and base-resolution genome annotation.
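
To make the backbone shape concrete, here is a toy PyTorch sketch of the conv–Transformer–deconv pattern; layer counts, widths, and strides are illustrative placeholders, not the actual NTv3 configuration:

```python
import torch
import torch.nn as nn

class ConvTransformerDeconv(nn.Module):
    """Toy U-Net-style backbone: strided convs downsample a long DNA input,
    a Transformer models long-range interactions on the shortened sequence,
    and transposed convs restore single-base resolution."""
    def __init__(self, vocab=11, dim=128, down=4, layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        # Strided conv shrinks sequence length by a factor of `down`.
        self.down = nn.Conv1d(dim, dim, kernel_size=down, stride=down)
        enc = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(enc, num_layers=layers)
        # Transposed conv restores the original base-level length.
        self.up = nn.ConvTranspose1d(dim, dim, kernel_size=down, stride=down)
        self.lm_head = nn.Linear(dim, vocab)

    def forward(self, tokens):                           # (B, L)
        x = self.embed(tokens).transpose(1, 2)           # (B, D, L)
        x = self.down(x).transpose(1, 2)                 # (B, L/down, D)
        x = self.transformer(x)
        x = self.up(x.transpose(1, 2)).transpose(1, 2)   # (B, L, D)
        return self.lm_head(x)                           # (B, L, vocab)

logits = ConvTransformerDeconv()(torch.randint(0, 11, (1, 1024)))
print(logits.shape)  # torch.Size([1, 1024, 11])
```

The convolutions mean the Transformer attends over a much shorter sequence than the raw input, which is what makes 1 Mb windows tractable while keeping per-nucleotide outputs.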

Beyond prediction, NTv3 can be fine-tuned into a controllable generative model via masked-diffusion language modeling, allowing targeted design of regulatory sequences (for example, enhancers with specified activity and promoter selectivity) that have been validated experimentally.
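
At inference time, masked-diffusion generation boils down to iterative unmasking: start from an all-mask sequence and commit the model's most confident predictions over several refinement steps. The sampler below is a generic sketch of that loop; the `model` callable, `mask_id`, and the linear commitment schedule are assumptions for illustration, not NTv3's actual sampling code:

```python
import torch

def masked_diffusion_sample(model, length, mask_id, steps=8):
    """Generic iterative-unmasking sampler (sketch only).
    `model` maps token ids of shape (B, L) to logits of shape (B, L, V)."""
    tokens = torch.full((1, length), mask_id)
    for step in range(steps):
        still_masked = tokens == mask_id
        if not still_masked.any():
            break
        logits = model(tokens)                   # (1, L, V)
        conf, pred = logits.softmax(-1).max(-1)  # per-position confidence + argmax
        # Only still-masked positions compete for commitment.
        conf = conf.masked_fill(~still_masked, -1.0)
        # Commit a growing fraction of the masked positions each step.
        k = max(1, int(still_masked.sum().item() * (step + 1) / steps))
        idx = conf[0].topk(k).indices
        tokens[0, idx] = pred[0, idx]
    return tokens
```

With a fine-tuned checkpoint, `model` would be the masked-LM forward pass and `mask_id` the tokenizer's mask token id; see `05_sequence_generation.ipynb` for the actual workflow.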

## 📓 Notebooks

Notebooks live in `./notebooks/`:

- 🚀 `00_quickstart_inference.ipynb` — load a checkpoint + run inference
- 📊 `01_tracks_prediction.ipynb` — sequence → functional tracks (+ plotting)
- 🏷️ `02_genome_annotation_segmentation.ipynb` — sequence → annotation
- 🎯 `03_finetune_head.ipynb` — fine-tune on bigwig tracks
- 🔍 `04_model_interpretation.ipynb` — interpretation of post-trained model
- 🧪 `05_sequence_generation.ipynb` — fine-tune NTv3 to generate enhancer sequences

## 📦 Install

```bash
pip install torch transformers accelerate safetensors huggingface_hub numpy
```

## 🤖 Load a pre-trained model

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

repo = "InstaDeepAI/NTv3_650M_pre"
tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained(repo, trust_remote_code=True)

batch = tok(["ATCGNATCG", "ACGT"], add_special_tokens=False, padding=True, pad_to_multiple_of=128, return_tensors="pt")
out = model(**batch, output_hidden_states=True, output_attentions=True)

print(out.logits.shape)       # (B, L, V = 11)
print(len(out.hidden_states)) # convs + transformers + deconvs
print(len(out.attentions))    # equals transformer layers = 12
```
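
For downstream tasks you often want per-base embeddings rather than MLM logits. A minimal sketch, assuming the last entry of `hidden_states` holds the final base-resolution representation and the tokenizer returns an `attention_mask`:

```python
# Final-layer per-base embeddings: (B, L, D)
embeddings = out.hidden_states[-1]

# Mean-pool over sequence positions, ignoring padding, for one vector per sequence.
mask = batch["attention_mask"].unsqueeze(-1)       # (B, L, 1)
pooled = (embeddings * mask).sum(1) / mask.sum(1)  # (B, D)
print(pooled.shape)
```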

## 💻 Pipelines

Here is a quick example of how to use the post-trained NTv3 100M model on a human genomic window.

```python
from transformers import AutoConfig

model_name = "InstaDeepAI/NTv3_100M"

# Load track prediction pipeline
cfg = AutoConfig.from_pretrained(model_name, trust_remote_code=True, force_download=True)
pipe = cfg.load_tracks_pipeline(model_name, device="auto")  # or "cpu"/"cuda"/"mps"

# Run track prediction
out = pipe(
    {
        "chrom": "chr19",
        "start": 6_700_000,
        "end": 6_831_072,
        "species": "human"
    }
)

print(out.bigwig_tracks_logits.shape)   # functional track predictions
print(out.bed_tracks_logits.shape)      # genome annotation predictions
print(out.mlm_logits.shape)             # MLM logits: (B, L, V = 11)
```
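
How you post-process these logits depends on the head semantics documented in the model repo; the transforms below are placeholders showing the general pattern, not confirmed inverse links:

```python
import torch

# Placeholder post-processing; check the model card for the actual output semantics.
track_signal = torch.sigmoid(out.bigwig_tracks_logits)  # bounded per-base track signal
per_base_labels = out.bed_tracks_logits.argmax(-1)      # most likely annotation class per base
print(track_signal.shape, per_base_labels.shape)
```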

## 🤖 Checkpoints

**📦 Pre-trained:** `InstaDeepAI/NTv3_8M_pre`, `InstaDeepAI/NTv3_100M_pre`, `InstaDeepAI/NTv3_650M_pre`

**🎯 Post-trained:** `InstaDeepAI/NTv3_100M`, `InstaDeepAI/NTv3_650M`
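
For offline or cluster use, a checkpoint can be fetched ahead of time with `huggingface_hub`; later `from_pretrained` calls then resolve from the local cache:

```python
from huggingface_hub import snapshot_download

# Download the full repository once and reuse the cached copy afterwards.
local_dir = snapshot_download(repo_id="InstaDeepAI/NTv3_650M_pre")
print(local_dir)
```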

## 🔗 Links

- **📄 Paper:** (add link)
- **💻 JAX research code (GitHub):** [https://github.com/instadeepai/nucleotide-transformer](https://github.com/instadeepai/nucleotide-transformer)
- **🏆 NTv3 benchmark leaderboard:** (add link)

## 📝 Citation

```bibtex
@article{ntv3,
  title   = {A foundational model for joint sequence-function multi-species modeling at scale for long-range genomic prediction},
  author  = {…},
  journal = {…},
  year    = {…}
}
```

## 📜 License

**Code & notebooks in this Space:** (choose and add, e.g., Apache-2.0)

**Model weights:** see the license specified in each model repository