|
|
--- |
|
|
library_name: transformers |
|
|
pipeline_tag: text-generation |
|
|
tags: |
|
|
- genomics |
|
|
- dna |
|
|
- generative |
|
|
- ntv3 |
|
|
- enhancer-generation |
|
|
- mdlm |
|
|
- diffusion |
|
|
- conditional-generation |
|
|
license: other |
|
|
language: |
|
|
- code |
|
|
model_parameter_count: 658672910 |
|
|
--- |
|
|
|
|
|
|
|
|
## 🧬 NTv3: A Foundation Model for Genomics |
|
|
|
|
|
NTv3 is a series of foundation models designed to understand and generate genomic sequences. It unifies representation learning, functional prediction, and controllable sequence generation within a single, efficient U-Net-like architecture, and models long-range dependencies, up to 1 Mb of context, at nucleotide resolution. Pretrained on 9 trillion base pairs, NTv3 excels at functional-track prediction and genome annotation across 24 animal and plant species, and can also be fine-tuned into a controllable generative model for genomic sequence design. This is the **generative model** based on NTv3, capable of context-aware DNA sequence generation with desired activity levels. It builds on the post-trained NTv3 model with MDLM-based fine-tuning. For more details, please refer to the [NTv3 paper](https://www.biorxiv.org/content/10.64898/2025.12.22.695963v1). |
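To illustrate the idea behind MDLM-based generation, here is a toy sketch (not the released training or inference code; the denoiser below is a random stand-in for the trained network): generation starts from a fully masked sequence and iteratively commits predicted nucleotides at masked positions.

```python
import random

VOCAB = ["A", "T", "C", "G"]
MASK = "?"  # placeholder mask symbol for this toy sketch


def toy_denoiser(seq):
    """Stand-in for the trained network: proposes a nucleotide for
    every masked position (here, uniformly at random)."""
    return {i: random.choice(VOCAB) for i, c in enumerate(seq) if c == MASK}


def mdlm_generate(length=32, steps=8, seed=0):
    """Reverse (generation) process of masked discrete diffusion:
    start fully masked, then unmask a fraction of positions per step
    using the denoiser's predictions."""
    random.seed(seed)
    seq = [MASK] * length
    masked = list(range(length))
    random.shuffle(masked)  # random unmasking order
    per_step = max(1, length // steps)
    while masked:
        proposals = toy_denoiser(seq)
        for i in masked[:per_step]:
            seq[i] = proposals[i]  # commit this step's predictions
        masked = masked[per_step:]
    return "".join(seq)


print(mdlm_generate())
```

The real model replaces the random denoiser with the conditioned U-Net, so species and activity-level conditioning steer which nucleotides are predicted at each step.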
|
|
|
|
|
## ⚖️ License Summary |
|
|
|
|
|
1. The Licensed Models are **only** available under this License for Non-Commercial Purposes. |
|
|
2. You are permitted to reproduce, publish, share and adapt the Output generated by the Licensed Model only for Non-Commercial Purposes and in accordance with this License. |
|
|
3. You may **not** use the Licensed Models or any of their Outputs: |

    1. in connection with any Commercial Purposes, unless agreed by Us under a separate licence; |

    2. to train, improve or otherwise influence the functionality or performance of any other third-party derivative model that is commercial or intended for a Commercial Purpose and is similar to the Licensed Models; |

    3. to create models distilled or derived from the Outputs of the Licensed Models, unless such models are for Non-Commercial Purposes and open-sourced under the same license as the Licensed Models; or |

    4. in violation of any applicable laws and regulations. |
|
|
|
|
|
## 📋 Model Summary |
|
|
|
|
|
- Architecture: Conditioned U-Net with adaptive layer norms + Transformer stack |
|
|
- Training: Masked Discrete Language Modeling (MDLM) |
|
|
- Conditioning: Species + Activity levels (0-4) |
|
|
- Tokenizer: Character-level over A T C G N + special tokens |
|
|
- Dependencies: transformers >= 4.55.0 |
|
|
- Input size: Model trained on 4096 bp sequences with a 249 bp generation length |
|
|
- Note: Custom code → use `trust_remote_code=True` |
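A character-level tokenizer over A/T/C/G/N can be sketched as follows. This is illustrative only: the real vocabulary, token IDs, and special tokens are defined by the model's bundled tokenizer, which is loaded from the custom code via `trust_remote_code=True`.

```python
class ToyDNATokenizer:
    """Minimal character-level tokenizer over A/T/C/G/N plus a few
    special tokens; illustrative, not the model's real vocabulary."""

    SPECIALS = ["<pad>", "<bos>", "<eos>", "<mask>"]

    def __init__(self):
        self.vocab = {tok: i for i, tok in enumerate(self.SPECIALS + list("ATCGN"))}
        self.inv = {i: tok for tok, i in self.vocab.items()}

    def encode(self, seq):
        # One token per nucleotide, wrapped in <bos>/<eos>.
        ids = [self.vocab["<bos>"]]
        ids += [self.vocab[c] for c in seq.upper()]
        ids.append(self.vocab["<eos>"])
        return ids

    def decode(self, ids):
        # Drop special tokens and join nucleotides back into a string.
        return "".join(self.inv[i] for i in ids if self.inv[i] not in self.SPECIALS)


tok = ToyDNATokenizer()
ids = tok.encode("ATCGN")
print(ids, tok.decode(ids))
```

In practice, load the real tokenizer and model through `transformers` (`AutoTokenizer.from_pretrained(..., trust_remote_code=True)`); the sketch above only shows why one nucleotide maps to one token.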
|
|
|