---
library_name: transformers
pipeline_tag: text-generation
tags:
  - genomics
  - dna
  - generative
  - ntv3
  - enhancer-generation
  - mdlm
  - diffusion
  - conditional-generation
license: other
language:
  - code
model_parameter_count: 658672910
---

# 🧬 NTv3: A Foundation Model for Genomics

NTv3 is a series of foundation models designed to understand and generate genomic sequences. It unifies representation learning, functional prediction, and controllable sequence generation within a single, efficient U-Net-like architecture, and it models long-range dependencies, up to 1 Mb of context, at nucleotide resolution. Pretrained on 9 trillion base pairs, NTv3 excels at functional-track prediction and genome annotation across 24 animal and plant species, and it can be fine-tuned into a controllable generative model for genomic sequence design. This repository hosts the generative model based on NTv3, capable of context-aware DNA sequence generation at desired activity levels. It builds on the post-trained NTv3 model with MDLM-based fine-tuning. For more details, please refer to the NTv3 paper.
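To make the MDLM objective concrete, here is a minimal sketch of the forward corruption step used by masked discrete diffusion: each position is independently replaced by a mask token with probability given by the diffusion time `t`. The vocabulary, token names, and function names below are illustrative assumptions, not the actual NTv3 training code.

```python
import random

# Toy character-level DNA vocabulary plus a mask token (illustrative only;
# not the model's real vocabulary).
VOCAB = ["A", "T", "C", "G", "N", "[MASK]"]
MASK = "[MASK]"


def corrupt(seq, t, rng=random):
    """MDLM-style forward process: mask each position independently
    with probability t (t=0 keeps the sequence, t=1 masks everything)."""
    return [MASK if rng.random() < t else base for base in seq]


rng = random.Random(0)
seq = list("ATCGGATTACA")
noisy = corrupt(seq, t=0.5, rng=rng)
```

During fine-tuning, a denoising network is trained to predict the original bases at the masked positions; at generation time the process is run in reverse, starting from a fully masked sequence and unmasking under the chosen conditioning.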

## ⚖️ License Summary

1. The Licensed Models are only available under this License for Non-Commercial Purposes.
2. You are permitted to reproduce, publish, share and adapt the Output generated by the Licensed Model only for Non-Commercial Purposes and in accordance with this License.
3. You may not use the Licensed Models or any of their Outputs:
   1. for any Commercial Purposes, unless agreed by Us under a separate licence;
   2. to train, improve or otherwise influence the functionality or performance of any other third-party derivative model that is commercial or intended for a Commercial Purpose and is similar to the Licensed Models;
   3. to create models distilled or derived from the Outputs of the Licensed Models, unless such models are for Non-Commercial Purposes and open-sourced under the same license as the Licensed Models; or
   4. in violation of any applicable laws and regulations.

## 📋 Model Summary

- **Architecture:** Conditioned U-Net with adaptive layer norms + Transformer stack
- **Training:** Masked Discrete Language Modeling (MDLM)
- **Conditioning:** Species + activity levels (0–4)
- **Tokenizer:** Character-level over A, T, C, G, N + special tokens
- **Dependencies:** `transformers >= 4.55.0`
- **Input size:** Model trained on 4,096 bp sequences with a 249 bp generation length
- **Note:** Custom code → use `trust_remote_code=True`
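A character-level tokenizer over A/T/C/G/N can be sketched as a simple lookup table. The special-token names and ID assignments below are assumptions for illustration; the model's actual vocabulary ships with the repository and should be loaded via `AutoTokenizer` with `trust_remote_code=True`.

```python
# Hypothetical special tokens and ID layout (illustrative; the real
# tokenizer's vocabulary may differ).
SPECIALS = ["[PAD]", "[MASK]", "[BOS]", "[EOS]"]
BASES = ["A", "T", "C", "G", "N"]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(SPECIALS + BASES)}
ID_TO_TOKEN = {i: tok for tok, i in TOKEN_TO_ID.items()}


def encode(seq: str) -> list[int]:
    """Map each nucleotide to its ID, bracketed by BOS/EOS."""
    ids = [TOKEN_TO_ID["[BOS]"]]
    ids += [TOKEN_TO_ID[base] for base in seq.upper()]
    ids.append(TOKEN_TO_ID["[EOS]"])
    return ids


def decode(ids: list[int]) -> str:
    """Drop special tokens and rebuild the nucleotide string."""
    return "".join(ID_TO_TOKEN[i] for i in ids if ID_TO_TOKEN[i] in BASES)
```

Because tokenization is one token per base, a 4,096 bp input maps to 4,096 nucleotide tokens plus any special tokens the model adds.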