| --- |
| tags: |
| - dna |
| - genomics |
| - multispecies |
| - masked-lm |
| - bert |
| - genomics-foundation-model |
| - modernbert |
| --- |
| |
| # ModernGENA base |
| ModernGENA is a DNA foundation model based on **ModernBERT** (a modernized BERT-style encoder architecture) adapted for genomic sequence modeling. |
| **ModernGENA base** is the 377M-parameter version introduced in the paper *Back to BERT in 2026: ModernGENA as a Strong, Efficient Baseline for DNA Foundation Models*. |
|
|
| For an example of loading the pretrained model and fine-tuning it on a classification task, see the [GENA_LM repository](https://github.com/AIRI-Institute/GENA_LM/tree/main/examples/modernGENA). |
|
|
| ## Technical features |
| - ModernBERT-based encoder architecture |
| - RoPE positional embeddings |
| - hybrid local/global attention |
| - pre-norm transformer blocks |
| - GeGLU feed-forward layers |
| - end-to-end unpadding |
| - FlashAttention-based efficient inference on compatible hardware |
| - `torch.compile` support |
|
|
| ## Model strengths |
| - strong efficiency-quality trade-off |
| - higher inference throughput with FlashAttention-based implementations |
| - competitive downstream performance on the Nucleotide Transformer benchmark |
| - intended to support long genomic contexts |
|
|
| These properties make ModernGENA a practical baseline for genomic modeling experiments and future architectural comparisons. |
|
|
| ## Tokenization |
|
|
|
|
| ModernGENA uses the [**32k BPE vocabulary (AIRI-Institute/gena-lm-bert-base-t2t)**](https://huggingface.co/AIRI-Institute/gena-lm-bert-base-t2t) introduced in GENA-LM, built over the DNA alphabet symbols `A/T/G/C/N`, with special tokens `[CLS]`, `[SEP]`, `[PAD]`, `[UNK]`, and `[MASK]`. |
|
|
| ## Pretraining corpus |
|
|
| - **443 vertebrate genome assemblies** |
| - **353,574,093,776 bp** total |
| - Includes both forward strand and reverse complement sequences |
| - Excludes sequences containing ambiguous symbols (anything other than `A/C/G/T`) |
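The last two corpus rules (dropping ambiguous sequences and including both strands) can be illustrated with a minimal sketch. The helper names `reverse_complement`, `is_unambiguous`, and `expand_with_reverse_complement` are hypothetical and are not taken from the ModernGENA codebase. Uppercasing the input before filtering is also an assumption; the actual preprocessing pipeline may treat soft-masked (lowercase) bases differently.

```python
from typing import Iterable, Iterator

# Translation table for complementing unambiguous DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")
_VALID = frozenset("ACGT")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase A/C/G/T sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def is_unambiguous(seq: str) -> bool:
    """True if the sequence is non-empty and contains only A, C, G and T."""
    return bool(seq) and set(seq) <= _VALID


def expand_with_reverse_complement(seqs: Iterable[str]) -> Iterator[str]:
    """Drop ambiguous sequences, then yield each kept sequence on both strands."""
    for seq in seqs:
        seq = seq.upper()  # assumption: case (soft-masking) is ignored
        if not is_unambiguous(seq):
            continue
        yield seq
        yield reverse_complement(seq)


examples = ["ACGTT", "ACNGT", "ggca"]
# "ACNGT" is dropped because it contains N; the other two appear on both strands.
print(list(expand_with_reverse_complement(examples)))
```

Yielding the forward and reverse-complement strands as separate records keeps the sketch simple; a real pipeline might instead sample the strand on the fly during training.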
|
|
| To reduce overrepresentation of simple repeats and enrich biologically informative regions, training intervals were sampled around transcription start sites (TSS): |
|
|
| - window: **[-16 kbp, +8 kbp]** around each unique TSS |
| - overlapping intervals merged with BEDTools |
| - both strands included for each resulting region |
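The interval construction above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' pipeline: the merge step is a pure-Python stand-in for BEDTools, coordinates are taken to be 0-based and half-open, windows are clamped at position 0 but not at chromosome ends, and "upstream" is assumed to be defined relative to the gene's strand (the card does not say whether the window is strand-aware). The names `tss_window` and `merge_intervals` are hypothetical.

```python
from typing import List, Tuple

UPSTREAM = 16_000    # bp upstream of the TSS
DOWNSTREAM = 8_000   # bp downstream of the TSS

Interval = Tuple[str, int, int]  # (chrom, start, end), 0-based half-open


def tss_window(chrom: str, tss: int, strand: str) -> Interval:
    """Window around one TSS; assumes upstream is relative to the gene strand."""
    if strand == "+":
        start, end = tss - UPSTREAM, tss + DOWNSTREAM
    elif strand == "-":
        start, end = tss - DOWNSTREAM, tss + UPSTREAM
    else:
        raise ValueError(f"unknown strand: {strand!r}")
    return chrom, max(start, 0), end  # clamp at the chromosome start


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or adjacent intervals per chromosome."""
    merged: List[Interval] = []
    for chrom, start, end in sorted(intervals):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            prev = merged[-1]
            merged[-1] = (chrom, prev[1], max(prev[2], end))
        else:
            merged.append((chrom, start, end))
    return merged


# Hypothetical TSS annotations: (chrom, position, strand); one is duplicated.
tss_sites = [
    ("chr1", 20_000, "+"),
    ("chr1", 30_000, "-"),
    ("chr1", 100_000, "+"),
    ("chr1", 20_000, "+"),
]
unique_sites = sorted(set(tss_sites))  # keep each unique TSS once
regions = merge_intervals([tss_window(*site) for site in unique_sites])
print(regions)
```

Each merged region would then be used on both strands, as in the last bullet above.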
|
|
| ## Load pretrained model |
|
|
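| ```python |
| from transformers import AutoTokenizer, AutoModel |
| |
| # ModernGENA reuses the GENA-LM 32k BPE tokenizer. |
| tokenizer = AutoTokenizer.from_pretrained("AIRI-Institute/gena-lm-bert-base-t2t") |
| # flash_attention_2 requires the flash-attn package and a supported GPU. |
| model = AutoModel.from_pretrained( |
|     "AIRI-Institute/moderngena-base", |
|     trust_remote_code=True, |
|     attn_implementation="flash_attention_2", |
| ) |
| ``` |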
|
|
| ## Evaluation |
| For evaluation results, see our paper, *Back to BERT in 2026: ModernGENA as a Strong, Efficient Baseline for DNA Foundation Models*. |
|
|
| ## Citation |