---
tags:
- dna
- genomics
- multispecies
- masked-lm
- bert
- genomics-foundation-model
- modernbert
---

# ModernGENA base

ModernGENA is a DNA foundation model based on **ModernBERT** (a modernized BERT-style encoder architecture) adapted for genomic sequence modeling.

**ModernGENA base** is the 377M-parameter version introduced in the paper *Back to BERT in 2026: ModernGENA as a Strong, Efficient Baseline for DNA Foundation Models*.

For an example of loading the pretrained model and fine-tuning it on a classification task, see the [GENA_LM repository](https://github.com/AIRI-Institute/GENA_LM/tree/main/examples/modernGENA).

## Technical features

- ModernBERT-based encoder architecture
- RoPE positional embeddings
- hybrid local/global attention
- pre-norm transformer blocks
- GeGLU feed-forward layers
- end-to-end unpadding
- FlashAttention-based efficient inference on compatible hardware
- `torch.compile` support

## Model strengths

- strong efficiency-quality trade-off
- higher inference throughput with FlashAttention-based implementations
- competitive downstream performance on the Nucleotide Transformer benchmark
- intended to support long genomic contexts

These properties make it a practical baseline for genomic modeling experiments and future architectural comparisons.

## Tokenization

ModernGENA uses the [**32k BPE vocabulary (AIRI-Institute/gena-lm-bert-base-t2t)**](https://huggingface.co/AIRI-Institute/gena-lm-bert-base-t2t) introduced in GENA-LM, built over the DNA alphabet symbols `A/T/G/C/N`, with special tokens `[CLS]`, `[SEP]`, `[PAD]`, `[UNK]`, and `[MASK]`.

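
As an illustration of the tokenization step, here is a minimal sketch that checks a sequence against the `A/T/G/C/N` alphabet and then splits it into BPE tokens. The helper names `normalize_dna` and `tokenize_dna` are hypothetical and not part of the model's API. Uppercasing the input before tokenization is an assumption made here, not a documented requirement.

```python
from typing import List

# Alphabet symbols named in this model card.
DNA_ALPHABET = set("ATGCN")


def normalize_dna(seq: str) -> str:
    """Strip whitespace, uppercase, and reject symbols outside A/T/G/C/N.

    Hypothetical helper; uppercasing is an assumption of this sketch.
    """
    seq = seq.strip().upper()
    unexpected = set(seq) - DNA_ALPHABET
    if unexpected:
        raise ValueError(f"Unexpected symbols: {sorted(unexpected)}")
    return seq


def tokenize_dna(seq: str) -> List[str]:
    """Return the BPE tokens for one sequence, including special tokens.

    Hypothetical helper; downloads the tokenizer from the Hub on first use.
    """
    # Imported lazily so that normalize_dna works without transformers installed.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("AIRI-Institute/gena-lm-bert-base-t2t")
    encoded = tokenizer(normalize_dna(seq))
    return tokenizer.convert_ids_to_tokens(encoded["input_ids"])
```

Calling `tokenize_dna` on a sequence returns a list of token strings, which is a quick way to inspect how many nucleotides each BPE token covers.
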
## Pretraining corpus

- **443 vertebrate genome assemblies**
- **353,574,093,776 bp** total
- Includes both forward strand and reverse complement sequences
- Excludes sequences containing ambiguous symbols (anything other than `A/C/G/T`)

To reduce overrepresentation of simple repeats and enrich biologically informative regions, training intervals were sampled around transcription start sites (TSS):

- window: **[-16 kbp, +8 kbp]** around each unique TSS
- overlapping intervals merged with BEDTools
- both strands included for each resulting region

## Load pretrained model

```python
from transformers import AutoTokenizer, AutoModel

# The tokenizer is shared with GENA-LM (32k BPE vocabulary).
tokenizer = AutoTokenizer.from_pretrained("AIRI-Institute/gena-lm-bert-base-t2t")

# "flash_attention_2" requires the flash-attn package and a compatible GPU.
model = AutoModel.from_pretrained(
    "AIRI-Institute/moderngena-base",
    trust_remote_code=True,
    attn_implementation="flash_attention_2",
)
```

## Evaluation

For evaluation results, see our paper, *Back to BERT in 2026: ModernGENA as a Strong, Efficient Baseline for DNA Foundation Models*.

## Citation