moderngena-base / README.md
aspeedok's picture
Update README.md
7036ebb verified
---
tags:
- dna
- genomics
- multispecies
- masked-lm
- bert
- genomics-foundation-model
- modernbert
---
# ModernGENA base
ModernGENA is a DNA foundation model based on **ModernBERT** (a modernized BERT-style encoder architecture) adapted for genomic sequence modeling.
**ModernGENA base** is the 377M-parameter version introduced in the paper *Back to BERT in 2026: ModernGENA as a Strong, Efficient Baseline for DNA Foundation Models*.
How to load pre-trained model to fine-tune it on classification task: [GENA_LM repository](https://github.com/AIRI-Institute/GENA_LM/tree/main/examples/modernGENA)
## Technical features
- ModernBERT-based encoder architecture
- RoPE positional embeddings
- hybrid local/global attention
- pre-norm transformer blocks
- GeGLU feed-forward layers
- end-to-end unpadding
- FlashAttention-based efficient inference on compatible hardware
- `torch.compile` support
## Model strengths
- strong efficiency-quality trade-off
- higher inference throughput with FlashAttention-based implementations
- competitive downstream performance on the Nucleotide Transformer benchmark
- intended to support long genomic contexts
This makes it a practical baseline for genomic modeling experiments and future architectural comparisons.
## Tokenization
ModernGENA uses the [**32k BPE vocabulary (AIRI-Institute/gena-lm-bert-base-t2t)**](https://huggingface.co/AIRI-Institute/gena-lm-bert-base-t2t) introduced in GENA-LM, built over the DNA alphabet symbols `A/T/G/C/N`, with special tokens `[CLS]`, `[SEP]`, `[PAD]`, `[UNK]`, and `[MASK]`.
## Pretraining corpus
- **443 vertebrate genome assemblies**
- **353,574,093,776 bp** total
- Includes both forward strand and reverse complement sequences
- Excludes sequences containing ambiguous symbols other than `A/C/G/T`
To reduce overrepresentation of simple repeats and enrich biologically informative regions, training intervals were sampled around transcription start sites:
- window: **[-16 kbp, +8 kbp]** around each unique TSS
- overlapping intervals merged with BEDTools
- both strands included for each resulting region
## Load pretrained model
```python
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("AIRI-Institute/gena-lm-bert-base-t2t")
model = AutoModel.from_pretrained("AIRI-Institute/moderngena-base", trust_remote_code=True, attn_implementation="flash_attention_2")
```
## Evaluation
For evaluation results, see our paper:
## Citation