	---
	tags:
	- dna
	- genomics
	- multispecies
	- masked-lm
	- bert
	- genomics-foundation-model
	- modernbert
	---

	# ModernGENA base
	ModernGENA is a DNA foundation model based on ModernBERT (a modernized BERT-style encoder architecture) adapted for genomic sequence modeling.
	ModernGENA base is the 377M-parameter version introduced in the paper *Back to BERT in 2026: ModernGENA as a Strong, Efficient Baseline for DNA Foundation Models*.

	To load the pre-trained model and fine-tune it on a classification task, see the examples in the [GENA_LM repository](https://github.com/AIRI-Institute/GENA_LM/tree/main/examples/modernGENA).

	## Technical features
	- ModernBERT-based encoder architecture
	- RoPE positional embeddings
	- hybrid local/global attention
	- pre-norm transformer blocks
	- GeGLU feed-forward layers
	- end-to-end unpadding
	- FlashAttention-based efficient inference on compatible hardware
	- `torch.compile` support

	## Model strengths
	- strong efficiency-quality trade-off
	- higher inference throughput with FlashAttention-based implementations
	- competitive downstream performance on the Nucleotide Transformer benchmark
	- intended to support long genomic contexts

	This makes it a practical baseline for genomic modeling experiments and future architectural comparisons.

	## Tokenization


	ModernGENA uses the [32k BPE vocabulary (AIRI-Institute/gena-lm-bert-base-t2t)](https://huggingface.co/AIRI-Institute/gena-lm-bert-base-t2t) introduced in GENA-LM, built over the DNA alphabet symbols `A/T/G/C/N`, with special tokens `[CLS]`, `[SEP]`, `[PAD]`, `[UNK]`, and `[MASK]`.
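
As a rough illustration of preparing input for this tokenizer, the sketch below checks a sequence against the `A/T/G/C/N` alphabet before encoding it. The helper names (`normalize_dna`, `encode_dna`, `load_gena_tokenizer`) are hypothetical, and uppercasing the input is an assumption rather than documented preprocessing.

```python
DNA_ALPHABET = frozenset("ATGCN")


def normalize_dna(sequence: str) -> str:
    """Uppercase a DNA string and reject symbols outside A/T/G/C/N.

    Uppercasing is an assumption of this sketch, not a documented step.
    """
    cleaned = sequence.strip().upper()
    unexpected = sorted(set(cleaned) - DNA_ALPHABET)
    if unexpected:
        raise ValueError(f"unexpected symbols in sequence: {unexpected}")
    return cleaned


def load_gena_tokenizer():
    """Download the GENA-LM tokenizer from the Hugging Face Hub."""
    from transformers import AutoTokenizer  # imported lazily; needs network access

    return AutoTokenizer.from_pretrained("AIRI-Institute/gena-lm-bert-base-t2t")


def encode_dna(sequence: str, tokenizer):
    """Validate the sequence, then encode it with the tokenizer passed in."""
    return tokenizer(normalize_dna(sequence))
```

Calling `encode_dna("ATGGCGTACGTTAGC", load_gena_tokenizer())` should return the usual tokenizer output with `input_ids`. Because the vocabulary is BPE, one token generally covers several nucleotides.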

	## Pretraining corpus

	- 443 vertebrate genome assemblies
	- 353,574,093,776 bp total
	- Includes both forward strand and reverse complement sequences
	- Excludes sequences containing ambiguous symbols (anything other than `A/C/G/T`)
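
The last two points can be sketched in a few lines of Python. This is a minimal illustration, not the actual data pipeline: the function names are hypothetical, and uppercasing the input before the check is an assumption.

```python
# Translation table mapping each base to its complement.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def is_unambiguous(sequence: str) -> bool:
    """True if the sequence is non-empty and contains only A/C/G/T."""
    return bool(sequence) and set(sequence) <= set("ACGT")


def reverse_complement(sequence: str) -> str:
    """Complement every base, then reverse the string."""
    return sequence.translate(_COMPLEMENT)[::-1]


def both_strands(sequences):
    """Yield forward and reverse-complement copies, skipping ambiguous sequences."""
    for seq in sequences:
        seq = seq.upper()  # assumption: case is not meaningful here
        if not is_unambiguous(seq):
            continue
        yield seq
        yield reverse_complement(seq)
```

For example, `list(both_strands(["ATGC", "ATNG"]))` keeps `ATGC` and its reverse complement `GCAT`, and drops `ATNG` because it contains `N`.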

	To reduce overrepresentation of simple repeats and enrich biologically informative regions, training intervals were sampled around transcription start sites:

	- window: [-16 kbp, +8 kbp] around each unique TSS
	- overlapping intervals merged with BEDTools
	- both strands included for each resulting region
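
The windowing and merging steps can be approximated in plain Python as follows. This is a sketch under stated assumptions, not the pipeline itself. The function names are hypothetical. The window is assumed to be strand-aware, so "upstream" flips for minus-strand genes. Merging is assumed to join overlapping or directly adjacent intervals on a single chromosome, as `bedtools merge` does with its default settings.

```python
UPSTREAM_BP = 16_000   # window extent upstream of the TSS
DOWNSTREAM_BP = 8_000  # window extent downstream of the TSS


def tss_window(tss: int, strand: str = "+") -> tuple[int, int]:
    """Return the (start, end) window around a TSS, clamped at coordinate 0.

    Assumption: the window is strand-aware, so for "-" strand genes the
    16 kbp upstream part lies at higher coordinates.
    """
    if strand == "+":
        start, end = tss - UPSTREAM_BP, tss + DOWNSTREAM_BP
    elif strand == "-":
        start, end = tss - DOWNSTREAM_BP, tss + UPSTREAM_BP
    else:
        raise ValueError(f"strand must be '+' or '-', got {strand!r}")
    return max(start, 0), end


def merge_intervals(intervals):
    """Merge overlapping or adjacent (start, end) intervals on one chromosome."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

With TSS positions 20,000 and 30,000 on the plus strand, the two windows overlap and merge into a single region, while a TSS at 100,000 stays separate.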

	## Load pretrained model

	```python
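	from transformers import AutoTokenizer, AutoModel

	# The tokenizer is shared with GENA-LM (32k BPE vocabulary).
	tokenizer = AutoTokenizer.from_pretrained("AIRI-Institute/gena-lm-bert-base-t2t")
	# attn_implementation="flash_attention_2" needs a compatible GPU and the flash-attn package.
	model = AutoModel.from_pretrained(
	    "AIRI-Institute/moderngena-base", trust_remote_code=True, attn_implementation="flash_attention_2"
	)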
	```
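
Since `flash_attention_2` only works on compatible hardware with the `flash-attn` package installed, one option is to request it only when it appears to be available. The sketch below is an assumption-laden illustration: the helper names are hypothetical, and this card does not state whether the model runs without FlashAttention, so the fallback path (leaving `attn_implementation` unset) is untested here.

```python
import importlib.util


def flash_attention_installed() -> bool:
    """Check whether the flash_attn package can be found, without importing it."""
    return importlib.util.find_spec("flash_attn") is not None


def model_load_kwargs(use_flash_attention: bool) -> dict:
    """Build keyword arguments for AutoModel.from_pretrained."""
    kwargs = {"trust_remote_code": True}
    if use_flash_attention:
        kwargs["attn_implementation"] = "flash_attention_2"
    return kwargs


def load_moderngena():
    """Load the model, requesting FlashAttention 2 only if it looks usable."""
    import torch  # imported lazily so the helpers above stay dependency-free
    from transformers import AutoModel

    use_fa = torch.cuda.is_available() and flash_attention_installed()
    return AutoModel.from_pretrained(
        "AIRI-Institute/moderngena-base", **model_load_kwargs(use_fa)
    )
```

`load_moderngena()` downloads the weights on first use; the two small helpers can be reused on their own if you prefer to control loading yourself.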

	## Evaluation
	For evaluation results, see our paper, *Back to BERT in 2026: ModernGENA as a Strong, Efficient Baseline for DNA Foundation Models*.

	## Citation