---
license: cc-by-4.0
language:
- dna
tags:
- biology
- genomics
- plant
- dna
- foundation-model
- esm
- masked-language-modeling
library_name: transformers
pipeline_tag: fill-mask
---
# PlantDNA-FM

**PlantDNA-FM** is a plant DNA foundation model pre-trained on the complete
collection of plant genomes from **Ensembl Plants**, with a specific focus on
**gene-body regions** and their flanking regulatory sequence: promoters,
exons, introns, and terminators. The model is designed to learn the
regulatory grammar and coding/non-coding sequence structure underlying plant
gene expression and evolution.
## Model Overview

PlantDNA-FM adopts an **ESM-style transformer encoder** architecture
(`EsmForMaskedLM`) at single-nucleotide resolution. The vocabulary contains
only the canonical DNA alphabet (A / T / C / G / N) together with the
standard ESM special tokens (`<cls>`, `<pad>`, `<eos>`, `<unk>`, `<mask>`),
giving a compact 10-token vocabulary. With so small an embedding table,
nearly all of the parameter budget sits in the transformer layers rather
than in token embeddings.
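To make the vocabulary concrete, here is a minimal sketch of single-nucleotide tokenization in plain Python. The token-to-id order and the `<cls>` ... `<eos>` wrapping are assumptions made for illustration (they follow the usual ESM convention); the real mapping is defined by the tokenizer files in the repository and can be inspected with `tokenizer.get_vocab()`.

```python
# Hypothetical id order; the released tokenizer defines the real ids.
SPECIAL_TOKENS = ["<cls>", "<pad>", "<eos>", "<unk>", "<mask>"]
NUCLEOTIDES = ["A", "T", "C", "G", "N"]
VOCAB = {tok: i for i, tok in enumerate(SPECIAL_TOKENS + NUCLEOTIDES)}


def encode(seq: str) -> list[int]:
    """Map each base to one token id, wrapped in <cls> ... <eos> (assumed convention)."""
    unk = VOCAB["<unk>"]
    ids = [VOCAB.get(base, unk) for base in seq.upper()]
    return [VOCAB["<cls>"]] + ids + [VOCAB["<eos>"]]


ids = encode("ATCGN")
print(len(VOCAB), ids)
```

One base always maps to exactly one token, so a position in the model output can be mapped straight back to a genomic coordinate (after accounting for any leading special token).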
### Key Technical Features

- **Single-nucleotide tokenization**: no k-mer or BPE compression, preserving
  the base-level resolution required for variant and motif analysis.
- **Rotary Position Embeddings (RoPE)**: position-aware attention without
  learned absolute position vectors, which can also help with length
  extrapolation.
- **Token dropout** during training: improves robustness of MLM
  representations, following the ESM-2 recipe.
- **Pre-LN transformer encoder** with GELU activations and post-encoder
  layer normalization.
- **Maximum supported sequence length: 2,048 nucleotides**, sufficient to
  cover a typical plant gene body together with its flanking regulatory
  regions in a single forward pass.
- **~133M parameters**, providing a favorable trade-off between
  representational capacity and downstream fine-tuning cost.
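The RoPE item above can be illustrated with a small, dependency-free sketch. It rotates a query or key vector by position-dependent angles using the "rotate-half" layout and the conventional base of 10000; both choices are assumptions for illustration, not a copy of the model's implementation. The check at the end shows the property that matters: the attention score depends only on the distance between two positions, not on their absolute values.

```python
import math


def rope(vec: list[float], position: int, base: float = 10000.0) -> list[float]:
    """Rotate one query/key vector by position-dependent angles (rotate-half layout)."""
    dim = len(vec)
    half = dim // 2
    out = [0.0] * dim
    for i in range(half):
        # Each (i, i + half) pair is a 2-D plane rotated at its own frequency.
        theta = position * base ** (-2.0 * i / dim)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x1, x2 = vec[i], vec[i + half]
        out[i] = x1 * cos_t - x2 * sin_t
        out[i + half] = x1 * sin_t + x2 * cos_t
    return out


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


q = [0.3, -1.2, 0.7, 0.5]
k = [1.0, 0.4, -0.6, 0.9]
# Both position pairs are 3 apart, so the two scores should agree.
score_near = dot(rope(q, 5), rope(k, 2))
score_far = dot(rope(q, 105), rope(k, 102))
print(abs(score_near - score_far) < 1e-9)
```

Because the rotation only changes direction, vector norms are preserved, and no position table has to be learned or stored.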
## Training Data

The model was pre-trained on the **entire set of plant reference genomes
available from Ensembl Plants**, covering a phylogenetically diverse range of
species spanning monocots, eudicots, basal angiosperms, gymnosperms, and
non-vascular plants.

Rather than sampling uniformly across the genome, the pre-training corpus
was **constructed around annotated gene bodies**, with windows centered on:

- **Promoter regions** (upstream regulatory sequence)
- **Exons** (coding sequence)
- **Introns** (intragenic non-coding sequence)
- **Terminator regions** (downstream regulatory sequence)

This curation steers the model's capacity toward biologically informative
regions, so that it learns regulatory motifs, splice boundaries, codon
usage, and gene-architecture signatures rather than genome-wide repeat
content.
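As a rough illustration of what "windows centered on" annotated features could look like, the sketch below places a fixed-size window around each feature's midpoint and shifts it back inside the sequence near the ends. The function name, the 0-based half-open coordinates, the clamping rule, and the toy annotations are all assumptions; the actual corpus-construction pipeline is not described in this card.

```python
def centered_window(seq_len: int, feat_start: int, feat_end: int, window: int) -> tuple[int, int]:
    """Return a [start, end) window of at most `window` bases centered on a feature.

    Coordinates are 0-based and half-open. Near a sequence end the window is
    shifted, not shrunk, so every window has the same length when possible.
    """
    window = min(window, seq_len)
    mid = (feat_start + feat_end) // 2
    start = mid - window // 2
    start = max(0, min(start, seq_len - window))
    return start, start + window


# Toy chromosome and made-up annotations: (start, end, label).
chrom = "ACGT" * 10  # 40 bp
features = [(0, 6, "promoter"), (10, 20, "exon"), (20, 30, "intron"), (34, 40, "terminator")]

for feat_start, feat_end, label in features:
    s, e = centered_window(len(chrom), feat_start, feat_end, window=16)
    print(label, s, e, chrom[s:e])
```

In practice the window size would match the model's 2,048-nucleotide context rather than the 16 bp used for this toy example.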
## Pre-training Objectives

PlantDNA-FM was trained with a **multi-task self-supervised objective**
combining four complementary pretext tasks:

1. **Masked Language Modeling (MLM)**: the canonical ESM-style objective.
   Random nucleotide positions are masked and the model is trained to
   recover the original base from context, learning local sequence
   dependencies and motif-level patterns.
2. **Gene Annotation Classification**: given a sequence window, the model
   predicts the gene-element annotation (e.g. promoter / exon / intron /
   terminator). This task injects structural priors about gene
   architecture directly into the learned representation.
3. **Mutation Detection**: the model is trained to identify positions that
   carry synthetic substitutions relative to the reference, sharpening its
   sensitivity to evolutionarily implausible or context-violating bases.
4. **Mutation Recovery**: beyond detection, the model is trained to
   *restore* the original reference base at mutated positions, encouraging
   it to internalize a generative model of plant sequence likelihood.

Together, these four objectives produce representations that are
simultaneously useful for **likelihood-based variant scoring**,
**functional element annotation**, and **general-purpose sequence embedding**.
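The sketch below shows how inputs and targets for three of these tasks (MLM, mutation detection, mutation recovery) could be built from a reference sequence. It is an illustration under stated assumptions: the 15% rates, the helper names, and the uniform choice of substitute base are placeholders, since the card does not specify the actual sampling scheme.

```python
import random

BASES = "ACGT"


def mask_for_mlm(seq: str, rate: float, rng: random.Random) -> tuple[list[str], dict[int, str]]:
    """Replace a random subset of positions with <mask>; return tokens and {position: original base}."""
    tokens, targets = list(seq), {}
    for i, base in enumerate(seq):
        if rng.random() < rate:
            tokens[i] = "<mask>"
            targets[i] = base
    return tokens, targets


def mutate(seq: str, rate: float, rng: random.Random) -> tuple[str, list[int], dict[int, str]]:
    """Introduce synthetic substitutions.

    Returns the mutated sequence, per-position detection labels (1 = mutated),
    and recovery targets {position: reference base}.
    """
    mutated, labels, recovery = list(seq), [0] * len(seq), {}
    for i, ref in enumerate(seq):
        if ref in BASES and rng.random() < rate:
            mutated[i] = rng.choice([b for b in BASES if b != ref])
            labels[i] = 1
            recovery[i] = ref
    return "".join(mutated), labels, recovery


rng = random.Random(0)  # fixed seed so the toy example is repeatable
reference = "ATGGCGTACGTTAGCCTAGA"
tokens, mlm_targets = mask_for_mlm(reference, rate=0.15, rng=rng)
mutated, detect_labels, recover_targets = mutate(reference, rate=0.15, rng=rng)
print(tokens)
print(mutated, detect_labels, recover_targets)
```

Detection is a per-position binary label, while recovery is a per-position base prediction restricted to the mutated sites, which is why the two tasks share an input but use different targets.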
## Quick Start

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("ATLASBIOINFO/PlantDNA-FM")
model = AutoModelForMaskedLM.from_pretrained("ATLASBIOINFO/PlantDNA-FM")
model.eval()  # disable dropout for inference

seq = "ATCGATCGATCGATCG"
inputs = tokenizer(seq, return_tensors="pt")
with torch.no_grad():  # no gradients needed for a forward pass
    outputs = model(**inputs)
# seq_len counts any special tokens the tokenizer adds around the bases.
logits = outputs.logits  # (batch, seq_len, vocab_size)
```

For embedding extraction, load with `AutoModel` instead of
`AutoModelForMaskedLM` and use the last hidden state
(`outputs.last_hidden_state`).
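The `logits` from the snippet above can be turned into a simple variant score. The sketch below computes a log-likelihood ratio between an alternative and a reference base at one position, which is a common way to score variants with masked language models; it is not confirmed to be the author's scoring method. It takes one row of logits as a plain list (for example `logits[0, pos].tolist()`, ideally computed with that position replaced by `<mask>`); the logit values and token ids shown are made up, and real ids should come from `tokenizer.convert_tokens_to_ids`.

```python
import math


def log_softmax(row: list[float]) -> list[float]:
    """Numerically stable log-softmax over one position's vocabulary logits."""
    peak = max(row)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_norm for x in row]


def variant_score(row: list[float], ref_id: int, alt_id: int) -> float:
    """log P(alt) - log P(ref); negative values mean the model prefers the reference base."""
    log_probs = log_softmax(row)
    return log_probs[alt_id] - log_probs[ref_id]


# Made-up logits for a 10-token vocabulary; ids 5 and 7 are placeholders.
row = [-4.0, -4.0, -4.0, -4.0, -4.0, 2.5, 0.1, -0.3, 0.4, -2.0]
print(round(variant_score(row, ref_id=5, alt_id=7), 3))
```

A strongly negative score suggests the substitution is unlikely in its sequence context, but scores like this are relative and would need calibration against known variants before being interpreted.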
## Limitations

- The model is **plant-specific**; its performance is not expected to
  transfer to animal, fungal, or microbial sequences.
- Pre-training emphasizes **gene-body and proximal regulatory regions**;
  intergenic, centromeric, and highly repetitive genomic contexts are
  under-represented.
- Maximum context is **2,048 bp**; longer genomic regions must be tiled.
- Single-nucleotide tokenization spends one token per base, so token count,
  and with it downstream fine-tuning cost, grows at least linearly with
  sequence length.
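Tiling a longer region can be sketched as follows. The helper name and the 256 bp overlap are illustrative assumptions; the card does not say whether the 2,048 limit includes special tokens such as `<cls>` and `<eos>`, so a slightly smaller window (for example 2,046) may be needed.

```python
def tile_spans(length: int, window: int = 2048, overlap: int = 256) -> list[tuple[int, int]]:
    """Cover [0, length) with half-open windows of at most `window` bases.

    Consecutive windows share `overlap` bases so that motifs near a tile
    boundary appear with context in at least one tile.
    """
    if not 0 <= overlap < window:
        raise ValueError("overlap must be in [0, window)")
    if length <= 0:
        return []
    stride = window - overlap
    spans, start = [], 0
    while True:
        end = min(start + window, length)
        spans.append((start, end))
        if end >= length:
            return spans
        start += stride


region = "ACGT" * 1250  # toy 5,000 bp region
for start, end in tile_spans(len(region)):
    tile = region[start:end]  # each tile can be passed to the tokenizer separately
    print(start, end, len(tile))
```

Per-position outputs from overlapping tiles then need to be merged, for example by keeping the prediction from the tile in which the position sits furthest from an edge.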
## Citation

If you use PlantDNA-FM in your research, please cite this repository.

```
@misc{plantdnafm2026,
  title = {PlantDNA-FM: Plant DNA Foundation Model},
  author = {Haopeng Yu},
  year = {2026},
  url = {https://huggingface.co/ATLASBIOINFO/PlantDNA-FM}
}
```