PlantDNA-FM
PlantDNA-FM is a plant DNA foundation model pre-trained on the complete collection of plant genomes from Ensembl Plants, with a specific focus on gene-body regions, including promoters, exons, introns, and terminators. The model is designed to learn the regulatory grammar and coding/non-coding sequence structure underlying plant gene expression and evolution.
Model Overview
PlantDNA-FM adopts an ESM-style transformer encoder architecture
(EsmForMaskedLM) at single-nucleotide resolution. The vocabulary contains
only the canonical DNA alphabet (A / T / C / G / N) together with the
standard ESM special tokens (<cls>, <pad>, <eos>, <unk>, <mask>),
giving a compact 10-token vocabulary, so the embedding table stays tiny and
almost the entire parameter budget goes to the encoder layers.
Key Technical Features
- Single-nucleotide tokenization: no k-mer or BPE compression, preserving the base-level resolution required for variant and motif analysis.
- Rotary Position Embeddings (RoPE): enable length extrapolation and position-aware attention without learned absolute position vectors.
- Token dropout during training: improves robustness of MLM representations, following the ESM-2 recipe.
- Pre-LN transformer encoder with GELU activations and post-encoder layer normalization.
- Maximum supported sequence length: 2,048 nucleotides, sufficient to cover a typical plant gene body together with its flanking regulatory regions in a single forward pass.
- ~133M parameters, providing a favorable trade-off between representational capacity and downstream fine-tuning cost.
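To illustrate the RoPE feature listed above, here is a minimal pure-Python sketch of a rotary embedding. It assumes the common rotate-half formulation with a frequency base of 10000; the model's exact configuration is not stated in this card, and `rope` and `dot` are illustrative helper names, not part of any library.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate a query/key vector by position-dependent angles (rotate-half layout).

    Assumption: base=10000 and the (i, i + d/2) pairing are the common defaults,
    not values confirmed for PlantDNA-FM.
    """
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("vector dimension must be even")
    half = d // 2
    out = [0.0] * d
    for i in range(half):
        theta = pos * base ** (-2.0 * i / d)
        c, s = math.cos(theta), math.sin(theta)
        x1, x2 = vec[i], vec[i + half]
        # 2-D rotation of the pair (x1, x2) by angle theta
        out[i] = x1 * c - x2 * s
        out[i + half] = x2 * c + x1 * s
    return out

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.7, 0.5]
k = [1.0, 0.4, -0.6, 0.9]
# Both pairs of positions are 2 apart, so the attention scores agree.
score_near = dot(rope(q, 3), rope(k, 1))
score_far = dot(rope(q, 10), rope(k, 8))
```

Because the score depends only on the offset between the two positions, no learned absolute position table is needed.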
Training Data
The model was pre-trained on the entire set of plant reference genomes available from Ensembl Plants, covering a phylogenetically diverse range of species spanning monocots, eudicots, basal angiosperms, gymnosperms, and non-vascular plants.
Rather than sampling uniformly across the genome, the pre-training corpus was constructed around annotated gene bodies, with windows centered on:
- Promoter regions (upstream regulatory sequence)
- Exons (coding sequence)
- Introns (intragenic non-coding sequence)
- Terminator regions (downstream regulatory sequence)
This curation forces the model to allocate its capacity toward biologically informative regions, learning regulatory motifs, splice boundaries, codon usage, and gene-architecture signatures rather than genome-wide repeat content.
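As a rough illustration of windows centered on an annotated element, the sketch below computes window coordinates for one feature. The actual extraction pipeline is not described in this card: the function name `centered_window`, the 0-based half-open coordinates, the 2,048 bp default, and the clamping behavior at chromosome ends are all assumptions.

```python
def centered_window(feature_start, feature_end, chrom_length, window=2048):
    """Return (start, end) of a window centered on a feature.

    Coordinates are 0-based and half-open. The window is shifted, not
    truncated, when it would run past either end of the chromosome.
    Hypothetical helper: not taken from the PlantDNA-FM pipeline.
    """
    if window <= 0 or chrom_length <= 0:
        raise ValueError("window and chrom_length must be positive")
    size = min(window, chrom_length)
    center = (feature_start + feature_end) // 2
    start = center - size // 2
    # Clamp so that the window stays inside [0, chrom_length)
    start = max(0, min(start, chrom_length - size))
    return start, start + size
```

For example, an exon annotated at 10,000 to 10,400 on a long chromosome would yield the window 9,176 to 11,224 under these assumptions.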
Pre-training Objectives
PlantDNA-FM was trained with a multi-task self-supervised objective combining four complementary pretext tasks:
Masked Language Modeling (MLM): the canonical ESM-style objective. Random nucleotide positions are masked and the model is trained to recover the original base from context, learning local sequence dependencies and motif-level patterns.
Gene Annotation Classification: given a sequence window, the model predicts the gene-element annotation (e.g. promoter / exon / intron / terminator). This task injects structural priors about gene architecture directly into the learned representation.
Mutation Detection: the model is trained to identify positions that carry synthetic substitutions relative to the reference, sharpening its sensitivity to evolutionarily implausible or context-violating bases.
Mutation Recovery: beyond detection, the model is trained to restore the original reference base at mutated positions, encouraging it to internalize a generative model of plant sequence likelihood.
Together, these four objectives produce representations that are simultaneously useful for likelihood-based variant scoring, functional element annotation, and general-purpose sequence embedding.
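The mutation detection and recovery tasks both need sequences with synthetic substitutions and per-position labels. The sketch below shows one way such training pairs could be generated. The substitution rate, the uniform choice of alternative base, and the name `mutate` are assumptions; the card does not describe the actual corruption procedure.

```python
import random

BASES = "ACGT"

def mutate(seq, rate=0.05, rng=None):
    """Introduce random substitutions and return (mutated_seq, labels).

    labels[i] is 1 where the base was substituted and 0 elsewhere, which is
    the detection target; the original `seq` is the recovery target.
    Hypothetical helper: rate and sampling scheme are illustrative only.
    """
    rng = rng or random.Random()
    mutated, labels = [], []
    for base in seq:
        if base in BASES and rng.random() < rate:
            # Always pick a different base so every label of 1 is a real change
            mutated.append(rng.choice([b for b in BASES if b != base]))
            labels.append(1)
        else:
            # Ambiguous bases such as N are left untouched
            mutated.append(base)
            labels.append(0)
    return "".join(mutated), labels

reference = "ACGT" * 50
corrupted, detection_labels = mutate(reference, rate=0.2, rng=random.Random(0))
```

Passing a seeded `random.Random` keeps the corruption reproducible across runs.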
Quick Start
from transformers import AutoTokenizer, AutoModelForMaskedLM
tokenizer = AutoTokenizer.from_pretrained("ATLASBIOINFO/PlantDNA-FM")
model = AutoModelForMaskedLM.from_pretrained("ATLASBIOINFO/PlantDNA-FM")
seq = "ATCGATCGATCGATCG"
inputs = tokenizer(seq, return_tensors="pt")
outputs = model(**inputs)
logits = outputs.logits # (batch, seq_len, vocab_size)
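The logits can be turned into a simple variant score by comparing the predicted probability of the alternative base with that of the reference base at one position. This is a common masked-LM scoring recipe, not a procedure documented in this card; `score_variant` is an illustrative name, and the token ids used in the example are made up.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def score_variant(position_logits, ref_id, alt_id):
    """Return (log p(ref), log p(alt), log-likelihood ratio) at one position.

    position_logits: vocabulary logits at the (ideally masked) variant site.
    The ratio equals logit[alt] - logit[ref], since the normalizer cancels;
    the log-probabilities are returned as well because they are often
    reported alongside the ratio.
    """
    logp = log_softmax(position_logits)
    return logp[ref_id], logp[alt_id], logp[alt_id] - logp[ref_id]

# Toy example with a 3-token vocabulary and hypothetical ids (ref=0, alt=1).
ref_logp, alt_logp, llr = score_variant([2.0, 1.0, 0.0], ref_id=0, alt_id=1)
```

With the real model, `position_logits` would be `logits[0, i].tolist()` for the token index `i` of the variant site, and the ids would come from `tokenizer.convert_tokens_to_ids`. Whether `i` is offset by a leading `<cls>` token should be checked against the tokenizer output rather than assumed.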
For embedding extraction, load with AutoModel instead of
AutoModelForMaskedLM and use the last hidden state.
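A minimal sketch of embedding extraction along those lines follows. Mean pooling over non-padding positions is an assumption made here to obtain one vector per sequence; the card only says to use the last hidden state. `masked_mean_pool` and `embed` are illustrative names.

```python
import torch

def masked_mean_pool(last_hidden_state, attention_mask):
    """Average hidden states over positions where attention_mask == 1.

    last_hidden_state: (batch, seq_len, hidden); attention_mask: (batch, seq_len).
    Special tokens added by the tokenizer are included in the average,
    because the attention mask covers them too.
    """
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1.0)  # avoid division by zero
    return summed / counts

def embed(sequences, model_id="ATLASBIOINFO/PlantDNA-FM"):
    """Return one pooled embedding per input sequence (downloads the model)."""
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id).eval()
    inputs = tokenizer(sequences, return_tensors="pt", padding=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state
    return masked_mean_pool(hidden, inputs["attention_mask"])
```

Calling `embed(["ATCGATCGATCGATCG"])` would return a tensor of shape `(1, hidden_size)`. Per-nucleotide embeddings are available by using `last_hidden_state` directly instead of pooling.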
Limitations
- The model is plant-specific; performance on animal, fungal, or microbial sequences is not expected to transfer.
- Pre-training emphasizes gene-body and proximal regulatory regions; intergenic, centromeric, and highly repetitive genomic contexts are under-represented.
- Maximum context is 2,048 bp; longer genomic regions must be tiled.
- Single-nucleotide tokenization means downstream fine-tuning cost scales linearly with sequence length.
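For regions longer than the context limit, tiling can be sketched as below. The 256 bp overlap and the name `tile_sequence` are assumptions, and if the tokenizer adds special tokens that count toward the 2,048 limit, a slightly smaller `window` may be needed.

```python
def tile_sequence(seq, window=2048, overlap=256):
    """Split a sequence into overlapping tiles of at most `window` bases.

    Returns a list of (start_offset, tile) pairs. The final tile is shifted
    left so that it ends exactly at the end of the sequence, which keeps all
    tiles the same length when len(seq) > window.
    """
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    if len(seq) <= window:
        return [(0, seq)]
    stride = window - overlap
    tiles = []
    start = 0
    while True:
        end = start + window
        if end >= len(seq):
            # Last tile: anchor it to the end of the sequence
            start = len(seq) - window
            tiles.append((start, seq[start:]))
            break
        tiles.append((start, seq[start:end]))
        start += stride
    return tiles

region = "ACGT" * 1250  # 5,000 bp toy region
tiles = tile_sequence(region)
```

The start offsets make it possible to map per-position outputs back to coordinates in the original region, for example by averaging predictions where tiles overlap.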
Citation
If you use PlantDNA-FM in your research, please cite this repository.
@misc{plantdnafm2026,
title = {PlantDNA-FM: Plant DNA Foundation Model},
author = {Haopeng Yu},
year = {2026},
url = {https://huggingface.co/ATLASBIOINFO/PlantDNA-FM}
}