PlantDNA-FM

PlantDNA-FM is a plant DNA foundation model pre-trained on the complete collection of plant genomes from Ensembl Plants, with a specific focus on gene bodies and their flanking regulatory regions: promoters, exons, introns, and terminators. The model is designed to learn the regulatory grammar and coding/non-coding sequence structure underlying plant gene expression and evolution.

Model Overview

PlantDNA-FM adopts an ESM-style transformer encoder architecture (EsmForMaskedLM) at single-nucleotide resolution. The vocabulary contains only the canonical DNA alphabet (A / T / C / G / N) together with the standard ESM special tokens (<cls>, <pad>, <eos>, <unk>, <mask>). The resulting 10-token vocabulary keeps the embedding table negligibly small, so nearly all of the parameter budget sits in the transformer layers rather than in token embeddings.
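To make the 10-token vocabulary concrete, here is a minimal pure-Python sketch of single-nucleotide encoding. The ID ordering, the `encode` helper, and the choice to wrap each sequence in <cls> and <eos> are assumptions for illustration; the released tokenizer is the authority on the actual mapping.

```python
# Hypothetical ID assignment; the released tokenizer's ordering may differ.
SPECIAL_TOKENS = ["<cls>", "<pad>", "<eos>", "<unk>", "<mask>"]
NUCLEOTIDES = ["A", "T", "C", "G", "N"]
VOCAB = {token: idx for idx, token in enumerate(SPECIAL_TOKENS + NUCLEOTIDES)}


def encode(seq):
    """Map each base to one token ID, wrapped in <cls> ... <eos> (assumed)."""
    ids = [VOCAB["<cls>"]]
    # Anything outside A/T/C/G/N falls back to <unk>.
    ids += [VOCAB.get(base, VOCAB["<unk>"]) for base in seq.upper()]
    ids.append(VOCAB["<eos>"])
    return ids


ids = encode("ATCGN")
print(len(VOCAB), ids)
```

Because every base becomes exactly one token, a position in the token sequence maps directly back to a genomic coordinate, which is what makes per-base outputs usable for variant-level analysis.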

Key Technical Features

Single-nucleotide tokenization: no k-mer or BPE compression, preserving the base-level resolution required for variant and motif analysis.
Rotary Position Embeddings (RoPE): position-aware attention without learned absolute position vectors, which in principle also allows length extrapolation.
Token dropout during training: improves robustness of MLM representations, following the ESM-2 recipe.
Pre-LN transformer encoder with GELU activations and post-encoder layer normalization.
Maximum supported sequence length: 2,048 nucleotides, sufficient to cover a typical plant gene body together with its flanking regulatory regions in a single forward pass.
~133M parameters, providing a favorable trade-off between representational capacity and downstream fine-tuning cost.
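The rotary position embedding listed above can be illustrated with a small pure-Python sketch. It uses the interleaved-pair formulation and a base of 10000, both assumptions here; library implementations may pair dimensions differently, and this is not taken from the model's own code. The point it shows is that the attention score between a rotated query and key depends only on their relative offset.

```python
import math


def rope_rotate(vec, position, base=10000.0):
    """Rotate consecutive pairs of `vec` by position-dependent angles."""
    dim = len(vec)
    rotated = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)  # lower frequency for later pairs
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        rotated.extend([x * cos_t - y * sin_t, x * sin_t + y * cos_t])
    return rotated


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [0.3, -1.2, 0.7, 0.5]
k = [1.0, 0.4, -0.6, 0.9]

# Same relative offset (7 positions), different absolute positions.
score_near_start = dot(rope_rotate(q, 10), rope_rotate(k, 3))
score_shifted = dot(rope_rotate(q, 1010), rope_rotate(k, 1003))
print(abs(score_near_start - score_shifted) < 1e-9)
```

Since only the offset matters, no per-position parameters need to be learned, which is why the feature list describes RoPE as avoiding learned absolute position vectors.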

Training Data

The model was pre-trained on the entire set of plant reference genomes available from Ensembl Plants, covering a phylogenetically diverse range of species spanning monocots, eudicots, basal angiosperms, gymnosperms, and non-vascular plants.

Rather than sampling uniformly across the genome, the pre-training corpus was constructed around annotated gene bodies, with windows centered on:

Promoter regions (upstream regulatory sequence)
Exons (coding sequence)
Introns (intragenic non-coding sequence)
Terminator regions (downstream regulatory sequence)

This curation forces the model to allocate its capacity toward biologically informative regions, learning regulatory motifs, splice boundaries, codon usage, and gene-architecture signatures rather than genome-wide repeat content.
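As an illustration of what centering a window on an annotated element could look like, here is a minimal sketch. The `centered_window` helper, the 0-based half-open coordinates, and the 2,048 bp window size are assumptions chosen to match the model's maximum length; the actual corpus construction pipeline is not described in this card.

```python
def centered_window(feature_start, feature_end, chrom_length, window=2048):
    """Return a 0-based half-open [start, end) window centered on a feature.

    The window is shifted, not shrunk, when it would run past a chromosome end.
    """
    midpoint = (feature_start + feature_end) // 2
    start = midpoint - window // 2
    # Clamp so the window stays inside [0, chrom_length).
    start = max(0, min(start, chrom_length - window))
    end = min(start + window, chrom_length)
    return start, end


# Hypothetical exon at 10,000-10,300 on a 50,000 bp chromosome.
print(centered_window(10_000, 10_300, 50_000))
# Features near either chromosome end still get a full-length window.
print(centered_window(100, 400, 50_000))
print(centered_window(49_900, 50_000, 50_000))
```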

Pre-training Objectives

PlantDNA-FM was trained with a multi-task self-supervised objective combining four complementary pretext tasks:

Masked Language Modeling (MLM) — the canonical ESM-style objective. Random nucleotide positions are masked and the model is trained to recover the original base from context, learning local sequence dependencies and motif-level patterns.
Gene Annotation Classification — given a sequence window, the model predicts the gene-element annotation (e.g. promoter / exon / intron / terminator). This task injects structural priors about gene architecture directly into the learned representation.
Mutation Detection — the model is trained to identify positions that carry synthetic substitutions relative to the reference, sharpening its sensitivity to evolutionarily implausible or context-violating bases.
Mutation Recovery — beyond detection, the model is trained to restore the original reference base at mutated positions, encouraging it to internalize a generative model of plant sequence likelihood.

Together, these four objectives produce representations that are simultaneously useful for likelihood-based variant scoring, functional element annotation, and general-purpose sequence embedding.
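The following sketch shows one way the per-position training signals for three of these tasks (MLM, mutation detection, mutation recovery) could be derived from a single reference window. The `corrupt` helper, the 15% mask rate, and the 5% substitution rate are hypothetical values for illustration; the card does not state the actual rates or sampling scheme. Gene annotation classification is omitted because its label is per window and comes from the annotation, not from corruption.

```python
import random

BASES = "ACGT"


def corrupt(seq, mask_rate=0.15, mutation_rate=0.05, seed=0):
    """Corrupt a reference sequence and return inputs plus per-position targets."""
    rng = random.Random(seed)
    tokens, mlm_targets, mutation_labels, recovery_targets = [], [], [], []
    for base in seq:
        roll = rng.random()
        if roll < mask_rate:
            # MLM: hide the base, the model must predict it from context.
            tokens.append("<mask>")
            mlm_targets.append(base)
            mutation_labels.append(0)
            recovery_targets.append(None)
        elif roll < mask_rate + mutation_rate:
            # Synthetic substitution: detection label 1, recovery target = reference.
            tokens.append(rng.choice([b for b in BASES if b != base]))
            mlm_targets.append(None)
            mutation_labels.append(1)
            recovery_targets.append(base)
        else:
            tokens.append(base)
            mlm_targets.append(None)
            mutation_labels.append(0)
            recovery_targets.append(None)
    return {
        "tokens": tokens,
        "mlm_targets": mlm_targets,
        "mutation_labels": mutation_labels,
        "recovery_targets": recovery_targets,
    }


example = corrupt("ATCGATCGATCG", seed=42)
print(example["tokens"])
```

In this sketch a position is either masked or mutated, never both, so each loss term sees a disjoint set of positions; whether the real training setup combines them this way is not specified.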

Quick Start

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Load the tokenizer and masked-LM checkpoint from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("ATLASBIOINFO/PlantDNA-FM")
model = AutoModelForMaskedLM.from_pretrained("ATLASBIOINFO/PlantDNA-FM")
model.eval()

seq = "ATCGATCGATCGATCG"
inputs = tokenizer(seq, return_tensors="pt")
with torch.no_grad():  # inference only, no gradients needed
    outputs = model(**inputs)
logits = outputs.logits  # (batch, seq_len, vocab_size)

For embedding extraction, load with AutoModel instead of AutoModelForMaskedLM and use the last hidden state.
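A minimal sketch of that embedding path is shown below. The helper names `masked_mean` and `embed_sequences` are hypothetical, and mean pooling over non-padding positions is one common choice, not something this card prescribes. The pooled vector here also includes any special tokens the tokenizer adds; exclude them if that matters for your use case.

```python
def masked_mean(hidden, attention_mask):
    """Average hidden states over non-padding positions.

    hidden: (batch, seq_len, dim) tensor; attention_mask: (batch, seq_len).
    """
    mask = attention_mask.unsqueeze(-1).to(hidden.dtype)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)


def embed_sequences(seqs, repo_id="ATLASBIOINFO/PlantDNA-FM"):
    """Return one pooled embedding per input sequence."""
    # Imported lazily so the pooling helper can be used without these packages.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModel.from_pretrained(repo_id).eval()

    inputs = tokenizer(seqs, return_tensors="pt", padding=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (batch, seq_len, dim)
    return masked_mean(hidden, inputs["attention_mask"])
```

Calling `embed_sequences(["ATCGATCG", "GGCCTTAA"])` would return a tensor with one row per sequence, provided the checkpoint can be downloaded in your environment.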

Limitations

The model is plant-specific; it is not expected to perform well on animal, fungal, or microbial sequences.
Pre-training emphasizes gene-body and proximal regulatory regions; intergenic, centromeric, and highly repetitive genomic contexts are under-represented.
Maximum context is 2,048 bp; longer genomic regions must be tiled.
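Single-nucleotide tokenization produces one token per base, so input length grows linearly with sequence length and self-attention cost grows quadratically with it, which raises downstream fine-tuning cost for long inputs.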
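Tiling a region longer than the maximum context can be sketched as follows. The `tile` helper and the 50% overlap (stride of 1,024) are assumptions for illustration; if the tokenizer adds special tokens, the usable window may be slightly shorter than 2,048 bases.

```python
def tile(seq, window=2048, stride=1024):
    """Split `seq` into overlapping windows, returning (offset, subsequence) pairs."""
    if len(seq) <= window:
        return [(0, seq)]
    starts = list(range(0, len(seq) - window + 1, stride))
    # Add a final window flush with the end so no bases are dropped.
    if starts[-1] + window < len(seq):
        starts.append(len(seq) - window)
    return [(start, seq[start:start + window]) for start in starts]


region = "ACGT" * 1250  # 5,000 bp placeholder sequence
tiles = tile(region)
print([offset for offset, _ in tiles])
```

Keeping the offset alongside each tile makes it straightforward to map per-base outputs back to coordinates in the original region and to average predictions where tiles overlap.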

Citation

If you use PlantDNA-FM in your research, please cite this repository.

@misc{plantdnafm2026,
  title  = {PlantDNA-FM: Plant DNA Foundation Model},
  author = {Haopeng Yu},
  year   = {2026},
  url    = {https://huggingface.co/ATLASBIOINFO/PlantDNA-FM}
}
