license: cc-by-4.0
language:
  - dna
tags:
  - biology
  - genomics
  - plant
  - dna
  - foundation-model
  - esm
  - masked-language-modeling
library_name: transformers
pipeline_tag: fill-mask

PlantDNA-FM

PlantDNA-FM is a plant DNA foundation model pre-trained on the complete collection of plant genomes from Ensembl Plants, with a specific focus on gene-body regions — including promoters, exons, introns, and terminators. The model is designed to learn the regulatory grammar and coding/non-coding sequence structure underlying plant gene expression and evolution.

Model Overview

PlantDNA-FM uses an ESM-style transformer encoder (EsmForMaskedLM) that operates at single-nucleotide resolution. The vocabulary contains only the DNA alphabet (A / T / C / G, plus N for unknown or ambiguous bases) together with the standard ESM special tokens (<cls>, <pad>, <eos>, <unk>, <mask>), for a total of 10 tokens. With a vocabulary this small, the embedding table is negligible and almost the entire parameter budget goes to the transformer layers that learn sequence representations.
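To make the 10-token vocabulary concrete, here is a minimal, dependency-free sketch of character-level encoding. The token IDs, the `VOCAB` ordering, and the `encode` helper are illustrative assumptions; the real IDs should be read from the released tokenizer. Wrapping the sequence in <cls> and <eos> follows the usual ESM tokenizer behaviour and is likewise assumed here, as is the upper-casing of the input.

```python
# Hypothetical ID assignment: the real tokenizer may order tokens differently.
SPECIAL_TOKENS = ["<cls>", "<pad>", "<eos>", "<unk>", "<mask>"]
NUCLEOTIDES = ["A", "T", "C", "G", "N"]
VOCAB = {token: idx for idx, token in enumerate(SPECIAL_TOKENS + NUCLEOTIDES)}


def encode(seq):
    """Map a DNA string to token IDs, one token per nucleotide.

    Characters outside A/T/C/G/N are mapped to <unk>. Upper-casing the
    input is a choice made for this sketch only.
    """
    ids = [VOCAB["<cls>"]]
    ids.extend(VOCAB.get(ch, VOCAB["<unk>"]) for ch in seq.upper())
    ids.append(VOCAB["<eos>"])
    return ids


print(len(VOCAB))      # size of the vocabulary
print(encode("ATCG"))  # two more tokens than there are nucleotides
```

Because every base is its own token, an input of n nucleotides always becomes n tokens plus the special tokens, which is what keeps base-level resolution intact.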

Key Technical Features

Single-nucleotide tokenization: no k-mer or BPE compression, preserving the base-level resolution required for variant and motif analysis.
Rotary Position Embeddings (RoPE): enable position-aware attention and length extrapolation without learned absolute position vectors.
Token dropout during training: improves robustness of MLM representations, following the ESM-2 recipe.
Pre-LN transformer encoder with GELU activations and post-encoder layer normalization.
Maximum supported sequence length: 2,048 nucleotides, sufficient to cover a typical plant gene body together with its flanking regulatory regions in a single forward pass.
~133M parameters, providing a favorable trade-off between representational capacity and downstream fine-tuning cost.
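The rotary scheme listed above can be illustrated with a small, dependency-free sketch. It assumes the common "rotate-half" pairing and a frequency base of 10000; both are conventions from the RoPE literature, not values confirmed by this model card, and `rope` and `dot` are hypothetical helper names. The property worth seeing is that the query-key dot product depends only on the offset between the two positions, which is why no absolute position vectors are needed.

```python
import math


def rope(vec, position, base=10000.0):
    """Apply a rotary position embedding to one query or key vector.

    Dimension i is paired with dimension i + d/2, and each pair is rotated
    by an angle proportional to the position (rotate-half convention).
    """
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("vector dimension must be even")
    half = d // 2
    out = [0.0] * d
    for i in range(half):
        theta = position * base ** (-2.0 * i / d)
        c, s = math.cos(theta), math.sin(theta)
        x1, x2 = vec[i], vec[i + half]
        out[i] = x1 * c - x2 * s
        out[i + half] = x1 * s + x2 * c
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [0.3, -1.2, 0.7, 2.0]
k = [1.1, 0.4, -0.5, 0.9]

# Same offset (3) at two different absolute positions gives the same score.
near = dot(rope(q, 5), rope(k, 2))
far = dot(rope(q, 105), rope(k, 102))
print(abs(near - far) < 1e-9)
```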

Training Data

The model was pre-trained on all plant reference genomes available from Ensembl Plants, a phylogenetically diverse set that spans monocots, eudicots, basal angiosperms, gymnosperms, and non-vascular plants.

Rather than sampling uniformly across the genome, the pre-training corpus was constructed around annotated gene bodies, with windows centered on:

Promoter regions (upstream regulatory sequence)
Exons (coding sequence)
Introns (intragenic non-coding sequence)
Terminator regions (downstream regulatory sequence)

This curation forces the model to allocate its capacity toward biologically informative regions, learning regulatory motifs, splice boundaries, codon usage, and gene-architecture signatures rather than genome-wide repeat content.
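As a rough illustration of centering a window on an annotated element, the sketch below computes a fixed-length window around a feature and clamps it to the chromosome bounds. The 2,048-nt default simply mirrors the model's maximum sequence length; the actual window sizes, coordinate conventions, and sampling rules used to build the corpus are not stated here, and `centered_window` is a hypothetical helper.

```python
def centered_window(feature_start, feature_end, chrom_length, window=2048):
    """Return a (start, end) window centered on a feature.

    Coordinates are 0-based and half-open (an assumption of this sketch).
    The window is shifted, not shrunk, when it would run past either end
    of the chromosome; it shrinks only if the chromosome itself is shorter
    than the requested window.
    """
    if not 0 <= feature_start < feature_end <= chrom_length:
        raise ValueError("feature must lie within the chromosome")
    size = min(window, chrom_length)
    midpoint = (feature_start + feature_end) // 2
    start = midpoint - size // 2
    start = max(0, min(start, chrom_length - size))
    return start, start + size


# A 400-nt exon in the middle of a chromosome.
print(centered_window(10_000, 10_400, 1_000_000))
# A promoter close to the chromosome start: the window is clamped at 0.
print(centered_window(100, 300, 1_000_000))
```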

Pre-training Objectives

PlantDNA-FM was trained with a multi-task self-supervised objective combining four complementary pretext tasks:

Masked Language Modeling (MLM) — the canonical ESM-style objective. Random nucleotide positions are masked and the model is trained to recover the original base from context, learning local sequence dependencies and motif-level patterns.
Gene Annotation Classification — given a sequence window, the model predicts the gene-element annotation (e.g. promoter / exon / intron / terminator). This task injects structural priors about gene architecture directly into the learned representation.
Mutation Detection — the model is trained to identify positions that carry synthetic substitutions relative to the reference, sharpening its sensitivity to evolutionarily implausible or context-violating bases.
Mutation Recovery — beyond detection, the model is trained to restore the original reference base at mutated positions, encouraging it to internalize a generative model of plant sequence likelihood.
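The inputs and targets for three of these tasks can be sketched in a few lines of plain Python. The masking rate (15%), the mutation rate (5%), and the helper names `mask_for_mlm` and `mutate` are assumptions for illustration; the rates and corruption scheme actually used in pre-training are not given here. The MLM helper also replaces every selected position with <mask>, which is a simplification of typical recipes.

```python
import random

BASES = "ACGT"


def mask_for_mlm(seq, rate=0.15, rng=None, mask_token="<mask>"):
    """Return (tokens, labels) for masked language modeling.

    labels[i] holds the original base at masked positions and None
    elsewhere, so the loss is computed only where a base was hidden.
    """
    rng = rng or random.Random()
    tokens = list(seq)
    labels = [None] * len(tokens)
    for i, base in enumerate(seq):
        if base in BASES and rng.random() < rate:
            labels[i] = base
            tokens[i] = mask_token
    return tokens, labels


def mutate(seq, rate=0.05, rng=None):
    """Return (mutated_seq, detection_labels, recovery_targets).

    detection_labels[i] is 1 where a synthetic substitution was introduced
    (mutation detection); recovery_targets is the untouched reference
    sequence (mutation recovery).
    """
    rng = rng or random.Random()
    mutated = list(seq)
    detection_labels = [0] * len(seq)
    for i, base in enumerate(seq):
        if base in BASES and rng.random() < rate:
            mutated[i] = rng.choice([b for b in BASES if b != base])
            detection_labels[i] = 1
    return "".join(mutated), detection_labels, seq


rng = random.Random(0)
reference = "ATCGATCGGCTA" * 10
tokens, mlm_labels = mask_for_mlm(reference, rng=rng)
mutated, detect, recover = mutate(reference, rng=rng)
print(sum(label is not None for label in mlm_labels), "masked positions")
print(sum(detect), "synthetic substitutions")
```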

Together, these four objectives produce representations that are simultaneously useful for likelihood-based variant scoring, functional element annotation, and general-purpose sequence embedding.
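One common way to turn a masked language model into a variant scorer is to mask the variant position and compare the log-probabilities the model assigns to the alternate and reference bases. The sketch below shows only that arithmetic on a single position's logits. It is an assumed scoring recipe, not a procedure documented for PlantDNA-FM; `variant_llr` is a hypothetical helper, and the token IDs in the example are made up (in practice they would come from `tokenizer.convert_tokens_to_ids`).

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]


def variant_llr(position_logits, token_ids, ref, alt):
    """Log-likelihood ratio log P(alt) - log P(ref) at one masked position.

    Negative values mean the model finds the alternate base less plausible
    than the reference base in this sequence context.
    """
    log_probs = log_softmax(position_logits)
    return log_probs[token_ids[alt]] - log_probs[token_ids[ref]]


# Illustrative IDs only; read the real ones from the tokenizer.
token_ids = {"A": 5, "T": 6, "C": 7, "G": 8}
# Made-up logits over a 10-token vocabulary for one masked position.
logits = [-4.0, -4.0, -4.0, -4.0, -4.0, 2.0, 0.0, -1.0, -1.0, -3.0]
print(variant_llr(logits, token_ids, ref="A", alt="T"))
```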

Quick Start

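import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Load the tokenizer and the masked-LM head from the Hub.
tokenizer = AutoTokenizer.from_pretrained("ATLASBIOINFO/PlantDNA-FM")
model = AutoModelForMaskedLM.from_pretrained("ATLASBIOINFO/PlantDNA-FM")

seq = "ATCGATCGATCGATCG"
inputs = tokenizer(seq, return_tensors="pt")

# Inference only, so skip gradient tracking.
with torch.no_grad():
    outputs = model(**inputs)

# Shape: (batch, seq_len, vocab_size); seq_len counts any special tokens
# added by the tokenizer as well as the nucleotides.
logits = outputs.logits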

For embedding extraction, load with AutoModel instead of AutoModelForMaskedLM and use the last hidden state.
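A minimal sketch of that embedding route, assuming mean pooling over non-padding positions suits the downstream task (no pooling strategy is prescribed here). `mean_pool` and `embed` are hypothetical helper names. Note that the pooled vector also averages in whatever special-token positions the tokenizer adds; excluding them is a reasonable variation.

```python
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "ATLASBIOINFO/PlantDNA-FM"


def mean_pool(last_hidden_state, attention_mask):
    """Average hidden states over positions where attention_mask == 1."""
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # (B, L, 1)
    summed = (last_hidden_state * mask).sum(dim=1)                   # (B, H)
    counts = mask.sum(dim=1).clamp(min=1.0)                          # (B, 1)
    return summed / counts


def embed(sequences, model_id=MODEL_ID):
    """Return one fixed-size embedding per input sequence."""
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)
    model.eval()
    batch = tokenizer(sequences, return_tensors="pt", padding=True)
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state  # (B, L, H)
    return mean_pool(hidden, batch["attention_mask"])
```

Calling `embed(["ATCGATCG", "GGCTA"])` would return a tensor with one row per sequence; padding positions in the shorter sequence do not contribute to its average.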

Limitations

The model is plant-specific; it is not expected to perform well on animal, fungal, or microbial sequences.
Pre-training emphasizes gene-body and proximal regulatory regions; intergenic, centromeric, and highly repetitive genomic contexts are under-represented.
Maximum context is 2,048 bp; longer genomic regions must be tiled.
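Single-nucleotide tokenization produces one token per base, so the token count grows linearly with sequence length and downstream fine-tuning cost grows with it; the self-attention portion of that cost grows quadratically in the token count.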
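For regions longer than the context limit, tiling can be sketched as overlapping fixed-length windows. The 50% overlap (stride of 1,024) and the `tile_sequence` helper are assumptions for illustration; if the 2,048 limit turns out to include special tokens, pass a slightly smaller `window`.

```python
def tile_sequence(seq, window=2048, stride=1024):
    """Split a long sequence into overlapping tiles of at most `window` nt.

    Returns a list of (start_offset, tile) pairs. A final tile is anchored
    at the end of the sequence so that the last bases are always covered.
    """
    if window <= 0 or stride <= 0 or stride > window:
        raise ValueError("require 0 < stride <= window")
    if len(seq) <= window:
        return [(0, seq)]
    starts = list(range(0, len(seq) - window + 1, stride))
    if starts[-1] + window < len(seq):
        starts.append(len(seq) - window)
    return [(start, seq[start:start + window]) for start in starts]


region = "ACGT" * 1250  # 5,000 nt
tiles = tile_sequence(region)
print([start for start, _ in tiles])
```

Per-position outputs from overlapping tiles then need to be merged, for example by averaging predictions wherever two tiles cover the same base.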

Citation

If you use PlantDNA-FM in your research, please cite this repository.

@misc{plantdnafm2026,
  title  = {PlantDNA-FM: Plant DNA Foundation Model},
  author = {Haopeng Yu},
  year   = {2026},
  url    = {https://huggingface.co/ATLASBIOINFO/PlantDNA-FM}
}