	---
	license: cc-by-4.0
	language:
	- dna
	tags:
	- biology
	- genomics
	- plant
	- dna
	- foundation-model
	- esm
	- masked-language-modeling
	library_name: transformers
	pipeline_tag: fill-mask
	---

	# PlantDNA-FM

	PlantDNA-FM is a plant DNA foundation model pre-trained on the complete
	collection of plant genomes from Ensembl Plants, with a specific focus on
	gene-body regions — including promoters, exons, introns, and terminators.
	The model is designed to learn the regulatory grammar and coding/non-coding
	sequence structure underlying plant gene expression and evolution.

	## Model Overview

	PlantDNA-FM adopts an ESM-style transformer encoder architecture
	(`EsmForMaskedLM`) at single-nucleotide resolution. The vocabulary contains
	only the four canonical DNA bases (A / T / C / G), the ambiguity code N, and the
	standard ESM special tokens (`<cls>`, `<pad>`, `<eos>`, `<unk>`, `<mask>`).
	The resulting 10-token vocabulary keeps the embedding table negligible in
	size, so nearly all parameters go to the transformer layers.
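
To make the vocabulary concrete, here is a minimal sketch of single-nucleotide encoding in plain Python. The token ids, the uppercasing step, and the `<cls>`/`<eos>` wrapping are assumptions for illustration; the tokenizer shipped with the model is the authoritative source for the real ids and behavior.

```python
# Illustrative 10-token vocabulary: 5 special tokens + 5 nucleotide symbols.
# The id order is an assumption and may differ from the released tokenizer.
SPECIAL_TOKENS = ["<cls>", "<pad>", "<eos>", "<unk>", "<mask>"]
NUCLEOTIDES = ["A", "T", "C", "G", "N"]
VOCAB = {token: idx for idx, token in enumerate(SPECIAL_TOKENS + NUCLEOTIDES)}


def encode(seq):
    """Map a DNA string to token ids, one id per nucleotide."""
    ids = [VOCAB["<cls>"]]
    for base in seq.upper():  # uppercasing is a choice made for this sketch
        ids.append(VOCAB.get(base, VOCAB["<unk>"]))
    ids.append(VOCAB["<eos>"])
    return ids
```

Because every base maps to exactly one token, a sequence of `n` nucleotides yields `n + 2` ids under these assumptions.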

	### Key Technical Features

	- Single-nucleotide tokenization — no k-mer or BPE compression, preserving
	base-level resolution required for variant and motif analysis.
	- Rotary Position Embeddings (RoPE), which make attention position-aware
	without learned absolute position vectors and can help with length
	extrapolation.
	- Token dropout during training — improves robustness of MLM
	representations, following the ESM-2 recipe.
	- Pre-LN transformer encoder with GELU activations and post-encoder
	layer normalization.
	- Maximum supported sequence length: 2,048 nucleotides, sufficient to
	cover a typical plant gene body together with its flanking regulatory
	regions in a single forward pass.
	- ~133M parameters, providing a favorable trade-off between
	representational capacity and downstream fine-tuning cost.
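
The rotary embedding idea in the list above can be illustrated with a few lines of plain Python. This is a generic sketch of RoPE, not the model's implementation: the interleaved pairing of dimensions and the base of 10000 are assumptions, and libraries differ in how they pair dimensions. What it shows is the key property that the query-key dot product depends only on the relative offset between positions.

```python
import math


def rope(vec, pos, base=10000.0):
    """Rotate consecutive pairs (vec[i], vec[i+1]) by an angle proportional to pos."""
    dim = len(vec)
    if dim % 2:
        raise ValueError("vector length must be even")
    out = []
    for i in range(0, dim, 2):
        theta = pos * base ** (-i / dim)  # lower frequency for later pairs
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * cos_t - y * sin_t, x * sin_t + y * cos_t])
    return out


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))
```

For a query at position `m` and a key at position `n`, `dot(rope(q, m), rope(k, n))` is the same for any pair of positions with the same difference `m - n`, which is why no absolute position table is needed.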

	## Training Data

	The model was pre-trained on the **entire set of plant reference genomes
	available from Ensembl Plants**, covering a phylogenetically diverse range of
	species spanning monocots, eudicots, basal angiosperms, gymnosperms, and
	non-vascular plants.

	Rather than sampling uniformly across the genome, the pre-training corpus
	was constructed around annotated gene bodies, with windows centered on:

	- Promoter regions (upstream regulatory sequence)
	- Exons (coding sequence)
	- Introns (intragenic non-coding sequence)
	- Terminator regions (downstream regulatory sequence)

	This curation forces the model to allocate its capacity toward
	biologically informative regions, learning regulatory motifs, splice
	boundaries, codon usage, and gene-architecture signatures rather than
	genome-wide repeat content.
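
The model card does not describe how the windows were actually built, so the following is only a hypothetical sketch of gene-centered window extraction. The 1,000 bp flank size, the 0-based half-open coordinates, and the function names `flanking_regions` and `centered_window` are all assumptions made for illustration.

```python
def flanking_regions(gene_start, gene_end, strand, chrom_len, flank=1000):
    """Return (promoter, terminator) as half-open [start, end) intervals.

    Coordinates are 0-based on the forward strand; `flank` is an assumed size.
    """
    left = (max(0, gene_start - flank), gene_start)
    right = (gene_end, min(chrom_len, gene_end + flank))
    # On the reverse strand, "upstream" lies to the right of the gene.
    return (left, right) if strand == "+" else (right, left)


def centered_window(interval, chrom_len, width=2048):
    """Window of `width` bp centered on `interval`, shifted to stay in bounds."""
    width = min(width, chrom_len)
    center = (interval[0] + interval[1]) // 2
    start = max(0, min(center - width // 2, chrom_len - width))
    return start, start + width
```

The same `centered_window` call would apply to exon and intron intervals taken directly from an annotation file.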

	## Pre-training Objectives

	PlantDNA-FM was trained with a multi-task self-supervised objective
	combining four complementary pretext tasks:

	1. Masked Language Modeling (MLM) — the canonical ESM-style objective.
	Random nucleotide positions are masked and the model is trained to
	recover the original base from context, learning local sequence
	dependencies and motif-level patterns.

	2. Gene Annotation Classification — given a sequence window, the model
	predicts the gene-element annotation (e.g. promoter / exon / intron /
	terminator). This task injects structural priors about gene
	architecture directly into the learned representation.

	3. Mutation Detection — the model is trained to identify positions that
	carry synthetic substitutions relative to the reference, sharpening its
	sensitivity to evolutionarily implausible or context-violating bases.

	4. Mutation Recovery — beyond detection, the model is trained to
	restore the original reference base at mutated positions, encouraging
	it to internalize a generative model of plant sequence likelihood.
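
To illustrate how training inputs for the masking and mutation tasks could be produced, here is a minimal sketch in plain Python. The 15% masking rate and 5% mutation rate are assumptions (the card does not state the rates used), and the function names are hypothetical.

```python
import random

BASES = "ACGT"


def mask_positions(seq, rate=0.15, rng=None):
    """Replace random positions with <mask>; targets hold the original base."""
    rng = rng if rng is not None else random.Random()
    tokens, targets = [], []
    for base in seq:
        if rng.random() < rate:
            tokens.append("<mask>")
            targets.append(base)  # the MLM loss is computed only here
        else:
            tokens.append(base)
            targets.append(None)
    return tokens, targets


def mutate(seq, rate=0.05, rng=None):
    """Introduce synthetic substitutions and return per-position 0/1 labels."""
    rng = rng if rng is not None else random.Random()
    mutated, labels = [], []
    for base in seq:
        if base in BASES and rng.random() < rate:
            mutated.append(rng.choice([b for b in BASES if b != base]))
            labels.append(1)  # detection target: this position was changed
        else:
            mutated.append(base)
            labels.append(0)
    return "".join(mutated), labels
```

In this sketch the labels from `mutate` serve as the detection target, while the untouched input `seq` serves as the recovery target at positions labeled 1.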

	Together, these four objectives produce representations that are
	simultaneously useful for likelihood-based variant scoring,
	functional element annotation, and general-purpose sequence embedding.
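
One common way to do likelihood-based variant scoring with a masked language model is a log-likelihood ratio between the alternate and reference base at a masked position. The sketch below assumes that approach; the card does not confirm it as the author's protocol, and the function names and input format (a dict from base to logit) are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]


def variant_llr(base_logits, ref, alt):
    """log P(alt) - log P(ref) at one masked position.

    `base_logits` maps each nucleotide to the model's logit at that position.
    """
    bases = list(base_logits)
    log_probs = dict(zip(bases, log_softmax([base_logits[b] for b in bases])))
    return log_probs[alt] - log_probs[ref]
```

A negative score means the model finds the alternate base less likely than the reference in that context. Because the normalizer cancels, the score equals the raw logit difference between the two bases.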

	## Quick Start

	```python
	from transformers import AutoTokenizer, AutoModelForMaskedLM

	tokenizer = AutoTokenizer.from_pretrained("ATLASBIOINFO/PlantDNA-FM")
	model = AutoModelForMaskedLM.from_pretrained("ATLASBIOINFO/PlantDNA-FM")

	seq = "ATCGATCGATCGATCG"
	inputs = tokenizer(seq, return_tensors="pt")  # one token per nucleotide, plus any special tokens the tokenizer adds
	outputs = model(**inputs)
	logits = outputs.logits  # shape: (batch, num_tokens, vocab_size)
	```

	For embedding extraction, load with `AutoModel` instead of
	`AutoModelForMaskedLM` and use the last hidden state.
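
The embedding route described above can be sketched as follows. This is a minimal sketch, assuming `torch` is installed and that mean pooling over non-padding tokens is an acceptable sequence summary; the pooling choice and the helper name `embed_sequences` are illustrative and not taken from the model card.

```python
def embed_sequences(seqs, model_id="ATLASBIOINFO/PlantDNA-FM"):
    """Return one mean-pooled embedding per input sequence (hypothetical helper)."""
    # Imports are local so that defining the helper does not load heavy dependencies.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)
    model.eval()

    inputs = tokenizer(list(seqs), return_tensors="pt", padding=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (batch, num_tokens, hidden_size)

    # Assumption: average over all non-padding tokens, special tokens included.
    mask = inputs["attention_mask"].unsqueeze(-1).to(hidden.dtype)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)
```

Calling `embed_sequences(["ATCGATCGATCGATCG"])` downloads the checkpoint on first use and returns a tensor with one row per sequence. Other pooling choices, such as taking the `<cls>` position, may work better for a given task.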

	## Limitations

	- The model is plant-specific; it is not expected to perform well on
	animal, fungal, or microbial sequences.
	- Pre-training emphasizes gene-body and proximal regulatory regions;
	intergenic, centromeric, and highly repetitive genomic contexts are
	under-represented.
	- Maximum context is 2,048 bp; longer genomic regions must be tiled.
	- Single-nucleotide tokenization means one token per base, so the token
	count grows linearly with sequence length and self-attention cost grows
	quadratically, which raises downstream fine-tuning cost.
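
Tiling a region longer than the context limit can be sketched as below. The default window of 2,046 bp assumes that two special tokens count against the 2,048 limit, which the card does not state; the 50% overlap and the function name `tile_sequence` are likewise assumptions.

```python
def tile_sequence(seq, window=2046, stride=1023):
    """Split `seq` into overlapping tiles; returns a list of (start, subsequence)."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    length = len(seq)
    if length <= window:
        return [(0, seq)]
    starts = list(range(0, length - window + 1, stride))
    # Add a final tile flush with the end so no bases are left uncovered.
    if starts[-1] != length - window:
        starts.append(length - window)
    return [(start, seq[start:start + window]) for start in starts]
```

Each tile keeps its start offset, so per-position outputs can be mapped back to genomic coordinates and averaged where tiles overlap.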

	## Citation

	If you use PlantDNA-FM in your research, please cite this repository.

	```bibtex
	@misc{plantdnafm2026,
	title = {PlantDNA-FM: Plant DNA Foundation Model},
	author = {Haopeng Yu},
	year = {2026},
	url = {https://huggingface.co/ATLASBIOINFO/PlantDNA-FM}
	}
	```