PlantPLM-35M
ESM-2 35M parameter model continued-pretrained on Viridiplantae (plant) protein sequences.
This is a domain-adapted version of facebook/esm2_t12_35M_UR50D, fine-tuned on a curated subset of UniProt TrEMBL containing only plant-kingdom proteins. The adaptation improves representation quality for plant-specific protein tasks compared to the general-purpose ESM-2 baseline.
Part of the Plant-PLM collection: ESM-2 models at 8M, 35M, 150M, and 650M parameters, each adapted on the same plant protein corpus.
Model Description
| Property | Value |
|---|---|
| Base model | facebook/esm2_t12_35M_UR50D |
| Architecture | ESM-2 · 12 layers · hidden=480 · heads=20 · FFN=1920 |
| Position embeddings | Rotary (RoPE) |
| Vocabulary | 33 tokens (20 standard + rare amino acids + special tokens) |
| Parameters | 33.5M (full-parameter continued pretraining) |
| Training objective | Masked Language Modeling (MLM, 15% masking) |
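The architecture figures above can be spot-checked directly from the published checkpoint. A minimal sketch, assuming the `transformers` library is installed (the values in the comments mirror the table):

```python
from transformers import EsmForMaskedLM, EsmTokenizer

# Load the checkpoint and print the architecture figures quoted in the table above.
model = EsmForMaskedLM.from_pretrained("dipayan26/PlantPLM-35M")
tokenizer = EsmTokenizer.from_pretrained("dipayan26/PlantPLM-35M")

cfg = model.config
print("layers:", cfg.num_hidden_layers)                     # 12
print("hidden size:", cfg.hidden_size)                      # 480
print("attention heads:", cfg.num_attention_heads)          # 20
print("FFN size:", cfg.intermediate_size)                   # 1920
print("position embeddings:", cfg.position_embedding_type)  # "rotary"
print("vocab size:", tokenizer.vocab_size)                   # 33
print("parameters (M):", round(sum(p.numel() for p in model.parameters()) / 1e6, 1))
```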
Training Data
| Property | Value |
|---|---|
| Source | UniProt TrEMBL — Viridiplantae (plant kingdom) subset |
| Sequences | 19,938,415 protein sequences |
| Avg sequence length | 339 AA · median 291 AA |
| Estimated total tokens | ~6.76 billion amino acid tokens |
| Tokens seen during training | 546 million (≈ 0.08 passes over the full dataset) |
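The token-level figures follow from the corpus statistics by simple arithmetic; a quick back-of-the-envelope check:

```python
# Rough check of the corpus-size and coverage figures quoted above.
n_sequences = 19_938_415        # plant sequences in the TrEMBL subset
avg_length = 339                # average amino acids per sequence
total_tokens = n_sequences * avg_length
print(f"estimated corpus: {total_tokens / 1e9:.2f}B tokens")        # ~6.76B

tokens_seen = 546e6             # tokens consumed over the training run
print(f"passes over the corpus: {tokens_seen / total_tokens:.2f}")  # ~0.08
```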
Training Details
| Hyperparameter | Value |
|---|---|
| Training steps | 55,000 optimizer steps |
| Batch size | 64 sequences (32 per micro-batch × 2 gradient accumulation steps) |
| Optimizer | AdamW · β=(0.9, 0.98) · ε=1e-8 · weight_decay=0.01 |
| Learning rate | 2e-5 (20× lower than ESM-2 from-scratch to prevent catastrophic forgetting) |
| LR schedule | Linear warmup (500 steps) → linear decay |
| Gradient clipping | 1.0 |
| Precision | 16-bit mixed (fp16 activations, fp32 optimizer states) |
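For readers who want to reproduce a similar run, the table above maps roughly onto a Hugging Face `TrainingArguments` configuration as sketched below. This is an illustrative mapping, not the published training script; dataset loading, tokenization, and the `Trainer` call are omitted.

```python
from transformers import DataCollatorForLanguageModeling, EsmTokenizer, TrainingArguments

tokenizer = EsmTokenizer.from_pretrained("facebook/esm2_t12_35M_UR50D")

# MLM objective with 15% masking, as in the Model Description table.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)

# Hypothetical mapping of the hyperparameter table onto TrainingArguments.
training_args = TrainingArguments(
    output_dir="plantplm-35m",
    max_steps=55_000,                   # optimizer steps
    per_device_train_batch_size=32,     # micro-batch size
    gradient_accumulation_steps=2,      # effective batch of 64 sequences
    learning_rate=2e-5,
    adam_beta1=0.9,
    adam_beta2=0.98,
    adam_epsilon=1e-8,
    weight_decay=0.01,
    warmup_steps=500,
    lr_scheduler_type="linear",         # linear warmup then linear decay
    max_grad_norm=1.0,
    fp16=True,                          # mixed precision (fp32 optimizer states)
)
```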
Final metrics (validation set, 5% holdout):
| Metric | Value |
|---|---|
| val/mlm_loss | 2.075 |
| val/perplexity | 7.96 |
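The two numbers are consistent: MLM perplexity is the exponential of the cross-entropy loss.

```python
import math

# Perplexity is exp(MLM cross-entropy loss).
print(round(math.exp(2.075), 2))  # 7.96
```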
Downstream Task Performance (Linear Probe)
Frozen [CLS] embeddings evaluated on 2,000 reviewed Arabidopsis thaliana proteins from UniProt SwissProt using a logistic regression linear probe. Compared against the vanilla facebook/esm2_t12_35M_UR50D baseline.
| Task | Vanilla ESM-2 35M | PlantPLM-35M | Δ |
|---|---|---|---|
| Subcellular localization (9-class accuracy) | 91.87% | 94.28% | +2.41% |
| Subcellular localization (macro-F1) | 92.57% | 94.86% | +2.29% |
| GO-term prediction (macro-AUROC, top-50 terms) | 94.26% | 94.82% | +0.56% |
Test set: 332 proteins (localization) · 396 proteins (GO terms) · 9 localization classes · 50 GO terms evaluated.
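A minimal sketch of this probing setup, assuming `transformers` and `scikit-learn` are installed. The labeled SwissProt split, probe hyperparameters, and preprocessing behind the reported numbers are not published with the card, so the helpers below only illustrate the frozen-embedding plus logistic-regression recipe; `train_seqs`, `train_labels`, `test_seqs`, and `test_labels` are placeholders for a labeled dataset.

```python
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from transformers import EsmForMaskedLM, EsmTokenizer

model = EsmForMaskedLM.from_pretrained("dipayan26/PlantPLM-35M").eval()
tokenizer = EsmTokenizer.from_pretrained("dipayan26/PlantPLM-35M")

def cls_embed(seq):
    """Frozen [CLS] embedding of one protein sequence (no gradients, no fine-tuning)."""
    inputs = tokenizer(seq, return_tensors="pt", truncation=True, max_length=1024)
    with torch.no_grad():
        hidden = model.esm(**inputs).last_hidden_state
    return hidden[0, 0, :].numpy()

def linear_probe(train_seqs, train_labels, test_seqs, test_labels):
    """Fit a logistic-regression probe on frozen embeddings; return accuracy and macro-F1."""
    X_train = np.stack([cls_embed(s) for s in train_seqs])
    X_test = np.stack([cls_embed(s) for s in test_seqs])
    probe = LogisticRegression(max_iter=1000).fit(X_train, train_labels)
    pred = probe.predict(X_test)
    return accuracy_score(test_labels, pred), f1_score(test_labels, pred, average="macro")
```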
Usage
```python
from transformers import EsmForMaskedLM, EsmTokenizer
import torch

model = EsmForMaskedLM.from_pretrained("dipayan26/PlantPLM-35M")
tokenizer = EsmTokenizer.from_pretrained("dipayan26/PlantPLM-35M")

# --- Masked token prediction ---
sequence = "MSPQTETKASVGFKAGVKDYKLTYYTPEYETK"
inputs = tokenizer(sequence, return_tensors="pt")

# mask one position
inputs["input_ids"][0, 5] = tokenizer.mask_token_id

with torch.no_grad():
    logits = model(**inputs).logits

masked_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[0, 1]
top5 = logits[0, masked_pos].topk(5)
print(tokenizer.convert_ids_to_tokens(top5.indices.tolist()))

# --- Sequence embedding ([CLS] token) ---
inputs = tokenizer(sequence, return_tensors="pt")
with torch.no_grad():
    hidden = model.esm(**inputs).last_hidden_state

cls_embedding = hidden[0, 0, :]  # shape: [480]
print("Embedding shape:", cls_embedding.shape)
```
Intended Use
- Plant protein function prediction — GO term annotation, subcellular localization, signal peptide detection
- Plant-specific protein embeddings — clustering, retrieval, similarity search
- Transfer learning starting point — fine-tune on small labeled plant protein datasets (a fine-tuning sketch follows this list)
- Baseline comparison — benchmark against PlantPLM-8M / 150M / 650M variants
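For the transfer-learning use case, a hedged sketch of a sequence-classification fine-tune on top of this checkpoint; the dataset, `num_labels=9`, and the hyperparameters below are illustrative placeholders, not published values.

```python
from transformers import EsmForSequenceClassification, EsmTokenizer, Trainer, TrainingArguments

# A fresh classification head is initialized on top of the adapted encoder;
# num_labels=9 mirrors the localization task above and is only an example.
model = EsmForSequenceClassification.from_pretrained("dipayan26/PlantPLM-35M", num_labels=9)
tokenizer = EsmTokenizer.from_pretrained("dipayan26/PlantPLM-35M")

args = TrainingArguments(
    output_dir="plantplm-35m-localization",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=1e-5,
)

# Supply your own tokenized, labeled dataset here:
# trainer = Trainer(model=model, args=args, train_dataset=tokenized_train, eval_dataset=tokenized_eval)
# trainer.train()
```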
Out-of-scope Use
- Non-plant organisms — the model has been shifted toward Viridiplantae statistics; use the original facebook/esm2_t12_35M_UR50D for general protein tasks
- Structural prediction — not trained for structure; use ESMFold for that
Limitations
- Trained for only 0.08 passes over the plant corpus (546M / 6.76B tokens) — larger models in this collection see more of the data
- For highest downstream accuracy, the 150M variant is recommended
Citation
If you use this model, please cite:
```bibtex
@misc{sarkar2026plantplm,
  author       = {Sarkar, Dipayan},
  title        = {PlantPLM: Domain-Adaptive Pretraining of ESM-2 on Viridiplantae Proteins},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/dipayan26/PlantPLM-35M}},
}
```