---
language:
- en
license: mit
tags:
- biology
- protein
- esm2
- plant
- viridiplantae
- masked-language-modeling
- domain-adaptation
base_model: facebook/esm2_t6_8M_UR50D
datasets:
- uniprot-trembl-viridiplantae
pipeline_tag: fill-mask
---
# PlantPLM-8M
<img src="Plant_PLM_logo.png" alt="PlantPLM logo" width="800">
**ESM-2 8M parameter model continued-pretrained on Viridiplantae (plant) protein sequences.**
This is a domain-adapted version of [`facebook/esm2_t6_8M_UR50D`](https://huggingface.co/facebook/esm2_t6_8M_UR50D), continued-pretrained on a curated subset of UniProt TrEMBL containing only plant-kingdom proteins. On the linear-probe benchmarks reported in this card, the adapted model scores slightly higher than the general-purpose ESM-2 baseline on *Arabidopsis thaliana* protein tasks.
Part of the **[Plant-PLM](https://huggingface.co/collections/dipayan26/plant-plm)** collection: ESM-2 models at 8M, 35M, 150M, and 650M parameters, each adapted on the same plant protein corpus.
---
## Model Description
| Property | Value |
|---|---|
| Base model | `facebook/esm2_t6_8M_UR50D` |
| Architecture | ESM-2 · 6 layers · hidden=320 · heads=20 · FFN=1280 |
| Position embeddings | Rotary (RoPE) |
| Vocabulary | 33 tokens (20 standard + rare amino acids + special tokens) |
| Parameters | 7.5M (full-parameter continued pretraining) |
| Training objective | Masked Language Modeling (MLM, 15% masking) |
---
## Training Data
| Property | Value |
|---|---|
| Source | UniProt TrEMBL — Viridiplantae (plant kingdom) subset |
| Sequences | **19,938,415** protein sequences |
| Avg sequence length | 339 AA · median 291 AA |
| Estimated total tokens | **~6.76 billion** amino acid tokens |
| Tokens seen during training | **800 million** (≈ 0.12 passes over the full dataset) |
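
The token estimate and the "0.12 passes" figure follow directly from the numbers in the table. A minimal arithmetic check, assuming the estimate is simply sequence count times average length (special tokens ignored):

```python
# Figures copied from the Training Data table.
num_sequences = 19_938_415
avg_length_aa = 339
tokens_seen = 800_000_000

# Estimated corpus size: sequences x average length (special tokens ignored).
total_tokens = num_sequences * avg_length_aa

# Fraction of the corpus covered by the training token budget.
passes = tokens_seen / total_tokens

print(f"Estimated corpus size: {total_tokens / 1e9:.2f}B tokens")  # 6.76B
print(f"Passes over the corpus: {passes:.2f}")                     # 0.12
```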
---
## Training Details
| Hyperparameter | Value |
|---|---|
| Token budget | 800M tokens (training stopped at budget, not epoch end) |
| Batch size | 64 sequences |
| Optimizer | AdamW · β=(0.9, 0.98) · ε=1e-8 · weight_decay=0.01 |
| Learning rate | 2e-5 (20× lower than ESM-2 from-scratch to prevent catastrophic forgetting) |
| LR schedule | Linear warmup (500 steps) → linear decay |
| Gradient clipping | 1.0 |
| Precision | 16-bit mixed (bf16 activations, fp32 optimizer states) |
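
The learning-rate schedule in the table can be sketched as a plain function. This is an illustration, not the training code: `total_steps` is a hypothetical placeholder (the card does not state the step count), and decaying all the way to zero is an assumption.

```python
def lr_at_step(step, peak_lr=2e-5, warmup_steps=500, total_steps=10_000):
    """Linear warmup to peak_lr, then linear decay.

    total_steps is a hypothetical placeholder; decay to zero is assumed.
    """
    if step < warmup_steps:
        # Warmup: ramp linearly from 0 to peak_lr.
        return peak_lr * step / warmup_steps
    # Decay: ramp linearly from peak_lr down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / (total_steps - warmup_steps)


for step in (0, 250, 500, 5_250, 10_000):
    print(step, lr_at_step(step))
```

With these placeholder values the rate is half the peak at step 250, reaches 2e-5 at step 500, and is back to half the peak midway through the decay phase.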
**Final metrics (validation set, 5% holdout):**
| Metric | Value |
|---|---|
| `val/mlm_loss` | 2.292 |
| `val/perplexity` | 9.92 |
| `val/masked_token_acc` | 31.0% |
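
For orientation, perplexity is the exponential of the mean cross-entropy loss, and a uniform guess over the 20 standard amino acids would be right 5% of the time. The sketch below assumes the loss is reported in nats:

```python
import math

val_mlm_loss = 2.292  # from the metrics table; assumed to be in nats

# Perplexity implied by the mean loss.
perplexity_from_loss = math.exp(val_mlm_loss)

# Accuracy of guessing uniformly among the 20 standard amino acids.
uniform_baseline_acc = 1 / 20

print(f"exp(loss) = {perplexity_from_loss:.2f}")
print(f"uniform-guess accuracy = {uniform_baseline_acc:.1%}")
```

`exp(2.292)` comes out near 9.89, slightly below the logged 9.92. One possible explanation, not confirmed by the card, is that perplexity was averaged per batch rather than computed from the mean loss; per-batch averaging can only give an equal or higher value.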
---
## Downstream Task Performance (Linear Probe)
Frozen [CLS] embeddings were evaluated on 2,000 reviewed *Arabidopsis thaliana* proteins from UniProtKB/Swiss-Prot using a logistic regression linear probe. Results are compared against the vanilla `facebook/esm2_t6_8M_UR50D` baseline.
| Task | Vanilla ESM-2 8M | PlantPLM-8M | Δ |
|---|---|---|---|
| Subcellular localization (9-class accuracy) | 91.6% | **93.7%** | +2.1 pp |
| GO-term prediction (macro-AUROC, top-50 terms) | 94.7% | **95.0%** | +0.3 pp |
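
A linear probe of this kind can be sketched with scikit-learn. The example uses synthetic vectors in place of real embeddings so it runs on its own; the 80/20 stratified split, the solver settings, and all variable names are assumptions, since the card does not describe the evaluation protocol in detail.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Placeholder data: in practice X would hold frozen [CLS] embeddings
# (n_proteins x 320) and y the integer class labels (e.g. 9 classes).
n_proteins, hidden_size, n_classes = 2000, 320, 9
y = rng.integers(0, n_classes, size=n_proteins)
class_centers = rng.normal(size=(n_classes, hidden_size))
X = class_centers[y] + rng.normal(scale=2.0, size=(n_proteins, hidden_size))

# Hold out 20% for evaluation (split ratio is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# The probe is the only trained component; the encoder stays frozen.
probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)
accuracy = probe.score(X_test, y_test)
print(f"Probe accuracy on synthetic data: {accuracy:.3f}")
```

The accuracy printed here reflects only the synthetic data and says nothing about the model; swapping `X` for real embeddings from both encoders gives the comparison in the table.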
---
## Usage
```python
from transformers import EsmForMaskedLM, EsmTokenizer
import torch

model = EsmForMaskedLM.from_pretrained("dipayan26/PlantPLM-8M")
tokenizer = EsmTokenizer.from_pretrained("dipayan26/PlantPLM-8M")

# --- Masked token prediction ---
sequence = "MSPQTETKASVGFKAGVKDYKLTYYTPEYETK"
inputs = tokenizer(sequence, return_tensors="pt")

# Mask one position (index 0 is the [CLS] token, so index 5 is the 5th residue)
inputs["input_ids"][0, 5] = tokenizer.mask_token_id

with torch.no_grad():
    logits = model(**inputs).logits

# Locate the masked position and list the five most likely tokens
masked_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[0, 1]
top5 = logits[0, masked_pos].topk(5)
print(tokenizer.convert_ids_to_tokens(top5.indices.tolist()))

# --- Sequence embedding ([CLS] token) ---
inputs = tokenizer(sequence, return_tensors="pt")
with torch.no_grad():
    hidden = model.esm(**inputs).last_hidden_state

cls_embedding = hidden[0, 0, :]  # shape: [320]
print("Embedding shape:", cls_embedding.shape)
```
---
## Intended Use
- **Plant protein function prediction** — GO term annotation, subcellular localization, signal peptide detection
- **Plant-specific protein embeddings** — clustering, retrieval, similarity search
- **Transfer learning starting point** — fine-tune on small labeled plant protein datasets
- **Baseline comparison** — benchmark against larger PlantPLM-35M / 150M / 650M variants
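
For the retrieval and similarity-search use case, embeddings are typically compared by cosine similarity. A minimal sketch with toy 3-dimensional vectors standing in for the 320-dimensional [CLS] embeddings; the helper names and protein labels are hypothetical:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, database):
    """Return (name, score) pairs sorted by descending cosine similarity."""
    scores = [(name, cosine_similarity(query, vec)) for name, vec in database.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy vectors; real entries would be embeddings produced by the model.
database = {
    "protein_a": [1.0, 0.0, 0.0],
    "protein_b": [0.7, 0.7, 0.0],
    "protein_c": [0.0, 0.0, 1.0],
}
ranking = rank_by_similarity([1.0, 0.1, 0.0], database)
print([name for name, _ in ranking])  # ['protein_a', 'protein_b', 'protein_c']
```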
## Out-of-scope Use
- Non-plant organisms — the model has been shifted toward Viridiplantae statistics; use the original `facebook/esm2_t6_8M_UR50D` for general protein tasks
- Structural prediction — not trained for structure; use ESMFold for that
---
## Limitations
- Trained for only 0.12 passes over the plant corpus (800M / 6.76B tokens) — larger models in this collection see more of the data
- 8M capacity limits representation richness; the 35M and 150M variants are recommended for downstream fine-tuning
---
## Citation
If you use this model, please cite:
```bibtex
@misc{sarkar2026plantplm,
author = {Sarkar, Dipayan},
title = {PlantPLM: Domain-Adaptive Pretraining of ESM-2 on Viridiplantae Proteins},
year = {2026},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/dipayan26/PlantPLM-8M}},
}
```
---
<!-- ## Training Code
[github.com/Dipayan26/Plant-Protein-BERT](https://github.com/Dipayan26/Plant-Protein-BERT) -->