---
language:
- en
license: mit
tags:
- biology
- protein
- esm2
- plant
- viridiplantae
- masked-language-modeling
- domain-adaptation
base_model: facebook/esm2_t6_8M_UR50D
datasets:
- uniprot-trembl-viridiplantae
pipeline_tag: fill-mask
---

# PlantPLM-8M

<img src="Plant_PLM_logo.png" alt="PlantPLM logo" width="800">

**ESM-2 8M-parameter model with continued pretraining on Viridiplantae (plant) protein sequences.**

This is a domain-adapted version of [`facebook/esm2_t6_8M_UR50D`](https://huggingface.co/facebook/esm2_t6_8M_UR50D), adapted by continued masked-language-model pretraining on a curated subset of UniProt TrEMBL containing only plant-kingdom proteins. In linear-probe evaluations on *Arabidopsis thaliana* proteins (see Downstream Task Performance), the adapted model scores slightly higher than the general-purpose ESM-2 baseline.

Part of the **[Plant-PLM](https://huggingface.co/collections/dipayan26/plant-plm)** collection: ESM-2 models at 8M, 35M, 150M, and 650M parameters, each adapted on the same plant protein corpus.


---

## Model Description

| Property | Value |
|---|---|
| Base model | `facebook/esm2_t6_8M_UR50D` |
| Architecture | ESM-2 · 6 layers · hidden=320 · heads=20 · FFN=1280 |
| Position embeddings | Rotary (RoPE) |
| Vocabulary | 33 tokens (20 standard + rare amino acids + special tokens) |
| Parameters | 7.5M (full-parameter continued pretraining) |
| Training objective | Masked Language Modeling (MLM, 15% masking) |

---

## Training Data

| Property | Value |
|---|---|
| Source | UniProt TrEMBL, Viridiplantae (plant kingdom) subset |
| Sequences | **19,938,415** protein sequences |
| Sequence length | mean 339 AA · median 291 AA |
| Estimated total tokens | **~6.76 billion** amino acid tokens |
| Tokens seen during training | **800 million** (≈ 0.12 passes over the full dataset) |

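
The token figures in the table follow from the sequence count and the mean length. A minimal sketch of that arithmetic, using only numbers stated above (the variable names are illustrative, and special tokens are ignored):

```python
# Figures taken from the Training Data table
num_sequences = 19_938_415
mean_length_aa = 339
tokens_seen = 800_000_000

# Estimated corpus size: one token per amino acid (special tokens ignored)
total_tokens = num_sequences * mean_length_aa

# Fraction of the corpus covered by the 800M-token budget
passes = tokens_seen / total_tokens

print(f"estimated corpus tokens: {total_tokens / 1e9:.2f}B")
print(f"passes over corpus: {passes:.2f}")
```

The product lands at roughly 6.76 billion tokens, so an 800M-token budget covers about 12% of the corpus.
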

---

## Training Details

| Hyperparameter | Value |
|---|---|
| Token budget | 800M tokens (training stopped at the budget, not at an epoch boundary) |
| Batch size | 64 sequences |
| Optimizer | AdamW · β=(0.9, 0.98) · ε=1e-8 · weight_decay=0.01 |
| Learning rate | 2e-5 (20× lower than the ESM-2 from-scratch rate, to limit catastrophic forgetting) |
| LR schedule | Linear warmup (500 steps) → linear decay |
| Gradient clipping | 1.0 |
| Precision | 16-bit mixed precision (bf16 activations, fp32 optimizer states) |

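
The warmup-then-decay schedule in the table can be sketched as a plain function. This is an illustration rather than the training code: `lr_at` and `total_steps` are hypothetical names, the decay is assumed to reach zero at the final step, and the total number of optimizer steps is not stated in this card.

```python
def lr_at(step: int, total_steps: int, peak_lr: float = 2e-5, warmup_steps: int = 500) -> float:
    """Linear warmup to peak_lr, then linear decay to 0 at total_steps (assumed)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / max(total_steps - warmup_steps, 1)


# total_steps is a placeholder value chosen only for this example
total_steps = 10_000
for step in (0, 250, 500, 5_250, 10_000):
    print(step, lr_at(step, total_steps))
```

In a PyTorch training loop, `transformers.get_linear_schedule_with_warmup` produces a schedule of the same shape; whether that helper was used for this model is not stated here.
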

**Final metrics (validation set, 5% holdout):**

| Metric | Value |
|---|---|
| `val/mlm_loss` | 2.292 |
| `val/perplexity` | 9.92 |
| `val/masked_token_acc` | 31.0% |

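
Perplexity is conventionally the exponential of the mean cross-entropy loss measured in nats. The quick check below assumes that convention. How the reported `val/perplexity` was aggregated is not stated here, so a small gap from the table is possible, for example if perplexity was averaged per batch rather than computed from the final mean loss.

```python
import math

# Value from the metrics table (assumed to be mean cross-entropy in nats)
val_mlm_loss = 2.292

perplexity = math.exp(val_mlm_loss)
print(f"exp(loss) = {perplexity:.2f}")
```
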

---

## Downstream Task Performance (Linear Probe)

Frozen [CLS] embeddings were evaluated on 2,000 reviewed *Arabidopsis thaliana* proteins from UniProt SwissProt using a logistic-regression linear probe, and compared against the vanilla `facebook/esm2_t6_8M_UR50D` baseline.

| Task | Vanilla ESM-2 8M | PlantPLM-8M | Δ (percentage points) |
|---|---|---|---|
| Subcellular localization (9-class accuracy) | 91.6% | **93.7%** | +2.1 |
| GO-term prediction (macro-AUROC, top-50 terms) | 94.7% | **95.0%** | +0.3 |

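
A linear probe of this kind trains a logistic-regression classifier on fixed embeddings while the language model stays frozen. The sketch below shows the general recipe with scikit-learn; it is not the evaluation script behind the table. The random arrays are placeholders for real [CLS] embeddings and labels, and the 80/20 split and `max_iter` value are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Placeholder data: replace with frozen [CLS] embeddings (n_proteins x 320)
# and integer class labels (e.g. 9 localization classes).
X = rng.normal(size=(2000, 320)).astype(np.float32)
y = rng.integers(0, 9, size=2000)

# Assumed 80/20 split; the split used for the reported numbers is not stated.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# The probe is the only trained component; the embeddings are never updated.
probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)

acc = accuracy_score(y_test, probe.predict(X_test))
print(f"probe accuracy on placeholder data: {acc:.3f}")
```

With random placeholder inputs the accuracy should sit near chance; meaningful numbers require embeddings extracted from the model as shown in the Usage section.
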

---

## Usage

```python
from transformers import EsmForMaskedLM, EsmTokenizer
import torch

model = EsmForMaskedLM.from_pretrained("dipayan26/PlantPLM-8M")
tokenizer = EsmTokenizer.from_pretrained("dipayan26/PlantPLM-8M")

# --- Masked token prediction ---
sequence = "MSPQTETKASVGFKAGVKDYKLTYYTPEYETK"
inputs = tokenizer(sequence, return_tensors="pt")

# Mask one position. Token index 0 is [CLS], so index 5 is the 5th residue.
inputs["input_ids"][0, 5] = tokenizer.mask_token_id

with torch.no_grad():
    logits = model(**inputs).logits

# Locate the masked token and list the five most likely residues
masked_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[0, 1]
top5 = logits[0, masked_pos].topk(5)
print(tokenizer.convert_ids_to_tokens(top5.indices.tolist()))

# --- Sequence embedding ([CLS] token) ---
inputs = tokenizer(sequence, return_tensors="pt")
with torch.no_grad():
    hidden = model.esm(**inputs).last_hidden_state
cls_embedding = hidden[0, 0, :]  # shape: [320]
print("Embedding shape:", cls_embedding.shape)
```

---

## Intended Use

- **Plant protein function prediction**: GO term annotation, subcellular localization, signal peptide detection
- **Plant-specific protein embeddings**: clustering, retrieval, similarity search
- **Transfer learning starting point**: fine-tune on small labeled plant protein datasets
- **Baseline comparison**: benchmark against the larger PlantPLM-35M / 150M / 650M variants

## Out-of-scope Use

- Non-plant organisms: the model has been shifted toward Viridiplantae sequence statistics; use the original `facebook/esm2_t6_8M_UR50D` for general protein tasks
- Structure prediction: the model is not trained to predict structure; use ESMFold for that

---

## Limitations

- Trained for only about 0.12 passes over the plant corpus (800M of ~6.76B tokens); larger models in this collection see more of the data
- The 8M-parameter capacity limits representation richness; the 35M and 150M variants are recommended for downstream fine-tuning

---

## Citation

If you use this model, please cite:

```bibtex
@misc{sarkar2026plantplm,
  author       = {Sarkar, Dipayan},
  title        = {PlantPLM: Domain-Adaptive Pretraining of ESM-2 on Viridiplantae Proteins},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/dipayan26/PlantPLM-8M}},
}
```

---

<!-- ## Training Code

[github.com/Dipayan26/Plant-Protein-BERT](https://github.com/Dipayan26/Plant-Protein-BERT) -->