dipayan26/PlantPLM-8M

@@ -10,21 +10,21 @@ tags:
   - viridiplantae
   - masked-language-modeling
   - domain-adaptation
-base_model: facebook/esm2_t12_35M_UR50D
 datasets:
   - uniprot-trembl-viridiplantae
 pipeline_tag: fill-mask
 ---
-# PlantPLM-35M
 <img src="Plant_PLM_logo.png" alt="Alt Text" width="800">
-**ESM-2 35M parameter model continued-pretrained on Viridiplantae (plant) protein sequences.**
-This is a domain-adapted version of [`facebook/esm2_t12_35M_UR50D`](https://huggingface.co/facebook/esm2_t12_35M_UR50D), fine-tuned on a curated subset of UniProt TrEMBL containing only plant-kingdom proteins. The adaptation improves representation quality for plant-specific protein tasks compared to the general-purpose ESM-2 baseline.
 Part of the **[Plant-PLM](https://huggingface.co/collections/dipayan26/plant-plm)** collection: ESM-2 models at 8M, 35M, 150M, and 650M parameters, each adapted on the same plant protein corpus.
@@ -34,11 +34,11 @@ Part of the **[Plant-PLM](https://huggingface.co/collections/dipayan26/plant-plm
 | Property | Value |
 |---|---|
-| Base model | `facebook/esm2_t12_35M_UR50D` |
-| Architecture | ESM-2 · 12 layers · hidden=480 · heads=20 · FFN=1920 |
 | Position embeddings | Rotary (RoPE) |
 | Vocabulary | 33 tokens (20 standard + rare amino acids + special tokens) |
-| Parameters | 33.5M (full-parameter continued pretraining) |
 | Training objective | Masked Language Modeling (MLM, 15% masking) |
 ---
@@ -51,7 +51,7 @@ Part of the **[Plant-PLM](https://huggingface.co/collections/dipayan26/plant-plm
 | Sequences | **19,938,415** protein sequences |
 | Avg sequence length | 339 AA · median 291 AA |
 | Estimated total tokens | **~6.76 billion** amino acid tokens |
-| Tokens seen during training | **546 million** (≈ 0.08 passes over the full dataset) |
 ---
@@ -59,35 +59,33 @@ Part of the **[Plant-PLM](https://huggingface.co/collections/dipayan26/plant-plm
 | Hyperparameter | Value |
 |---|---|
-| Training steps | 55,000 optimizer steps |
-| Batch size | 64 sequences (32 per micro-batch × 2 gradient accumulation steps) |
 | Optimizer | AdamW · β=(0.9, 0.98) · ε=1e-8 · weight_decay=0.01 |
 | Learning rate | 2e-5 (20× lower than ESM-2 from-scratch to prevent catastrophic forgetting) |
 | LR schedule | Linear warmup (500 steps) → linear decay |
 | Gradient clipping | 1.0 |
-| Precision | 16-bit mixed (fp16 activations, fp32 optimizer states) |
 **Final metrics (validation set, 5% holdout):**
 | Metric | Value |
 |---|---|
-| `val/mlm_loss` | 2.075 |
-| `val/perplexity` | 7.96 |
 ---
 ## Downstream Task Performance (Linear Probe)
-Frozen [CLS] embeddings evaluated on 2,000 reviewed *Arabidopsis thaliana* proteins from UniProt SwissProt using a logistic regression linear probe. Compared against the vanilla `facebook/esm2_t12_35M_UR50D` baseline.
-| Task | Vanilla ESM-2 35M | PlantPLM-35M | Δ |
 |---|---|---|---|
-| Subcellular localization (9-class accuracy) | 91.87% | **94.28%** | +2.41% |
-| Subcellular localization (macro-F1) | 92.57% | **94.86%** | +2.29% |
-| GO-term prediction (macro-AUROC, top-50 terms) | 94.26% | **94.82%** | +0.56% |
-*Test set: 332 proteins (localization) · 396 proteins (GO terms) · 9 localization classes · 50 GO terms evaluated.*
 ---
@@ -97,8 +95,8 @@ Frozen [CLS] embeddings evaluated on 2,000 reviewed *Arabidopsis thaliana* prote
 from transformers import EsmForMaskedLM, EsmTokenizer
 import torch
-model = EsmForMaskedLM.from_pretrained("dipayan26/PlantPLM-35M")
-tokenizer = EsmTokenizer.from_pretrained("dipayan26/PlantPLM-35M")
 # --- Masked token prediction ---
 sequence = "MSPQTETKASVGFKAGVKDYKLTYYTPEYETK"
@@ -118,7 +116,7 @@ print(tokenizer.convert_ids_to_tokens(top5.indices.tolist()))
 inputs = tokenizer(sequence, return_tensors="pt")
 with torch.no_grad():
     hidden = model.esm(**inputs).last_hidden_state
-cls_embedding = hidden[0, 0, :]   # shape: [480]
 print("Embedding shape:", cls_embedding.shape)
 ```
@@ -129,19 +127,19 @@ print("Embedding shape:", cls_embedding.shape)
 - **Plant protein function prediction** — GO term annotation, subcellular localization, signal peptide detection
 - **Plant-specific protein embeddings** — clustering, retrieval, similarity search
 - **Transfer learning starting point** — fine-tune on small labeled plant protein datasets
-- **Baseline comparison** — benchmark against PlantPLM-8M / 150M / 650M variants
 ## Out-of-scope Use
-- Non-plant organisms — the model has been shifted toward Viridiplantae statistics; use the original `facebook/esm2_t12_35M_UR50D` for general protein tasks
 - Structural prediction — not trained for structure; use ESMFold for that
 ---
 ## Limitations
-- Trained for only 0.08 passes over the plant corpus (546M / 6.76B tokens) — larger models in this collection see more of the data
-- For highest downstream accuracy, the 150M variant is recommended
 ---
@@ -155,7 +153,7 @@ If you use this model, please cite:
   title        = {PlantPLM: Domain-Adaptive Pretraining of ESM-2 on Viridiplantae Proteins},
   year         = {2026},
   publisher    = {Hugging Face},
-  howpublished = {\url{https://huggingface.co/dipayan26/PlantPLM-35M}},
 }
 ```

   - viridiplantae
   - masked-language-modeling
   - domain-adaptation
+base_model: facebook/esm2_t6_8M_UR50D
 datasets:
   - uniprot-trembl-viridiplantae
 pipeline_tag: fill-mask
 ---
+# PlantPLM-8M
 <img src="Plant_PLM_logo.png" alt="Alt Text" width="800">
+**ESM-2 8M parameter model continued-pretrained on Viridiplantae (plant) protein sequences.**
+This is a domain-adapted version of [`facebook/esm2_t6_8M_UR50D`](https://huggingface.co/facebook/esm2_t6_8M_UR50D), fine-tuned on a curated subset of UniProt TrEMBL containing only plant-kingdom proteins. The adaptation improves representation quality for plant-specific protein tasks compared to the general-purpose ESM-2 baseline.
 Part of the **[Plant-PLM](https://huggingface.co/collections/dipayan26/plant-plm)** collection: ESM-2 models at 8M, 35M, 150M, and 650M parameters, each adapted on the same plant protein corpus.
 | Property | Value |
 |---|---|
+| Base model | `facebook/esm2_t6_8M_UR50D` |
+| Architecture | ESM-2 · 6 layers · hidden=320 · heads=20 · FFN=1280 |
 | Position embeddings | Rotary (RoPE) |
 | Vocabulary | 33 tokens (20 standard + rare amino acids + special tokens) |
+| Parameters | 7.5M (full-parameter continued pretraining) |
 | Training objective | Masked Language Modeling (MLM, 15% masking) |
 ---
 | Sequences | **19,938,415** protein sequences |
 | Avg sequence length | 339 AA · median 291 AA |
 | Estimated total tokens | **~6.76 billion** amino acid tokens |
+| Tokens seen during training | **800 million** (≈ 0.12 passes over the full dataset) |
 ---
 | Hyperparameter | Value |
 |---|---|
+| Token budget | 800M tokens (training stopped at budget, not epoch end) |
+| Batch size | 64 sequences |
 | Optimizer | AdamW · β=(0.9, 0.98) · ε=1e-8 · weight_decay=0.01 |
 | Learning rate | 2e-5 (20× lower than ESM-2 from-scratch to prevent catastrophic forgetting) |
 | LR schedule | Linear warmup (500 steps) → linear decay |
 | Gradient clipping | 1.0 |
+| Precision | 16-bit mixed (bf16 activations, fp32 optimizer states) |
 **Final metrics (validation set, 5% holdout):**
 | Metric | Value |
 |---|---|
+| `val/mlm_loss` | 2.292 |
+| `val/perplexity` | 9.92 |
+| `val/masked_token_acc` | 31.0% |
 ---
 ## Downstream Task Performance (Linear Probe)
+Frozen [CLS] embeddings evaluated on 2,000 reviewed *Arabidopsis thaliana* proteins from UniProt SwissProt using a logistic regression linear probe. Compared against the vanilla `facebook/esm2_t6_8M_UR50D` baseline.
+| Task | Vanilla ESM-2 8M | PlantPLM-8M | Δ |
 |---|---|---|---|
+| Subcellular localization (9-class accuracy) | 91.6% | **93.7%** | +2.1 pp |
+| GO-term prediction (macro-AUROC, top-50 terms) | 94.7% | **95.0%** | +0.3 pp |
 ---
 from transformers import EsmForMaskedLM, EsmTokenizer
 import torch
+model = EsmForMaskedLM.from_pretrained("dipayan26/PlantPLM-8M")
+tokenizer = EsmTokenizer.from_pretrained("dipayan26/PlantPLM-8M")
 # --- Masked token prediction ---
 sequence = "MSPQTETKASVGFKAGVKDYKLTYYTPEYETK"
 inputs = tokenizer(sequence, return_tensors="pt")
 with torch.no_grad():
     hidden = model.esm(**inputs).last_hidden_state
+cls_embedding = hidden[0, 0, :]   # shape: [320]
 print("Embedding shape:", cls_embedding.shape)
 ```
 - **Plant protein function prediction** — GO term annotation, subcellular localization, signal peptide detection
 - **Plant-specific protein embeddings** — clustering, retrieval, similarity search
 - **Transfer learning starting point** — fine-tune on small labeled plant protein datasets
+- **Baseline comparison** — benchmark against larger PlantPLM-35M / 150M / 650M variants
 ## Out-of-scope Use
+- Non-plant organisms — the model has been shifted toward Viridiplantae statistics; use the original `facebook/esm2_t6_8M_UR50D` for general protein tasks
 - Structural prediction — not trained for structure; use ESMFold for that
 ---
 ## Limitations
+- Trained for only 0.12 passes over the plant corpus (800M / 6.76B tokens) — larger models in this collection see more of the data
+- 8M capacity limits representation richness; the 35M and 150M variants are recommended for downstream fine-tuning
 ---
   title        = {PlantPLM: Domain-Adaptive Pretraining of ESM-2 on Viridiplantae Proteins},
   year         = {2026},
   publisher    = {Hugging Face},
+  howpublished = {\url{https://huggingface.co/dipayan26/PlantPLM-8M}},
 }
 ```
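
Review note: the figures in the updated card can be cross-checked with a few lines of arithmetic. The sketch below recomputes the fraction of the corpus covered by the token budget and the perplexity implied by the reported loss. It assumes perplexity is defined as exp(mean MLM loss in nats); the card does not say how `val/perplexity` was computed, and the helper names here are hypothetical.

```python
import math

# Figures copied from the updated model card.
TOTAL_TOKENS = 6.76e9   # estimated corpus size
TOKENS_SEEN = 800e6     # token budget for the 8M model
VAL_MLM_LOSS = 2.292    # reported val/mlm_loss
REPORTED_PPL = 9.92     # reported val/perplexity


def corpus_passes(tokens_seen: float, total_tokens: float) -> float:
    """Fraction of the corpus covered by the token budget."""
    return tokens_seen / total_tokens


def perplexity_from_loss(loss: float) -> float:
    """Perplexity, ASSUMING ppl = exp(mean cross-entropy in nats)."""
    return math.exp(loss)


passes = corpus_passes(TOKENS_SEEN, TOTAL_TOKENS)
implied_ppl = perplexity_from_loss(VAL_MLM_LOSS)
print(f"passes over corpus: {passes:.3f}")
print(f"perplexity implied by loss: {implied_ppl:.2f} (card reports {REPORTED_PPL})")
```

Under that assumption the implied perplexity comes out slightly below the reported 9.92. A small gap like this can appear when perplexity is averaged per batch instead of being derived from the mean loss, but the card does not specify which was done.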

config.json CHANGED

@@ -10,9 +10,9 @@
   "esmfold_config": null,
   "hidden_act": "gelu",
   "hidden_dropout_prob": 0.0,
-  "hidden_size": 480,
   "initializer_range": 0.02,
-  "intermediate_size": 1920,
   "is_decoder": false,
   "is_folding_model": false,
   "layer_norm_eps": 1e-05,
@@ -20,7 +20,7 @@
   "max_position_embeddings": 1026,
   "model_type": "esm",
   "num_attention_heads": 20,
-  "num_hidden_layers": 12,
   "pad_token_id": 1,
   "position_embedding_type": "rotary",
   "tie_word_embeddings": true,

   "esmfold_config": null,
   "hidden_act": "gelu",
   "hidden_dropout_prob": 0.0,
+  "hidden_size": 320,
   "initializer_range": 0.02,
+  "intermediate_size": 1280,
   "is_decoder": false,
   "is_folding_model": false,
   "layer_norm_eps": 1e-05,
   "max_position_embeddings": 1026,
   "model_type": "esm",
   "num_attention_heads": 20,
+  "num_hidden_layers": 6,
   "pad_token_id": 1,
   "position_embedding_type": "rotary",
   "tie_word_embeddings": true,
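
Review note: the new config values can be sanity-checked against the architecture row in the card. The sketch below derives the per-head width and a rough encoder-only parameter count from the four fields shown in this diff. The layer breakdown (biased Q/K/V/output projections, a two-layer biased FFN, two LayerNorms per layer) is an assumption about a standard transformer encoder, not something this diff states, and the function names are hypothetical.

```python
# Values from the updated config.json (8M model).
config = {
    "hidden_size": 320,
    "intermediate_size": 1280,
    "num_attention_heads": 20,
    "num_hidden_layers": 6,
}


def head_dim(cfg: dict) -> int:
    """Per-head width; hidden_size must divide evenly across heads."""
    hidden, heads = cfg["hidden_size"], cfg["num_attention_heads"]
    if hidden % heads != 0:
        raise ValueError("hidden_size is not divisible by num_attention_heads")
    return hidden // heads


def encoder_params(cfg: dict) -> int:
    """Rough encoder-only count under the ASSUMED layer layout above."""
    h, f = cfg["hidden_size"], cfg["intermediate_size"]
    attention = 4 * (h * h + h)          # Q, K, V, output projections with bias
    ffn = (h * f + f) + (f * h + h)      # up- and down-projection with bias
    layer_norms = 2 * (2 * h)            # two LayerNorms, weight + bias each
    return cfg["num_hidden_layers"] * (attention + ffn + layer_norms)


print("head dim:", head_dim(config))
print("FFN / hidden ratio:", config["intermediate_size"] / config["hidden_size"])
print("encoder parameters (estimate):", encoder_params(config))
```

The estimate lands near 7.4M, which is in line with the 7.5M stated in the card once embeddings and the LM head are added. Exact totals depend on implementation details not shown in this diff.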

model.safetensors CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:e3e197185c969c287a2c89d0adc33b577e56ab2058b134132dfae5a827ef0e78
-size 134030384

 version https://git-lfs.github.com/spec/v1
+oid sha256:035f4112003d7c47a40fe93df7b61a3a7dc8e103be122d7588328f363b2dbd0c
+size 30062528
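
Review note: the LFS pointer sizes give a quick check on the parameter counts quoted in the card. The sketch below assumes the weights are stored as 4-byte float32 values and ignores the small safetensors header, so it slightly overestimates; neither assumption is confirmed by this diff, and the helper name is hypothetical.

```python
BYTES_PER_FLOAT32 = 4


def approx_params_from_size(size_bytes: int, bytes_per_param: int = BYTES_PER_FLOAT32) -> float:
    """Approximate parameter count in millions; header bytes are counted as weights."""
    return size_bytes / bytes_per_param / 1e6


old_size = 134030384  # size removed by this diff (35M checkpoint)
new_size = 30062528   # size added by this diff (8M checkpoint)

print(f"old: ~{approx_params_from_size(old_size):.1f}M parameters")
print(f"new: ~{approx_params_from_size(new_size):.1f}M parameters")
```

Both estimates agree with the 33.5M and 7.5M figures in the old and new versions of the card.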