dipayan26 committed on
Commit ec78ee0 · verified · 1 Parent(s): 63c9e64

Upload folder using huggingface_hub

Files changed (3)
  1. README.md +26 -28
  2. config.json +3 -3
  3. model.safetensors +2 -2
README.md CHANGED
@@ -10,21 +10,21 @@ tags:
10
  - viridiplantae
11
  - masked-language-modeling
12
  - domain-adaptation
13
- base_model: facebook/esm2_t12_35M_UR50D
14
  datasets:
15
  - uniprot-trembl-viridiplantae
16
  pipeline_tag: fill-mask
17
  ---
18
 
19
- # PlantPLM-35M
20
 
21
  <img src="Plant_PLM_logo.png" alt="Alt Text" width="800">
22
 
23
 
24
 
25
- **ESM-2 35M parameter model continued-pretrained on Viridiplantae (plant) protein sequences.**
26
 
27
- This is a domain-adapted version of [`facebook/esm2_t12_35M_UR50D`](https://huggingface.co/facebook/esm2_t12_35M_UR50D), fine-tuned on a curated subset of UniProt TrEMBL containing only plant-kingdom proteins. The adaptation improves representation quality for plant-specific protein tasks compared to the general-purpose ESM-2 baseline.
28
 
29
  Part of the **[Plant-PLM](https://huggingface.co/collections/dipayan26/plant-plm)** - ESM-2 models at 8M, 35M, 150M, and 650M parameters, each adapted on the same plant protein corpus.
30
 
@@ -34,11 +34,11 @@ Part of the **[Plant-PLM](https://huggingface.co/collections/dipayan26/plant-plm
34
 
35
  | Property | Value |
36
  |---|---|
37
- | Base model | `facebook/esm2_t12_35M_UR50D` |
38
- | Architecture | ESM-2 · 12 layers · hidden=480 · heads=20 · FFN=1920 |
39
  | Position embeddings | Rotary (RoPE) |
40
  | Vocabulary | 33 tokens (20 standard + rare amino acids + special tokens) |
41
- | Parameters | 33.5M (full-parameter continued pretraining) |
42
  | Training objective | Masked Language Modeling (MLM, 15% masking) |
43
 
44
  ---
@@ -51,7 +51,7 @@ Part of the **[Plant-PLM](https://huggingface.co/collections/dipayan26/plant-plm
51
  | Sequences | **19,938,415** protein sequences |
52
  | Avg sequence length | 339 AA · median 291 AA |
53
  | Estimated total tokens | **~6.76 billion** amino acid tokens |
54
- | Tokens seen during training | **546 million** (≈ 0.08 passes over the full dataset) |
55
 
56
  ---
57
 
@@ -59,35 +59,33 @@ Part of the **[Plant-PLM](https://huggingface.co/collections/dipayan26/plant-plm
59
 
60
  | Hyperparameter | Value |
61
  |---|---|
62
- | Training steps | 55,000 optimizer steps |
63
- | Batch size | 64 sequences (32 per micro-batch × 2 gradient accumulation steps) |
64
  | Optimizer | AdamW · β=(0.9, 0.98) · ε=1e-8 · weight_decay=0.01 |
65
  | Learning rate | 2e-5 (20× lower than ESM-2 from-scratch to prevent catastrophic forgetting) |
66
  | LR schedule | Linear warmup (500 steps) → linear decay |
67
  | Gradient clipping | 1.0 |
68
- | Precision | 16-bit mixed (fp16 activations, fp32 optimizer states) |
69
 
70
 
71
  **Final metrics (validation set, 5% holdout):**
72
 
73
  | Metric | Value |
74
  |---|---|
75
- | `val/mlm_loss` | 2.075 |
76
- | `val/perplexity` | 7.96 |
 
77
 
78
  ---
79
 
80
  ## Downstream Task Performance (Linear Probe)
81
 
82
- Frozen [CLS] embeddings evaluated on 2,000 reviewed *Arabidopsis thaliana* proteins from UniProt SwissProt using a logistic regression linear probe. Compared against the vanilla `facebook/esm2_t12_35M_UR50D` baseline.
83
 
84
- | Task | Vanilla ESM-2 35M | PlantPLM-35M | Δ |
85
  |---|---|---|---|
86
- | Subcellular localization (9-class accuracy) | 91.87% | **94.28%** | +2.41% |
87
- | Subcellular localization (macro-F1) | 92.57% | **94.86%** | +2.29% |
88
- | GO-term prediction (macro-AUROC, top-50 terms) | 94.26% | **94.82%** | +0.56% |
89
-
90
- *Test set: 332 proteins (localization) · 396 proteins (GO terms) · 9 localization classes · 50 GO terms evaluated.*
91
 
92
  ---
93
 
@@ -97,8 +95,8 @@ Frozen [CLS] embeddings evaluated on 2,000 reviewed *Arabidopsis thaliana* prote
97
  from transformers import EsmForMaskedLM, EsmTokenizer
98
  import torch
99
 
100
- model = EsmForMaskedLM.from_pretrained("dipayan26/PlantPLM-35M")
101
- tokenizer = EsmTokenizer.from_pretrained("dipayan26/PlantPLM-35M")
102
 
103
  # --- Masked token prediction ---
104
  sequence = "MSPQTETKASVGFKAGVKDYKLTYYTPEYETK"
@@ -118,7 +116,7 @@ print(tokenizer.convert_ids_to_tokens(top5.indices.tolist()))
118
  inputs = tokenizer(sequence, return_tensors="pt")
119
  with torch.no_grad():
120
  hidden = model.esm(**inputs).last_hidden_state
121
- cls_embedding = hidden[0, 0, :] # shape: [480]
122
  print("Embedding shape:", cls_embedding.shape)
123
  ```
124
 
@@ -129,19 +127,19 @@ print("Embedding shape:", cls_embedding.shape)
129
  - **Plant protein function prediction** — GO term annotation, subcellular localization, signal peptide detection
130
  - **Plant-specific protein embeddings** — clustering, retrieval, similarity search
131
  - **Transfer learning starting point** — fine-tune on small labeled plant protein datasets
132
- - **Baseline comparison** — benchmark against PlantPLM-8M / 150M / 650M variants
133
 
134
  ## Out-of-scope Use
135
 
136
- - Non-plant organisms — the model has been shifted toward Viridiplantae statistics; use the original `facebook/esm2_t12_35M_UR50D` for general protein tasks
137
  - Structural prediction — not trained for structure; use ESMFold for that
138
 
139
  ---
140
 
141
  ## Limitations
142
 
143
- - Trained for only 0.08 passes over the plant corpus (546M / 6.76B tokens) — larger models in this collection see more of the data
144
- - For highest downstream accuracy, the 150M variant is recommended
145
 
146
  ---
147
 
@@ -155,7 +153,7 @@ If you use this model, please cite:
155
  title = {PlantPLM: Domain-Adaptive Pretraining of ESM-2 on Viridiplantae Proteins},
156
  year = {2026},
157
  publisher = {Hugging Face},
158
- howpublished = {\url{https://huggingface.co/dipayan26/PlantPLM-35M}},
159
  }
160
  ```
161
 
 
10
  - viridiplantae
11
  - masked-language-modeling
12
  - domain-adaptation
13
+ base_model: facebook/esm2_t6_8M_UR50D
14
  datasets:
15
  - uniprot-trembl-viridiplantae
16
  pipeline_tag: fill-mask
17
  ---
18
 
19
+ # PlantPLM-8M
20
 
21
  <img src="Plant_PLM_logo.png" alt="Alt Text" width="800">
22
 
23
 
24
 
25
+ **ESM-2 8M parameter model continued-pretrained on Viridiplantae (plant) protein sequences.**
26
 
27
+ This is a domain-adapted version of [`facebook/esm2_t6_8M_UR50D`](https://huggingface.co/facebook/esm2_t6_8M_UR50D), fine-tuned on a curated subset of UniProt TrEMBL containing only plant-kingdom proteins. The adaptation improves representation quality for plant-specific protein tasks compared to the general-purpose ESM-2 baseline.
28
 
29
  Part of the **[Plant-PLM](https://huggingface.co/collections/dipayan26/plant-plm)** - ESM-2 models at 8M, 35M, 150M, and 650M parameters, each adapted on the same plant protein corpus.
30
 
 
34
 
35
  | Property | Value |
36
  |---|---|
37
+ | Base model | `facebook/esm2_t6_8M_UR50D` |
38
+ | Architecture | ESM-2 · 6 layers · hidden=320 · heads=20 · FFN=1280 |
39
  | Position embeddings | Rotary (RoPE) |
40
  | Vocabulary | 33 tokens (20 standard + rare amino acids + special tokens) |
41
+ | Parameters | 7.5M (full-parameter continued pretraining) |
42
  | Training objective | Masked Language Modeling (MLM, 15% masking) |
43
 
44
  ---
 
51
  | Sequences | **19,938,415** protein sequences |
52
  | Avg sequence length | 339 AA · median 291 AA |
53
  | Estimated total tokens | **~6.76 billion** amino acid tokens |
54
+ | Tokens seen during training | **800 million** (≈ 0.12 passes over the full dataset) |
55
 
56
  ---
57
 
 
59
 
60
  | Hyperparameter | Value |
61
  |---|---|
62
+ | Token budget | 800M tokens (training stopped at budget, not epoch end) |
63
+ | Batch size | 64 sequences |
64
  | Optimizer | AdamW · β=(0.9, 0.98) · ε=1e-8 · weight_decay=0.01 |
65
  | Learning rate | 2e-5 (20× lower than ESM-2 from-scratch to prevent catastrophic forgetting) |
66
  | LR schedule | Linear warmup (500 steps) → linear decay |
67
  | Gradient clipping | 1.0 |
68
+ | Precision | 16-bit mixed (bf16 activations, fp32 optimizer states) |
69
 
70
 
71
  **Final metrics (validation set, 5% holdout):**
72
 
73
  | Metric | Value |
74
  |---|---|
75
+ | `val/mlm_loss` | 2.292 |
76
+ | `val/perplexity` | 9.92 |
77
+ | `val/masked_token_acc` | 31.0% |
78
 
79
  ---
80
 
81
  ## Downstream Task Performance (Linear Probe)
82
 
83
+ Frozen [CLS] embeddings evaluated on 2,000 reviewed *Arabidopsis thaliana* proteins from UniProt SwissProt using a logistic regression linear probe. Compared against the vanilla `facebook/esm2_t6_8M_UR50D` baseline.
84
 
85
+ | Task | Vanilla ESM-2 8M | PlantPLM-8M | Δ |
86
  |---|---|---|---|
87
+ | Subcellular localization (9-class accuracy) | 91.6% | **93.7%** | +2.1% |
88
+ | GO-term prediction (macro-AUROC, top-50 terms) | 94.7% | **95.0%** | +0.3% |
 
 
 
89
 
90
  ---
91
 
 
95
  from transformers import EsmForMaskedLM, EsmTokenizer
96
  import torch
97
 
98
+ model = EsmForMaskedLM.from_pretrained("dipayan26/PlantPLM-8M")
99
+ tokenizer = EsmTokenizer.from_pretrained("dipayan26/PlantPLM-8M")
100
 
101
  # --- Masked token prediction ---
102
  sequence = "MSPQTETKASVGFKAGVKDYKLTYYTPEYETK"
 
116
  inputs = tokenizer(sequence, return_tensors="pt")
117
  with torch.no_grad():
118
  hidden = model.esm(**inputs).last_hidden_state
119
+ cls_embedding = hidden[0, 0, :] # shape: [320]
120
  print("Embedding shape:", cls_embedding.shape)
121
  ```
122
 
 
127
  - **Plant protein function prediction** — GO term annotation, subcellular localization, signal peptide detection
128
  - **Plant-specific protein embeddings** — clustering, retrieval, similarity search
129
  - **Transfer learning starting point** — fine-tune on small labeled plant protein datasets
130
+ - **Baseline comparison** — benchmark against larger PlantPLM-35M / 150M / 650M variants
131
 
132
  ## Out-of-scope Use
133
 
134
+ - Non-plant organisms — the model has been shifted toward Viridiplantae statistics; use the original `facebook/esm2_t6_8M_UR50D` for general protein tasks
135
  - Structural prediction — not trained for structure; use ESMFold for that
136
 
137
  ---
138
 
139
  ## Limitations
140
 
141
+ - Trained for only 0.12 passes over the plant corpus (800M / 6.76B tokens) — larger models in this collection see more of the data
142
+ - 8M capacity limits representation richness; the 35M and 150M variants are recommended for downstream fine-tuning
143
 
144
  ---
145
 
 
153
  title = {PlantPLM: Domain-Adaptive Pretraining of ESM-2 on Viridiplantae Proteins},
154
  year = {2026},
155
  publisher = {Hugging Face},
156
+ howpublished = {\url{https://huggingface.co/dipayan26/PlantPLM-8M}},
157
  }
158
  ```
159
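
The token and perplexity figures in the README hunks above can be cross-checked with a few lines of arithmetic. The sketch below is not part of the commit. It assumes perplexity is the exponential of the mean MLM loss, and the helper names are illustrative only.

```python
import math

# "~6.76 billion" amino acid tokens, taken from the dataset table in the README.
CORPUS_TOKENS = 6.76e9


def passes_over_corpus(tokens_seen, corpus_tokens=CORPUS_TOKENS):
    """Fraction of the corpus covered by the training token budget."""
    return tokens_seen / corpus_tokens


def perplexity_from_loss(mlm_loss):
    """Perplexity under the assumption that it equals exp(mean loss)."""
    return math.exp(mlm_loss)


# (label, tokens seen, val/mlm_loss) as stated on each side of the diff.
cards = [
    ("35M card (removed lines)", 546e6, 2.075),
    ("8M card (added lines)", 800e6, 2.292),
]

for label, tokens_seen, loss in cards:
    print(
        f"{label}: {passes_over_corpus(tokens_seen):.2f} passes, "
        f"perplexity ~{perplexity_from_loss(loss):.2f}"
    )
```

Under that assumption the removed 35M values agree with each other (exp(2.075) is about 7.96), while the added 8M loss of 2.292 gives roughly 9.89 rather than the reported 9.92. A gap this small could arise if perplexity was averaged per batch instead of computed from the mean loss, but the commit does not say which was done.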
 
config.json CHANGED
@@ -10,9 +10,9 @@
10
  "esmfold_config": null,
11
  "hidden_act": "gelu",
12
  "hidden_dropout_prob": 0.0,
13
- "hidden_size": 480,
14
  "initializer_range": 0.02,
15
- "intermediate_size": 1920,
16
  "is_decoder": false,
17
  "is_folding_model": false,
18
  "layer_norm_eps": 1e-05,
@@ -20,7 +20,7 @@
20
  "max_position_embeddings": 1026,
21
  "model_type": "esm",
22
  "num_attention_heads": 20,
23
- "num_hidden_layers": 12,
24
  "pad_token_id": 1,
25
  "position_embedding_type": "rotary",
26
  "tie_word_embeddings": true,
 
10
  "esmfold_config": null,
11
  "hidden_act": "gelu",
12
  "hidden_dropout_prob": 0.0,
13
+ "hidden_size": 320,
14
  "initializer_range": 0.02,
15
+ "intermediate_size": 1280,
16
  "is_decoder": false,
17
  "is_folding_model": false,
18
  "layer_norm_eps": 1e-05,
 
20
  "max_position_embeddings": 1026,
21
  "model_type": "esm",
22
  "num_attention_heads": 20,
23
+ "num_hidden_layers": 6,
24
  "pad_token_id": 1,
25
  "position_embedding_type": "rotary",
26
  "tie_word_embeddings": true,
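
The three config.json values changed above (`hidden_size`, `intermediate_size`, `num_hidden_layers`) can be related to the parameter counts quoted in the README. The following is a rough sketch, not the exact count of the Hugging Face implementation. It assumes a standard transformer encoder with biased linear layers, two LayerNorms per block, a final LayerNorm, an LM head made of one dense layer plus a LayerNorm, and output weights tied to the 33-token embedding. It ignores any auxiliary heads or unused embedding tables. `estimate_esm_params` is a hypothetical helper.

```python
def estimate_esm_params(hidden_size, intermediate_size, num_layers, vocab_size=33):
    """Approximate parameter count for an ESM-2-style encoder (assumptions in the text)."""
    h, i = hidden_size, intermediate_size

    attention = 4 * (h * h + h)               # Q, K, V and output projections, with biases
    feed_forward = (h * i + i) + (i * h + h)  # up- and down-projection, with biases
    layer_norms = 2 * (2 * h)                 # two LayerNorms per block (weight + bias)
    per_layer = attention + feed_forward + layer_norms

    embeddings = vocab_size * h               # token embeddings, tied with the output projection
    final_norm = 2 * h                        # LayerNorm after the last block
    lm_head = (h * h + h) + 2 * h + vocab_size  # dense + LayerNorm + output bias

    return num_layers * per_layer + embeddings + final_norm + lm_head


# Values from the removed (35M) and added (8M) sides of the config.json diff.
old_params = estimate_esm_params(hidden_size=480, intermediate_size=1920, num_layers=12)
new_params = estimate_esm_params(hidden_size=320, intermediate_size=1280, num_layers=6)

print(f"35M config: ~{old_params / 1e6:.1f}M parameters")
print(f"8M config:  ~{new_params / 1e6:.1f}M parameters")

# Head dimension stays an integer because num_attention_heads is 20 in both configs.
print("head dim:", 480 // 20, "->", 320 // 20)
```

Under these assumptions the estimates land close to the 33.5M and 7.5M figures in the README tables, which suggests the README and config.json changes are consistent with each other.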
model.safetensors CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:e3e197185c969c287a2c89d0adc33b577e56ab2058b134132dfae5a827ef0e78
3
- size 134030384
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:035f4112003d7c47a40fe93df7b61a3a7dc8e103be122d7588328f363b2dbd0c
3
+ size 30062528
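
The Git LFS pointer only records a hash and a byte size, but the size alone gives a quick plausibility check on the checkpoint swap. The sketch below assumes the weights are stored as fp32 (4 bytes per parameter) and ignores the small safetensors header; neither assumption is stated in the commit. The helper names are illustrative.

```python
def parse_lfs_pointer(text):
    """Parse the 'key value' lines of a Git LFS pointer file into a dict."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.strip().partition(" ")
        fields[key] = value
    return fields


def approx_params_from_size(size_bytes, bytes_per_param=4):
    """Rough parameter count, assuming fp32 storage and no header overhead."""
    return size_bytes / bytes_per_param


# New-side pointer, copied from the model.safetensors hunk above.
new_pointer = """
version https://git-lfs.github.com/spec/v1
oid sha256:035f4112003d7c47a40fe93df7b61a3a7dc8e103be122d7588328f363b2dbd0c
size 30062528
"""

fields = parse_lfs_pointer(new_pointer)
new_size = int(fields["size"])
old_size = 134030384  # size from the removed pointer lines

print(f"new checkpoint: ~{approx_params_from_size(new_size) / 1e6:.1f}M parameters")
print(f"old checkpoint: ~{approx_params_from_size(old_size) / 1e6:.1f}M parameters")
```

With those assumptions, the new file size corresponds to roughly 7.5M parameters and the old one to roughly 33.5M, matching the move from the 35M to the 8M model described in the README changes.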