# Plant Protein BERT
A BERT-based protein language model pretrained on plant protein sequences from UniProt/Swiss-Prot using Masked Language Modeling (MLM).
## Model Details
| Property | Value |
|---|---|
| Architecture | BERT (encoder-only transformer) |
| Model Type | Encoder (Embeddings) |
| Parameters | 19,190,272 (19.2M) |
| Hidden Dim | 512 |
| Layers | 6 |
| Attention Heads | 8 |
| MLP Ratio | 4x |
| Max Sequence Length | 512 |
| Vocabulary Size | 25 (20 amino acids + 5 special tokens) |
| Dropout | 0.1 |
## Training Details
| Property | Value |
|---|---|
| Training Data | UniProt/Swiss-Prot plant protein sequences |
| Pretraining Objective | Masked Language Modeling (15% masking) |
| Checkpoint Epoch | 4 |
| Global Step | 30 |
| Best Validation Loss | N/A |
| Precision | FP16 (AMP) |
| Framework | PyTorch + HuggingFace Transformers |
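The masking objective can be sketched in a few lines. This is a minimal illustration of standard BERT-style MLM at 15% masking; the 80/10/10 mask/random/keep split is the BERT default, assumed here rather than taken from this model's training code. Special tokens are never masked, and unmasked positions get the label `-100` so the loss ignores them.

```python
import random

# Token IDs from the model's vocabulary (see the Vocabulary section).
PAD, CLS, SEP, MASK, UNK = 0, 1, 2, 3, 4
AMINO_ACID_IDS = list(range(5, 25))  # A..Y

def mask_tokens(token_ids, mask_prob=0.15, rng=None):
    """Apply BERT-style masking: of the selected residues, 80% become
    [MASK], 10% a random amino acid, and 10% stay unchanged."""
    rng = rng or random.Random(0)
    inputs = list(token_ids)
    labels = [-100] * len(token_ids)  # -100 = position ignored by the loss
    for i, tok in enumerate(token_ids):
        if tok in (PAD, CLS, SEP):    # never mask special tokens
            continue
        if rng.random() < mask_prob:
            labels[i] = tok           # model must predict the original residue
            r = rng.random()
            if r < 0.8:
                inputs[i] = MASK
            elif r < 0.9:
                inputs[i] = rng.choice(AMINO_ACID_IDS)
            # else: keep the original token unchanged
    return inputs, labels
```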
## Usage

### Generate Protein Embeddings
```python
from transformers import AutoModel, AutoTokenizer
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("dipayan26/plant-protein-bert", trust_remote_code=True)
model = AutoModel.from_pretrained("dipayan26/plant-protein-bert", trust_remote_code=True)
model.eval()

# Encode a protein sequence (truncate to the model's 512-token limit)
sequence = "MKTLLSGGVVITQGIVAALAVAAYKSPAMDLYRFGPQNVLDAEQATRKA"
inputs = tokenizer(sequence, return_tensors="pt", padding="max_length", truncation=True, max_length=512)

with torch.no_grad():
    outputs = model(**inputs)

# CLS embedding (recommended for whole-sequence tasks)
cls_embedding = outputs.last_hidden_state[:, 0, :]  # shape: (1, 512)

# Mean pooling (alternative) -- average only over real tokens via the attention mask
mask = inputs["attention_mask"].unsqueeze(-1).float()
mean_embedding = (outputs.last_hidden_state * mask).sum(1) / mask.sum(1)

print(f"CLS embedding shape: {cls_embedding.shape}")
print(f"Mean embedding shape: {mean_embedding.shape}")
```
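The mean-pooling step divides by the count of real tokens, not the padded length. A toy example (made-up 2-dim hidden states standing in for real model outputs) shows why the attention mask matters:

```python
import torch

# Hidden states for 2 real tokens and 2 padding tokens: shape (1, 4, 2).
hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]])
mask = torch.tensor([[1, 1, 0, 0]])  # 1 = real token, 0 = padding

m = mask.unsqueeze(-1).float()                   # (1, 4, 1)
mean_embedding = (hidden * m).sum(1) / m.sum(1)  # divide by number of real tokens

print(mean_embedding)  # tensor([[2., 3.]]) -- mean of the two real tokens only
```

A naive `hidden.mean(1)` would average in the padding positions and give `[[4., 5.]]` here, contaminating the embedding.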
## Vocabulary
The tokenizer uses a character-level vocabulary of 25 tokens:
| Token | ID | Token | ID |
|---|---|---|---|
| [PAD] | 0 | [CLS] | 1 |
| [SEP] | 2 | [MASK] | 3 |
| [UNK] | 4 | A | 5 |
| C | 6 | D | 7 |
| E | 8 | F | 9 |
| G | 10 | H | 11 |
| I | 12 | K | 13 |
| L | 14 | M | 15 |
| N | 16 | P | 17 |
| Q | 18 | R | 19 |
| S | 20 | T | 21 |
| V | 22 | W | 23 |
| Y | 24 | | |
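The mapping above can be reconstructed in a few lines: 5 special tokens followed by the 20 standard amino acids in alphabetical order. The `encode` helper below is illustrative, not part of the released tokenizer:

```python
# Character-level vocabulary as listed in the table above.
SPECIALS = ["[PAD]", "[CLS]", "[SEP]", "[MASK]", "[UNK]"]
AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")

vocab = {tok: i for i, tok in enumerate(SPECIALS + AMINO_ACIDS)}

def encode(sequence: str) -> list[int]:
    """Encode a protein sequence as [CLS] residues... [SEP], mapping
    any non-standard character to [UNK]."""
    ids = [vocab["[CLS]"]]
    ids += [vocab.get(ch, vocab["[UNK]"]) for ch in sequence]
    ids.append(vocab["[SEP]"])
    return ids

print(encode("ACD"))  # [1, 5, 6, 7, 2]
```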
## Intended Use
This model is designed for:
- Protein embedding generation for downstream tasks (classification, clustering, similarity search)
- Masked residue prediction for protein engineering and variant effect prediction
- Transfer learning for plant-specific protein tasks
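For the similarity-search use case, one possible sketch of comparing embeddings by cosine similarity. `top_k_similar` is a hypothetical helper, and the toy 4-dim vectors stand in for the model's 512-dim CLS embeddings:

```python
import torch
import torch.nn.functional as F

def top_k_similar(query: torch.Tensor, database: torch.Tensor, k: int = 3):
    """Return indices of the k database embeddings most similar to the
    query, ranked by cosine similarity."""
    sims = F.cosine_similarity(query.unsqueeze(0), database, dim=-1)
    return sims.topk(k).indices.tolist()

# Toy embeddings in place of real model outputs.
db = torch.tensor([[1.0, 0.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0, 0.0],
                   [0.9, 0.1, 0.0, 0.0]])
query = torch.tensor([1.0, 0.0, 0.0, 0.0])

print(top_k_similar(query, db, k=2))  # [0, 2]
```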
## Limitations
- Trained primarily on plant protein sequences — may not generalize well to non-plant organisms
- Character-level tokenization (no BPE) — each amino acid is a single token
- Maximum sequence length of 512 tokens (including special tokens)
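A common workaround for the 512-token limit (an assumption here, not something this model card prescribes) is a sliding window over long sequences, averaging the per-chunk embeddings downstream. The sketch below reserves 2 tokens for [CLS] and [SEP]:

```python
def chunk_sequence(sequence: str, window: int = 510, overlap: int = 128):
    """Split a long protein sequence into overlapping windows so each
    chunk fits within the 512-token limit once [CLS] and [SEP] are added."""
    if len(sequence) <= window:
        return [sequence]
    step = window - overlap
    return [sequence[i:i + window] for i in range(0, len(sequence) - overlap, step)]

chunks = chunk_sequence("M" * 1000)
print([len(c) for c in chunks])  # [510, 510, 236]
```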
## Citation
```bibtex
@misc{plant-protein-bert,
  title={Plant Protein BERT: A Protein Language Model for Plant Proteomics},
  year={2026},
  publisher={HuggingFace},
}
```