Plant Protein BERT

A BERT-based protein language model pretrained on plant protein sequences from UniProt/Swiss-Prot using Masked Language Modeling (MLM).

Model Details

| Property | Value |
|----------|-------|
| Architecture | BERT (encoder-only transformer) |
| Model Type | Encoder (embeddings) |
| Parameters | 19,190,272 (19.2M) |
| Hidden Dim | 512 |
| Layers | 6 |
| Attention Heads | 8 |
| MLP Ratio | 4x |
| Max Sequence Length | 512 |
| Vocabulary Size | 25 (20 amino acids + 5 special tokens) |
| Dropout | 0.1 |
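The 19.2M figure can be reproduced from the hyperparameters above. The arithmetic below is a sketch assuming a standard BERT encoder layout with learned positional embeddings, biases on all linear layers, and no token-type embeddings or pooler head (the sum only matches exactly under those assumptions):

```python
# Sketch: reconstruct the parameter count from the model details table.
hidden, layers, vocab, max_len, mlp_ratio = 512, 6, 25, 512, 4
ffn = hidden * mlp_ratio  # 2048

# Token embeddings + learned position embeddings + embedding LayerNorm
embeddings = vocab * hidden + max_len * hidden + 2 * hidden

attention = 4 * (hidden * hidden + hidden)                    # Q, K, V, output projections (+ biases)
feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)  # two FFN projections (+ biases)
layer_norms = 2 * (2 * hidden)                                 # two LayerNorms per layer
per_layer = attention + feed_forward + layer_norms

total = embeddings + layers * per_layer
print(total)  # 19190272
```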

Training Details

| Property | Value |
|----------|-------|
| Training Data | UniProt/Swiss-Prot plant protein sequences |
| Pretraining Objective | Masked Language Modeling (15% masking) |
| Checkpoint | Epoch 4 |
| Global Step | 30 |
| Best Validation Loss | N/A |
| Precision | FP16 (AMP) |
| Framework | PyTorch + HuggingFace Transformers |
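The table states a 15% masking rate; the original BERT recipe additionally splits selected positions 80/10/10 between `[MASK]`, a random token, and the unchanged token. Assuming this model follows that standard recipe (the split is not stated above), the corruption step can be sketched as:

```python
import random

AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")

def mask_for_mlm(tokens, mask_rate=0.15, rng=None):
    """Standard BERT-style MLM corruption (sketch).

    Each position is selected with probability `mask_rate`. Of the selected
    positions, 80% become [MASK], 10% a random residue, 10% stay unchanged.
    Returns (masked_tokens, labels); labels hold the original residue at
    selected positions and None elsewhere.
    """
    rng = rng or random.Random(0)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_rate:
            labels.append(tok)
            r = rng.random()
            if r < 0.8:
                masked.append("[MASK]")
            elif r < 0.9:
                masked.append(rng.choice(AMINO_ACIDS))
            else:
                masked.append(tok)
        else:
            labels.append(None)
            masked.append(tok)
    return masked, labels

seq = list("MKTLLSGGVVITQGIVAALAVAAYKSPAMDLYRFGPQNVLDAEQATRKA")
masked, labels = mask_for_mlm(seq)
print(sum(l is not None for l in labels), "positions corrupted")
```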

Usage

Generate Protein Embeddings

```python
from transformers import AutoModel, AutoTokenizer
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("dipayan26/plant-protein-bert", trust_remote_code=True)
model = AutoModel.from_pretrained("dipayan26/plant-protein-bert", trust_remote_code=True)
model.eval()

# Encode a protein sequence (truncate anything beyond the 512-token cap)
sequence = "MKTLLSGGVVITQGIVAALAVAAYKSPAMDLYRFGPQNVLDAEQATRKA"
inputs = tokenizer(sequence, return_tensors="pt", padding="max_length", truncation=True, max_length=512)

with torch.no_grad():
    outputs = model(**inputs)

# CLS embedding (recommended for whole-sequence tasks)
cls_embedding = outputs.last_hidden_state[:, 0, :]  # shape: (1, 512)

# Mean pooling over non-padding tokens (alternative)
mask = inputs["attention_mask"].unsqueeze(-1).float()
mean_embedding = (outputs.last_hidden_state * mask).sum(1) / mask.sum(1)

print(f"CLS embedding shape: {cls_embedding.shape}")
print(f"Mean embedding shape: {mean_embedding.shape}")
```

Vocabulary

The tokenizer uses a character-level vocabulary of 25 tokens:

| Token | ID | Token | ID |
|-------|----|-------|----|
| [PAD] | 0 | [CLS] | 1 |
| [SEP] | 2 | [MASK] | 3 |
| [UNK] | 4 | A | 5 |
| C | 6 | D | 7 |
| E | 8 | F | 9 |
| G | 10 | H | 11 |
| I | 12 | K | 13 |
| L | 14 | M | 15 |
| N | 16 | P | 17 |
| Q | 18 | R | 19 |
| S | 20 | T | 21 |
| V | 22 | W | 23 |
| Y | 24 | | |
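Because the vocabulary is character-level, the table above translates directly into a lookup dict. The sketch below illustrates the mapping (the `encode` helper is illustrative, not the repo's actual tokenizer code, which is loaded via `trust_remote_code`):

```python
# Rebuild the 25-token vocabulary from the table above.
SPECIALS = ["[PAD]", "[CLS]", "[SEP]", "[MASK]", "[UNK]"]
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
VOCAB = {tok: i for i, tok in enumerate(SPECIALS + list(AMINO_ACIDS))}

def encode(sequence):
    """Character-level encoding: [CLS], one ID per residue, [SEP]; unknowns -> [UNK]."""
    ids = [VOCAB["[CLS]"]]
    ids += [VOCAB.get(ch, VOCAB["[UNK]"]) for ch in sequence]
    ids.append(VOCAB["[SEP]"])
    return ids

print(encode("ACDX"))  # [1, 5, 6, 7, 4, 2] -- X is not a standard residue, so it maps to [UNK]
```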

Intended Use

This model is designed for:

  • Protein embedding generation for downstream tasks (classification, clustering, similarity search)
  • Masked residue prediction for protein engineering and variant effect prediction
  • Transfer learning for plant-specific protein tasks

Limitations

  • Trained primarily on plant protein sequences — may not generalize well to non-plant organisms
  • Character-level tokenization (no BPE) — each amino acid is a single token
  • Maximum sequence length of 512 tokens (including special tokens)
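For proteins longer than the 512-token cap, one common workaround is to embed overlapping windows and average the per-chunk embeddings. A minimal sketch (the window and stride values are illustrative choices, not part of this model's release):

```python
def chunk_sequence(seq, max_residues=510, stride=255):
    """Split a long protein into overlapping windows that fit the model.

    max_residues leaves room for [CLS] and [SEP] inside the 512-token limit.
    Each chunk can be embedded separately and the embeddings mean-pooled.
    """
    if len(seq) <= max_residues:
        return [seq]
    chunks = []
    for start in range(0, len(seq), stride):
        chunks.append(seq[start:start + max_residues])
        if start + max_residues >= len(seq):
            break
    return chunks

chunks = chunk_sequence("A" * 1200)
print([len(c) for c in chunks])  # [510, 510, 510, 435]
```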

Citation

```bibtex
@misc{plant-protein-bert,
  title={Plant Protein BERT: A Protein Language Model for Plant Proteomics},
  year={2026},
  publisher={HuggingFace},
}
```