# Plant Protein BERT
A BERT-based protein language model pretrained on plant protein sequences from UniProt/Swiss-Prot using Masked Language Modeling (MLM).
## Model Details
| Property | Value |
|---|---|
| Architecture | BERT (encoder-only transformer) |
| Model Type | Encoder (Embeddings) |
| Parameters | 19,190,272 (19.2M) |
| Hidden Dim | 512 |
| Layers | 6 |
| Attention Heads | 8 |
| MLP Ratio | 4x |
| Max Sequence Length | 512 |
| Vocabulary Size | 25 (20 amino acids + 5 special tokens) |
| Dropout | 0.1 |
## Training Details
| Property | Value |
|---|---|
| Training Data | UniProt/Swiss-Prot plant protein sequences |
| Pretraining Objective | Masked Language Modeling (15% masking) |
| Checkpoint Epoch | 4 |
| Global Step | 30 |
| Best Validation Loss | N/A |
| Precision | FP16 (AMP) |
| Framework | PyTorch + HuggingFace Transformers |
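The masking objective can be sketched in a few lines. This is a minimal illustration of standard BERT-style MLM at 15% masking; the 80/10/10 mask/random/keep split is the BERT default, assumed here rather than taken from this model's training code. Special tokens are never masked, and unmasked positions get the label `-100` so the loss ignores them.

```python
import random

# Token IDs from the model's vocabulary (see the Vocabulary section).
PAD, CLS, SEP, MASK, UNK = 0, 1, 2, 3, 4
AMINO_ACID_IDS = list(range(5, 25))  # A..Y

def mask_tokens(token_ids, mask_prob=0.15, rng=None):
    """Apply BERT-style masking: of the selected residues, 80% become
    [MASK], 10% a random amino acid, and 10% stay unchanged."""
    rng = rng or random.Random(0)
    inputs = list(token_ids)
    labels = [-100] * len(token_ids)  # -100 = position ignored by the loss
    for i, tok in enumerate(token_ids):
        if tok in (PAD, CLS, SEP):    # never mask special tokens
            continue
        if rng.random() < mask_prob:
            labels[i] = tok           # model must predict the original residue
            r = rng.random()
            if r < 0.8:
                inputs[i] = MASK
            elif r < 0.9:
                inputs[i] = rng.choice(AMINO_ACID_IDS)
            # else: keep the original token unchanged
    return inputs, labels
```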
## Usage

### Generate Protein Embeddings
```python
from transformers import AutoModel, AutoTokenizer
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("dipayan26/plant-protein-bert", trust_remote_code=True)
model = AutoModel.from_pretrained("dipayan26/plant-protein-bert", trust_remote_code=True)
model.eval()

# Encode a protein sequence (truncate to the model's 512-token limit)
sequence = "MKTLLSGGVVITQGIVAALAVAAYKSPAMDLYRFGPQNVLDAEQATRKA"
inputs = tokenizer(sequence, return_tensors="pt", padding="max_length", truncation=True, max_length=512)

with torch.no_grad():
    outputs = model(**inputs)

# CLS embedding (recommended for whole-sequence tasks)
cls_embedding = outputs.last_hidden_state[:, 0, :]  # shape: (1, 512)

# Mean pooling (alternative) -- average only over real tokens via the attention mask
mask = inputs["attention_mask"].unsqueeze(-1).float()
mean_embedding = (outputs.last_hidden_state * mask).sum(1) / mask.sum(1)

print(f"CLS embedding shape: {cls_embedding.shape}")
print(f"Mean embedding shape: {mean_embedding.shape}")
```
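The mean-pooling step divides by the count of real tokens, not the padded length. A toy example (made-up 2-dim hidden states standing in for real model outputs) shows why the attention mask matters:

```python
import torch

# Hidden states for 2 real tokens and 2 padding tokens: shape (1, 4, 2).
hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]])
mask = torch.tensor([[1, 1, 0, 0]])  # 1 = real token, 0 = padding

m = mask.unsqueeze(-1).float()                   # (1, 4, 1)
mean_embedding = (hidden * m).sum(1) / m.sum(1)  # divide by number of real tokens

print(mean_embedding)  # tensor([[2., 3.]]) -- mean of the two real tokens only
```

A naive `hidden.mean(1)` would average in the padding positions and give `[[4., 5.]]` here, contaminating the embedding.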
## Vocabulary
The tokenizer uses a character-level vocabulary of 25 tokens:
| Token | ID | Token | ID |
|---|---|---|---|
| [PAD] | 0 | [CLS] | 1 |
| [SEP] | 2 | [MASK] | 3 |
| [UNK] | 4 | A | 5 |
| C | 6 | D | 7 |
| E | 8 | F | 9 |
| G | 10 | H | 11 |
| I | 12 | K | 13 |
| L | 14 | M | 15 |
| N | 16 | P | 17 |
| Q | 18 | R | 19 |
| S | 20 | T | 21 |
| V | 22 | W | 23 |
| Y | 24 | | |
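The mapping above can be reconstructed in a few lines: 5 special tokens followed by the 20 standard amino acids in alphabetical order. The `encode` helper below is illustrative, not part of the released tokenizer:

```python
# Character-level vocabulary as listed in the table above.
SPECIALS = ["[PAD]", "[CLS]", "[SEP]", "[MASK]", "[UNK]"]
AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")

vocab = {tok: i for i, tok in enumerate(SPECIALS + AMINO_ACIDS)}

def encode(sequence: str) -> list[int]:
    """Encode a protein sequence as [CLS] residues... [SEP], mapping
    any non-standard character to [UNK]."""
    ids = [vocab["[CLS]"]]
    ids += [vocab.get(ch, vocab["[UNK]"]) for ch in sequence]
    ids.append(vocab["[SEP]"])
    return ids

print(encode("ACD"))  # [1, 5, 6, 7, 2]
```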
## Intended Use
This model is designed for:
- Protein embedding generation for downstream tasks (classification, clustering, similarity search)
- Masked residue prediction for protein engineering and variant effect prediction
- Transfer learning for plant-specific protein tasks
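For the similarity-search use case, one possible sketch of comparing embeddings by cosine similarity. `top_k_similar` is a hypothetical helper, and the toy 4-dim vectors stand in for the model's 512-dim CLS embeddings:

```python
import torch
import torch.nn.functional as F

def top_k_similar(query: torch.Tensor, database: torch.Tensor, k: int = 3):
    """Return indices of the k database embeddings most similar to the
    query, ranked by cosine similarity."""
    sims = F.cosine_similarity(query.unsqueeze(0), database, dim=-1)
    return sims.topk(k).indices.tolist()

# Toy embeddings in place of real model outputs.
db = torch.tensor([[1.0, 0.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0, 0.0],
                   [0.9, 0.1, 0.0, 0.0]])
query = torch.tensor([1.0, 0.0, 0.0, 0.0])

print(top_k_similar(query, db, k=2))  # [0, 2]
```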
## Limitations
- Trained primarily on plant protein sequences — may not generalize well to non-plant organisms
- Character-level tokenization (no BPE) — each amino acid is a single token
- Maximum sequence length of 512 tokens (including special tokens)
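A common workaround for the 512-token limit (an assumption here, not something this model card prescribes) is a sliding window over long sequences, averaging the per-chunk embeddings downstream. The sketch below reserves 2 tokens for [CLS] and [SEP]:

```python
def chunk_sequence(sequence: str, window: int = 510, overlap: int = 128):
    """Split a long protein sequence into overlapping windows so each
    chunk fits within the 512-token limit once [CLS] and [SEP] are added."""
    if len(sequence) <= window:
        return [sequence]
    step = window - overlap
    return [sequence[i:i + window] for i in range(0, len(sequence) - overlap, step)]

chunks = chunk_sequence("M" * 1000)
print([len(c) for c in chunks])  # [510, 510, 236]
```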
## Citation
```bibtex
@misc{plant-protein-bert,
  title={Plant Protein BERT: A Protein Language Model for Plant Proteomics},
  year={2026},
  publisher={HuggingFace},
}
```