---
license: mit
language: en
tags:
- peptide
- biology
- drug-discovery
- HELM
- helm-notation
- cyclic-peptide
- peptide-language-model
pipeline_tag: fill-mask
widget:
- text: "PEPTIDE1{A.C.D.E.F}$$$$"
---
# HELM-BERT
A language model for peptide representation learning using HELM (Hierarchical Editing Language for Macromolecules) notation.
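For orientation, a HELM string lists the monomers of each polymer chain, followed by a connection section. The snippet below shows the linear pentapeptide from the widget above alongside an illustrative head-to-tail macrocycle (the cyclic example is ours, not drawn from the training data):

```python
# Illustrative HELM strings. The cyclization example is an assumption for
# illustration only, not taken from this model's training corpus.
linear = "PEPTIDE1{A.C.D.E.F}$$$$"  # linear pentapeptide Ala-Cys-Asp-Glu-Phe

# Head-to-tail macrocycle: residue 5's R2 attachment point (C-terminus)
# bonded to residue 1's R1 attachment point (N-terminus).
cyclic = "PEPTIDE1{A.C.D.E.F}$PEPTIDE1,PEPTIDE1,5:R2-1:R1$$$"
```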
## Model Description
HELM-BERT is a BERT-style encoder designed specifically for peptide sequences in HELM notation. It incorporates several architectural innovations:
- **Disentangled Attention**: separate content and position representations, as in DeBERTa
- **Enhanced Mask Decoder (EMD)**: absolute position information injected just before the MLM output layer
- **Span Masking**: contiguous spans of tokens are masked during pretraining for improved contextual learning (sketched after this list)
- **nGiE**: an n-gram induced input encoding layer for recognizing local patterns
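As a rough illustration of the span-masking objective, here is a minimal sketch; the masking rate, span-length distribution, and actual pretraining code are assumptions, not taken from this repository:

```python
import random

def span_mask(token_ids, mask_id, mask_prob=0.15, max_span=3):
    """Mask contiguous runs of tokens instead of independent positions."""
    ids = list(token_ids)
    i = 0
    while i < len(ids):
        if random.random() < mask_prob:
            # Start a masked span of random length at position i
            span_len = random.randint(1, max_span)
            for j in range(i, min(i + span_len, len(ids))):
                ids[j] = mask_id
            i += span_len
        else:
            i += 1
    return ids
```

Masking whole spans forces the model to reconstruct each token from surrounding context rather than from its immediate neighbors, which is the motivation given for this choice.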
## How to Use
```python
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("Flansma/helm-bert", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("Flansma/helm-bert", trust_remote_code=True)

# Encode a linear pentapeptide written in HELM notation
inputs = tokenizer("PEPTIDE1{A.C.D.E.F}$$$$", return_tensors="pt")
outputs = model(**inputs)
embeddings = outputs.last_hidden_state  # per-token representations
```
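To reduce the per-token output to a single fixed-size vector per peptide for downstream property prediction, one common choice (not necessarily the authors') is attention-masked mean pooling, continuing from the snippet above:

```python
import torch

# Mean-pool token embeddings, ignoring padding positions.
with torch.no_grad():
    outputs = model(**inputs)
mask = inputs["attention_mask"].unsqueeze(-1).float()
peptide_embedding = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
```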
## Training Data
Pretrained on deduplicated peptide sequences (deduplication sketched after this list) from:
- ChEMBL: Bioactive molecules database
- CycPeptMPDB: Cyclic peptide membrane permeability database
- Propedia: Protein-peptide interaction database
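The deduplication procedure itself is not described on this card; a minimal sketch of exact-string deduplication over HELM entries (the normalization actually used may be more involved) could look like:

```python
def dedup_helm(helm_strings):
    """Keep the first occurrence of each exact HELM string."""
    seen = set()
    unique = []
    for helm in helm_strings:
        key = helm.strip()
        if key not in seen:
            seen.add(key)
            unique.append(helm)
    return unique
```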
## Citation
```bibtex
@misc{helm-bert,
  title={HELM-BERT: A Transformer for Medium-sized Peptide Property Prediction},
  author={Seungeon Lee},
  year={2025},
  url={https://huggingface.co/Flansma/helm-bert}
}
```
## License
MIT License