---
license: mit
language: en
tags:
- peptide
- biology
- drug-discovery
- HELM
- helm-notation
- cyclic-peptide
- peptide-language-model
pipeline_tag: fill-mask
widget:
- text: "PEPTIDE1{[Abu].[Sar].[meL].V.[meL].A.[dA].[meL].[meL].[meV].[Me_Bmt(E)]}$PEPTIDE1,PEPTIDE1,1:R1-11:R2$$$"
---

# HELM-BERT

A language model for peptide representation learning using **HELM (Hierarchical Editing Language for Macromolecules)** notation.

[![GitHub](https://img.shields.io/badge/GitHub-clinfo%2FHELM--BERT-black?logo=github)](https://github.com/clinfo/HELM-BERT)

## Model Description

HELM-BERT adapts the DeBERTa architecture to peptide sequences written in HELM notation:

- **Disentangled attention**: Decomposes attention into content-content and content-position terms
- **Enhanced Mask Decoder (EMD)**: Injects absolute position embeddings at the decoder stage
- **Span masking**: Masks contiguous token spans with lengths drawn from a geometric distribution (see the illustrative sketch at the end of this card)
- **nGiE**: n-gram induced encoding layer (1D convolution, kernel size 3)

## Model Specifications

| Parameter | Value |
|-----------|-------|
| Parameters | 54.8M |
| Hidden size | 768 |
| Layers | 6 |
| Attention heads | 12 |
| Vocabulary size | 78 |
| Max sequence length | 512 tokens |

## How to Use

```python
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("Flansma/helm-bert", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("Flansma/helm-bert", trust_remote_code=True)

# Cyclosporine A in HELM notation (head-to-tail cyclized 11-residue peptide)
inputs = tokenizer(
    "PEPTIDE1{[Abu].[Sar].[meL].V.[meL].A.[dA].[meL].[meL].[meV].[Me_Bmt(E)]}$PEPTIDE1,PEPTIDE1,1:R1-11:R2$$$",
    return_tensors="pt",
)
outputs = model(**inputs)
embeddings = outputs.last_hidden_state  # per-token embeddings, shape (1, seq_len, 768)
```

Further illustrative sketches (masked-token prediction, pooled peptide embeddings) appear at the end of this card.

## Training Data

Pretrained on deduplicated peptide sequences from:

- **ChEMBL**: Bioactive molecules database
- **CycPeptMPDB**: Cyclic peptide membrane permeability database
- **Propedia**: Protein-peptide interaction database

## Citation

```bibtex
@article{lee2025helmbert,
  title={HELM-BERT: A Transformer for Medium-sized Peptide Property Prediction},
  author={Seungeon Lee and Takuto Koyama and Itsuki Maeda and Shigeyuki Matsumoto and Yasushi Okuno},
  journal={arXiv preprint arXiv:2512.23175},
  year={2025},
  url={https://arxiv.org/abs/2512.23175}
}
```

## License

MIT License
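
## Sketch: Masked-Token Prediction

The card's `pipeline_tag` is `fill-mask`, so the checkpoint should expose a masked-language-modeling head. The snippet below is a minimal sketch, not documented usage: it assumes `AutoModelForMaskedLM` resolves the custom head via `trust_remote_code=True`, that the tokenizer defines a mask token, and that bracketed monomers such as `[Sar]` tokenize as single tokens.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model = AutoModelForMaskedLM.from_pretrained("Flansma/helm-bert", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("Flansma/helm-bert", trust_remote_code=True)

# Mask one residue of cyclosporine A and ask the model to fill it in.
helm = "PEPTIDE1{[Abu].[Sar].[meL].V.[meL].A.[dA].[meL].[meL].[meV].[Me_Bmt(E)]}$PEPTIDE1,PEPTIDE1,1:R1-11:R2$$$"
masked = helm.replace("[Sar]", tokenizer.mask_token, 1)

inputs = tokenizer(masked, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Top-5 candidate tokens at the masked position.
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top5 = logits[0, mask_pos].topk(5).indices[0]
print(tokenizer.convert_ids_to_tokens(top5.tolist()))
```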
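
## Sketch: Pooled Peptide Embeddings

`last_hidden_state` is per-token; downstream tasks often need one fixed-size vector per peptide. Nothing in this card specifies a pooling scheme, so the attention-mask-weighted mean below is an assumption, not the authors' method. It also assumes the tokenizer defines a padding token.

```python
import torch
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("Flansma/helm-bert", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("Flansma/helm-bert", trust_remote_code=True)

def embed(helm_strings):
    """Mean-pool token embeddings into one 768-d vector per sequence (assumed pooling)."""
    batch = tokenizer(helm_strings, return_tensors="pt", padding=True,
                      truncation=True, max_length=512)
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state       # (B, T, 768)
    mask = batch["attention_mask"].unsqueeze(-1)        # (B, T, 1); zeros out padding
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1) # (B, 768)

# Toy linear tripeptide (Ala-Gly-Cys) in HELM notation, for illustration only.
vecs = embed(["PEPTIDE1{A.G.C}$$$$"])
print(vecs.shape)  # torch.Size([1, 768])
```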
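
## Sketch: Span Masking

The pretraining objective masks contiguous token spans with lengths drawn from a geometric distribution. The snippet below is a generic illustration of that sampling scheme, not the authors' implementation; the success probability `p`, the span-length cap, and the 15% masking budget are all assumptions.

```python
import numpy as np

def sample_span_mask(seq_len, mask_budget=0.15, p=0.2, max_span=10, rng=None):
    """Return a boolean mask covering contiguous spans until ~mask_budget of tokens are masked."""
    rng = rng or np.random.default_rng()
    mask = np.zeros(seq_len, dtype=bool)
    target = int(seq_len * mask_budget)
    while mask.sum() < target:
        span = min(rng.geometric(p), max_span)          # span length ~ Geometric(p), capped
        start = rng.integers(0, max(seq_len - span, 1)) # random span start
        mask[start:start + span] = True
    return mask

print(sample_span_mask(40, rng=np.random.default_rng(0)).astype(int))
```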