---
license: mit
language: en
tags:
- peptide
- biology
- drug-discovery
- HELM
- helm-notation
- cyclic-peptide
- peptide-language-model
pipeline_tag: fill-mask
widget:
- text: "PEPTIDE1{[Abu].[Sar].[meL].V.[meL].A.[dA].[meL].[meL].[meV].[Me_Bmt(E)]}$PEPTIDE1,PEPTIDE1,1:R1-11:R2$$$"
---
|
|
|
|
|
# HELM-BERT |
|
|
|
|
|
A language model for peptide representation learning using **HELM (Hierarchical Editing Language for Macromolecules)** notation. |
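
In HELM notation, a simple-polymer section such as `PEPTIDE1{...}` lists monomers separated by `.`, with non-standard monomers (e.g. `[Abu]`, `[meL]`) in square brackets; the `$`-delimited section that follows declares bonds beyond the linear backbone. In the Cyclosporine A string used throughout this card, `PEPTIDE1,PEPTIDE1,1:R1-11:R2` bonds residue 1 (attachment point R1) to residue 11 (attachment point R2), closing the macrocycle, and the trailing `$$$` are empty polymer-group/annotation sections.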
|
|
|
|
|
[GitHub repository](https://github.com/clinfo/HELM-BERT)
|
|
|
|
|
## Model Description |
|
|
|
|
|
HELM-BERT is built on the DeBERTa architecture and adapted to peptide sequences written in HELM notation:

- **Disentangled Attention**: Decomposes attention into content-content and content-position terms
- **Enhanced Mask Decoder (EMD)**: Injects absolute position embeddings at the decoder stage
- **Span Masking**: Masks contiguous spans of tokens, with span lengths drawn from a geometric distribution (see the sketch after this list)
- **nGiE**: n-gram Induced Encoding layer (1D convolution, kernel size 3)
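
The span-masking objective is easy to picture with a small sketch. The values of `p`, the masking rate, and the span cap below are illustrative assumptions, not HELM-BERT's published hyperparameters:

```python
import numpy as np

def span_mask(tokens, mask_token="[MASK]", mask_rate=0.15, p=0.2, max_span=10):
    """Mask contiguous token spans with geometrically distributed lengths."""
    tokens = list(tokens)
    budget = max(1, round(len(tokens) * mask_rate))  # total positions to mask
    masked = 0
    while masked < budget:
        # Span length ~ Geometric(p), capped so it fits the sequence and the budget
        span = min(int(np.random.geometric(p)), max_span, budget - masked, len(tokens))
        start = np.random.randint(0, len(tokens) - span + 1)
        tokens[start:start + span] = [mask_token] * span
        masked += span  # note: overlapping spans may re-mask positions in this sketch
    return tokens

# Example on monomer-level tokens of a short peptide
print(span_mask("[Abu] [Sar] [meL] V [meL] A [dA] [meL] [meL] [meV]".split()))
```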
|
|
|
|
|
## Model Specifications |
|
|
|
|
|
| Parameter | Value |
|-----------|-------|
| Parameters | 54.8M |
| Hidden size | 768 |
| Layers | 6 |
| Attention heads | 12 |
| Vocab size | 78 |
| Max token length | 512 |
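
These figures can be sanity-checked directly from the loaded checkpoint, using only standard `transformers`/PyTorch calls:

```python
from transformers import AutoModel

model = AutoModel.from_pretrained("Flansma/helm-bert", trust_remote_code=True)

n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.1f}M parameters")                       # expect ~54.8M
print(model.config.hidden_size, model.config.num_hidden_layers)  # expect 768, 6
```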
|
|
|
|
|
## How to Use |
|
|
|
|
|
```python
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("Flansma/helm-bert", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("Flansma/helm-bert", trust_remote_code=True)

# Cyclosporine A in HELM notation (head-to-tail macrocycle)
inputs = tokenizer(
    "PEPTIDE1{[Abu].[Sar].[meL].V.[meL].A.[dA].[meL].[meL].[meV].[Me_Bmt(E)]}$PEPTIDE1,PEPTIDE1,1:R1-11:R2$$$",
    return_tensors="pt",
)
outputs = model(**inputs)
embeddings = outputs.last_hidden_state  # per-token embeddings, shape (1, seq_len, 768)
```
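
`last_hidden_state` is per-token; to get one fixed-size vector per peptide for downstream property models, a common choice (not necessarily what the HELM-BERT paper uses) is attention-masked mean pooling:

```python
# Continuing from the snippet above
mask = inputs["attention_mask"].unsqueeze(-1)                   # (1, seq_len, 1)
peptide_vec = (embeddings * mask).sum(dim=1) / mask.sum(dim=1)  # (1, 768)
```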
|
|
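The card's `fill-mask` pipeline tag suggests masked-monomer prediction also works through the standard pipeline API; a sketch (the exact mask token is whatever this tokenizer defines, hence the `fill.tokenizer.mask_token` lookup):

```python
from transformers import pipeline

fill = pipeline("fill-mask", model="Flansma/helm-bert", trust_remote_code=True)

# Mask one monomer of the Cyclosporine A string and predict replacements
helm = ("PEPTIDE1{[Abu].[Sar]." + fill.tokenizer.mask_token
        + ".V.[meL].A.[dA].[meL].[meL].[meV].[Me_Bmt(E)]}"
          "$PEPTIDE1,PEPTIDE1,1:R1-11:R2$$$")
for pred in fill(helm, top_k=3):
    print(pred["token_str"], f"{pred['score']:.3f}")
```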
|
|
|
## Training Data |
|
|
|
|
|
Pretrained on deduplicated peptide sequences from:

- **ChEMBL**: Bioactive molecules database
- **CycPeptMPDB**: Cyclic peptide membrane permeability database
- **Propedia**: Protein-peptide interaction database
|
|
|
|
## Citation |
|
|
|
|
|
```bibtex
@article{lee2025helmbert,
  title={HELM-BERT: A Transformer for Medium-sized Peptide Property Prediction},
  author={Seungeon Lee and Takuto Koyama and Itsuki Maeda and Shigeyuki Matsumoto and Yasushi Okuno},
  journal={arXiv preprint arXiv:2512.23175},
  year={2025},
  url={https://arxiv.org/abs/2512.23175}
}
```
|
|
|
|
|
## License |
|
|
|
|
|
MIT License |
|
|
|