HELM sequence pre-tokenization
Hi @Flansma, thanks for making helm-bert open. It seems to be a very interesting model.
I wonder if there is an easy way to prepare or pre-tokenize peptide sequences in the HELM format. For example, for a peptide with sequence
fPVOLfP-AdaG-OL
I guess it could be something like PEPTIDE1{[fP].V.[OLeu].[fP].[AdaG].[OLeu]}$$$$ (not sure about the fP). For a single peptide, it might be okay to make the conversion by hand. But for batch inference, I wonder if there is an easy way to prepare a list of peptides. What would you suggest?
Thanks in advance.
Hi, thanks for your interest!
Unfortunately, there's no easy way to auto-convert peptide sequences to HELM format. Non-natural amino acid naming conventions aren't standardized, so building a universal converter is tricky.
For your example fPVOLfP-AdaG-OL, your guess looks roughly correct if fP refers to D-Proline. I'd recommend using monomer names that exist in the ChEMBL monomer library or CycPeptMPDB (http://cycpeptmpdb.com/).
For peptides composed only of natural amino acids, you can use RDKit for the conversion.
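For the pure string-assembly part (once you already know the monomer codes, including any non-natural ones), batch preparation could look something like the sketch below. The helper names are mine, not part of helm-bert, and this only builds simple linear single-chain HELM strings:

```python
def monomers_to_helm(monomers):
    """Build a linear single-chain HELM string from a list of monomer codes.

    HELM wraps multi-character monomer codes in square brackets;
    single-letter natural amino acid codes are left bare.
    """
    body = ".".join(m if len(m) == 1 else f"[{m}]" for m in monomers)
    return f"PEPTIDE1{{{body}}}$$$$"

def seq_to_helm(seq):
    """A one-letter natural amino acid string is just a list of 1-char monomers."""
    return monomers_to_helm(list(seq))

# Batch preparation: convert a list of peptides in one go.
peptides = [
    "ACDE",                                        # natural, one-letter codes
    ["fP", "V", "OLeu", "fP", "AdaG", "OLeu"],     # pre-split monomer codes
]
helm_list = [
    seq_to_helm(p) if isinstance(p, str) else monomers_to_helm(p)
    for p in peptides
]
# helm_list[0] → "PEPTIDE1{A.C.D.E}$$$$"
# helm_list[1] → "PEPTIDE1{[fP].V.[OLeu].[fP].[AdaG].[OLeu]}$$$$"
```

This doesn't solve the hard part (mapping non-standard residue names like fP to library monomer codes), but it removes the repetitive formatting work for batch inference.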
For more details, please refer to the paper (https://arxiv.org/abs/2512.23175). I'm planning to release the training code along with the preprocessing and fine-tuning datasets soon. I'll update this thread with the link once it's available!
Update (Jan 2, 2026): The repository is now public! You can access the code here: https://github.com/clinfo/HELM-BERT
Feel free to ask if you have any other questions.
Happy New Year!