HELM sequence pre-tokenization

#1
by alessandronascimento - opened

Hi @Flansma , thanks for making helm-bert open. It seems to be a very interesting model.

I wonder if there is an easy way to prepare or pre-tokenize peptide sequences in the helm format. For example, for a peptide with sequence

fPVOLfP-AdaG-OL

I guess it could be something like PEPTIDE1{[fP].V.[OLeu].[fP].[AdaG].[OLeu]}$$$$ (not sure about the fP). For a single peptide, it might be okay to make the conversion by hand, but for batch inference I wonder if there is an easy way to prepare a list of peptides. What would you suggest?

Thanks in advance.

Hi, thanks for your interest!

Unfortunately, there's no easy way to auto-convert peptide sequences to HELM format. Non-natural amino acid naming conventions aren't standardized, so building a universal converter is tricky.

For your example fPVOLfP-AdaG-OL, your guess looks roughly correct if fP refers to D-Proline. I'd recommend using monomer names that exist in the ChEMBL monomer library or CycPeptMPDB (http://cycpeptmpdb.com/).
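Once a peptide has been tokenized into monomer symbols (e.g. from the ChEMBL monomer library), assembling the HELM string itself is mechanical. Here is a minimal sketch for batch preparation; `to_helm` is a hypothetical helper that handles only simple linear peptides (no cyclization or disulfide bridges), and the monomer symbols in the example are assumptions, not verified names:

```python
def to_helm(monomers):
    """Build a simple-peptide HELM string from an ordered list of monomer symbols.

    Multi-character (non-natural) monomer symbols are wrapped in square
    brackets, as HELM requires; single-letter natural amino acids are not.
    """
    tokens = [m if len(m) == 1 else f"[{m}]" for m in monomers]
    return "PEPTIDE1{" + ".".join(tokens) + "}$$$$"


# Batch conversion: one monomer list per peptide.
peptides = [
    ["fP", "V", "OLeu", "fP", "AdaG", "OLeu"],  # assumed tokenization of fPVOLfP-AdaG-OL
    ["A", "C", "D", "E"],                       # natural amino acids only
]
helm_strings = [to_helm(p) for p in peptides]
for h in helm_strings:
    print(h)
```

The hard part remains mapping each non-natural residue name in your source data to a monomer symbol the model's library knows; this sketch only formats the result.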

For peptides composed only of natural amino acids, you can use RDKit for the conversion.
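For the natural-amino-acid case, RDKit can parse a one-letter sequence and serialize it to HELM directly. A minimal sketch (assumes RDKit is installed):

```python
from rdkit import Chem

# Parse a natural-amino-acid peptide from its one-letter sequence.
mol = Chem.MolFromSequence("ACDE")

# Serialize the peptide Mol to HELM notation,
# e.g. a string of the form PEPTIDE1{A.C.D.E}$$$$
helm = Chem.MolToHELM(mol)
print(helm)
```

Note that `MolFromSequence` only understands the 20 standard amino acids, so non-natural monomers like the ones above still need the manual mapping described earlier.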

For more details, please refer to the paper (https://arxiv.org/abs/2512.23175). I'm planning to release the training code along with the preprocessing and fine-tuning datasets soon. I'll update this thread with the link once it's available!

Update (Jan 2, 2026): The repository is now public! You can access the code here: https://github.com/clinfo/HELM-BERT

Feel free to ask if you have any other questions.

Happy New Year!
