HELM sequence pre-tokenization
Hi @Flansma, thanks for making helm-bert open. It seems to be a very interesting model.
I wonder if there is an easy way to prepare or pre-tokenize peptide sequences in the HELM format. For example, for a peptide with sequence
fPVOLfP-AdaG-OL
I guess it could be something like PEPTIDE1{[fP].V.[OLeu].[fP].[AdaG].[OLeu]}$$$$ (not sure about the fP). For a single peptide, it might be okay to make the conversion by hand. But for batch inference, I wonder if there is an easy way to prepare a list of peptides. What would you suggest?
Thanks in advance.
Hi, thanks for your interest!
Unfortunately, there's no easy way to auto-convert peptide sequences to HELM format. Non-natural amino acid naming conventions aren't standardized, so building a universal converter is tricky.
For your example fPVOLfP-AdaG-OL, your guess looks roughly correct if fP refers to D-Proline. I'd recommend using monomer names that exist in the ChEMBL monomer library or CycPeptMPDB (http://cycpeptmpdb.com/).
For peptides composed only of natural amino acids, you can use RDKit for the conversion.
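For the pure string-assembly part (once you already know the monomer codes, including any non-natural ones), batch preparation could look something like the sketch below. The helper names are mine, not part of helm-bert, and this only builds simple linear single-chain HELM strings:

```python
def monomers_to_helm(monomers):
    """Build a linear single-chain HELM string from a list of monomer codes.

    HELM wraps multi-character monomer codes in square brackets;
    single-letter natural amino acid codes are left bare.
    """
    body = ".".join(m if len(m) == 1 else f"[{m}]" for m in monomers)
    return f"PEPTIDE1{{{body}}}$$$$"

def seq_to_helm(seq):
    """A one-letter natural amino acid string is just a list of 1-char monomers."""
    return monomers_to_helm(list(seq))

# Batch preparation: convert a list of peptides in one go.
peptides = [
    "ACDE",                                        # natural, one-letter codes
    ["fP", "V", "OLeu", "fP", "AdaG", "OLeu"],     # pre-split monomer codes
]
helm_list = [
    seq_to_helm(p) if isinstance(p, str) else monomers_to_helm(p)
    for p in peptides
]
# helm_list[0] → "PEPTIDE1{A.C.D.E}$$$$"
# helm_list[1] → "PEPTIDE1{[fP].V.[OLeu].[fP].[AdaG].[OLeu]}$$$$"
```

This doesn't solve the hard part (mapping non-standard residue names like fP to library monomer codes), but it removes the repetitive formatting work for batch inference.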
For more details, please refer to the paper (https://arxiv.org/abs/2512.23175). I'm planning to release the training code along with the preprocessing and fine-tuning datasets soon. I'll update this thread with the link once it's available!
Update (Jan 2, 2026): The repository is now public! You can access the code here: https://github.com/clinfo/HELM-BERT
Feel free to ask if you have any other questions.
Happy New Year!