PeptideCLM is a chemical language model pretrained with masked language modeling (MLM) on 10.8M peptides and 12.6M small molecules.

The tokenizer cannot be loaded through transformers. Instead, load the custom tokenizer from the 'tokenizer' directory at https://github.com/AaronFeller/PeptideCLM

An example script is available in the repository. A short example follows (note: the tokenizer directory must be downloaded first):

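# Requires the 'tokenizer' directory from the repository in the working directory.
from tokenizer.my_tokenizers import SMILES_SPE_Tokenizer

def get_tokenizer():
    # Vocabulary and splits files shipped in the 'tokenizer' directory.
    vocab_file = 'tokenizer/new_vocab.txt'
    splits_file = 'tokenizer/new_splits.txt'
    tokenizer = SMILES_SPE_Tokenizer(vocab_file, splits_file)
    return tokenizer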
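
Because the tokenizer directory has to be downloaded separately, a small pre-flight check can produce a clearer error than a failed import. The following is a minimal sketch and is not part of the repository: the helper names `missing_tokenizer_files` and `require_tokenizer_files` are hypothetical, and the file list is inferred from the example above (`my_tokenizers.py` from the import path, plus the two text files).

```python
from pathlib import Path

# Assumption: these are the files the example above relies on.
# 'my_tokenizers.py' is inferred from the import 'tokenizer.my_tokenizers'.
REQUIRED_FILES = ("my_tokenizers.py", "new_vocab.txt", "new_splits.txt")


def missing_tokenizer_files(tokenizer_dir="tokenizer"):
    """Return the required file names that are absent from tokenizer_dir."""
    base = Path(tokenizer_dir)
    return [name for name in REQUIRED_FILES if not (base / name).is_file()]


def require_tokenizer_files(tokenizer_dir="tokenizer"):
    """Raise FileNotFoundError listing any missing tokenizer files."""
    missing = missing_tokenizer_files(tokenizer_dir)
    if missing:
        raise FileNotFoundError(
            f"Missing in '{tokenizer_dir}': {', '.join(missing)}. "
            "Download the 'tokenizer' directory from the PeptideCLM repository."
        )
```

Calling `require_tokenizer_files()` before `get_tokenizer()` would surface a missing download as a single explicit message.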

An improved version of PeptideCLM is available at https://huggingface.co/collections/aaronfeller/peptideclm-2