This is a specialized WordPiece tokenizer trained on human DNA sequences, with a vocabulary size of 4096.
```python
from transformers import AutoTokenizer

# Load the tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("leannmlindsey/hg38-wpc-v4096")

# Example usage: encode a batch of DNA sequences
sequences = ["ATCGATCGATCG", "GCTAGCTAGCTA"]
encoded = tokenizer(sequences)
print(encoded["input_ids"])
```
This tokenizer was trained with the WordPiece algorithm, using the Hugging Face Tokenizers library, on the hg38 human reference genome.
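For reference, a WordPiece tokenizer like this one can be trained with the Tokenizers library along the following lines. This is a minimal sketch, not the actual training script: the toy corpus, the special tokens, and the absence of a pre-tokenizer are all assumptions made for illustration.

```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.trainers import WordPieceTrainer

# Toy corpus of DNA fragments; the real model was trained on the
# hg38 reference genome (corpus handling here is illustrative only).
corpus = [
    "ATCGATCGATCGATCG",
    "GCTAGCTAGCTAGCTA",
    "TTGACCTTGACCTTGA",
]

# WordPiece model with an unknown token; no pre-tokenizer is set,
# since DNA sequences contain no whitespace to split on.
tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))

# Target vocabulary size matches the model card (4096); the special
# tokens listed here are a BERT-style assumption, not confirmed by the card.
trainer = WordPieceTrainer(
    vocab_size=4096,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)

tokenizer.train_from_iterator(corpus, trainer)

encoding = tokenizer.encode("ATCGATCG")
print(encoding.tokens)
```

On a tiny corpus like this, the trainer will stop well short of 4096 entries; the full vocabulary size is only reached when training on a genome-scale corpus.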