This tokenizer is part of a collection of models used for experiments on how BPE tokenization differs when tokenizers are trained on domain-specific genomes.
This is a specialized BPE tokenizer trained on human DNA sequences, with a vocabulary size of 4096.
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("leannmlindsey/hg38-bpe-v4096")

# Example usage: tokenize a batch of DNA sequences
sequences = ["ATCGATCGATCG", "GCTAGCTAGCTA"]
encoded = tokenizer(sequences)
print(encoded["input_ids"])
```
This tokenizer was trained with the Hugging Face Tokenizers library, using the BPE algorithm on the hg38 human reference genome.
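For illustration, a training setup of this kind could be sketched with the Tokenizers library as below. This is a minimal, hedged sketch: the special-token list and the toy training data are assumptions for demonstration, not the exact configuration used for this model, which was trained on the full hg38 genome.

```python
# Sketch: training a BPE tokenizer on DNA sequences with the
# Hugging Face Tokenizers library. The special tokens and toy
# corpus below are illustrative assumptions only.
from tokenizers import Tokenizer, models, trainers

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
trainer = trainers.BpeTrainer(
    vocab_size=4096,  # matches the vocabulary size stated above
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)

# DNA has no whitespace, so no pre-tokenizer is used; each raw
# sequence string is treated as a single unit and BPE learns
# merges over the A/C/G/T alphabet.
sequences = ["ATCGATCGATCG", "GCTAGCTAGCTA"]  # toy data, not hg38
tokenizer.train_from_iterator(sequences, trainer=trainer)

print(tokenizer.encode("ATCGATCG").tokens)
```

In practice the training iterator would stream chunks of the reference genome rather than a short in-memory list, but the trainer and model configuration are the same.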