This is a specialized WordPiece tokenizer trained on human DNA sequences, with a vocabulary size of 4096.
```python
from transformers import AutoTokenizer

# Load the tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("leannmlindsey/hg38-wpc-v4096")

# Example usage: encode a batch of DNA sequences
sequences = ["ATCGATCGATCG", "GCTAGCTAGCTA"]
encoded = tokenizer(sequences)
print(encoded["input_ids"])
```
This tokenizer was trained with the WordPiece algorithm, using the Hugging Face Tokenizers library, on the hg38 human reference genome.
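For reference, a WordPiece tokenizer like this one can be trained with the Tokenizers library along the following lines. This is a minimal sketch, not the actual training script: the toy corpus, the special tokens, and the absence of a pre-tokenizer are all assumptions made for illustration.

```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.trainers import WordPieceTrainer

# Toy corpus of DNA fragments; the real model was trained on the
# hg38 reference genome (corpus handling here is illustrative only).
corpus = [
    "ATCGATCGATCGATCG",
    "GCTAGCTAGCTAGCTA",
    "TTGACCTTGACCTTGA",
]

# WordPiece model with an unknown token; no pre-tokenizer is set,
# since DNA sequences contain no whitespace to split on.
tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))

# Target vocabulary size matches the model card (4096); the special
# tokens listed here are a BERT-style assumption, not confirmed by the card.
trainer = WordPieceTrainer(
    vocab_size=4096,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)

tokenizer.train_from_iterator(corpus, trainer)

encoding = tokenizer.encode("ATCGATCG")
print(encoding.tokens)
```

On a tiny corpus like this, the trainer will stop well short of 4096 entries; the full vocabulary size is only reached when training on a genome-scale corpus.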