# Vietnamese BPE Tokenizer (32k vocab)

A byte-level BPE tokenizer trained from scratch on ~4M lines of Vietnamese Wikipedia text.
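As a refresher on what BPE training does, here is a toy sketch of a single merge step (illustration only, not the actual training code — the real tokenizer was trained with the Hugging Face `tokenizers` library): count all adjacent symbol pairs across the corpus, then merge the most frequent pair into a new symbol.

```python
from collections import Counter

def bpe_merge_step(words):
    """One BPE training step: find the most frequent adjacent pair of
    symbols across all words and merge it into a single new symbol."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    if not pairs:
        return words, None
    best = max(pairs, key=pairs.get)
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                out.append(symbols[i] + symbols[i + 1])  # merge the pair
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged, best

# Tiny toy corpus: word -> frequency, words stored as tuples of characters.
words = {tuple("low"): 5, tuple("lower"): 2, tuple("newest"): 6}
words, pair = bpe_merge_step(words)
# Most frequent pair is ('w', 'e'): 2 from "lower" + 6 from "newest".
```

Repeating this step until the vocabulary reaches the target size (here, 32,000 entries) yields the merge table the tokenizer applies at encoding time.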

| Parameter | Value |
|---|---|
| Vocab size | 32,000 |
| Model | BPE (Byte-Pair Encoding) |
| Pre-tokenizer | ByteLevel |
| Normalizer | NFC Unicode |
| min_frequency | 3 |
| Training data | Vietnamese Wikipedia (~4M lines) |
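The configuration above maps directly onto the Hugging Face `tokenizers` library. A minimal training sketch under these settings (the in-memory corpus is illustrative only; the actual model was trained on the full ~4M-line Wikipedia dump):

```python
from tokenizers import Tokenizer, decoders, models, normalizers, pre_tokenizers, trainers

# BPE model with the table's normalizer and pre-tokenizer settings.
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.normalizer = normalizers.NFC()
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=32_000,
    min_frequency=3,
    special_tokens=["[UNK]", "[PAD]", "[BOS]", "[EOS]", "[MASK]"],
)

# Train from an iterator for a self-contained demo; the real run
# would pass the Wikipedia text files instead.
corpus = ["Tiếng Việt là ngôn ngữ của người Việt."] * 10
tokenizer.train_from_iterator(corpus, trainer=trainer)
```

Listing the special tokens first in `BpeTrainer` is what pins them to IDs 0–4, matching the table below.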

## Special Tokens

| Token | ID |
|---|---|
| `[UNK]` | 0 |
| `[PAD]` | 1 |
| `[BOS]` | 2 |
| `[EOS]` | 3 |
| `[MASK]` | 4 |

## Usage

```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("KienCAS/vietnamese-bpe-32k")

encoded = tokenizer("Tiếng Việt là ngôn ngữ của người Việt.", return_tensors="pt")
decoded = tokenizer.decode(encoded["input_ids"][0], skip_special_tokens=True)
```