# Vietnamese BPE Tokenizer (32k vocab)

A BPE tokenizer trained from scratch on ~4M lines of Vietnamese Wikipedia.
| Parameter | Value |
|---|---|
| Vocab size | 32,000 |
| Model | BPE (Byte-Pair Encoding) |
| Pre-tokenizer | ByteLevel |
| Normalizer | NFC Unicode |
| min_frequency | 3 |
| Training data | Vietnamese Wikipedia (~4M lines) |
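The configuration above can be reproduced with the Hugging Face `tokenizers` library. The sketch below is illustrative, not the exact training script: the one-sentence corpus is a placeholder for the ~4M-line Wikipedia dump, and the output filename is an assumption.

```python
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, trainers

# BPE model with NFC normalization and byte-level pre-tokenization,
# matching the parameters in the table above.
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.normalizer = normalizers.NFC()
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=32_000,
    min_frequency=3,
    # Special tokens are registered first, so they receive IDs 0-4 in order.
    special_tokens=["[UNK]", "[PAD]", "[BOS]", "[EOS]", "[MASK]"],
)

# Placeholder corpus; the released tokenizer was trained on ~4M Wikipedia lines.
corpus = ["Tiếng Việt là ngôn ngữ của người Việt."]
tokenizer.train_from_iterator(corpus, trainer)
tokenizer.save("vi-bpe-32k.json")  # hypothetical output path
```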
## Special Tokens
| Token | ID |
|---|---|
| [UNK] | 0 |
| [PAD] | 1 |
| [BOS] | 2 |
| [EOS] | 3 |
| [MASK] | 4 |
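These IDs follow from the order in which the special tokens are registered during training. A minimal sketch of how such a backend tokenizer is wired into the Transformers wrapper (illustrative only; the published checkpoint already ships this mapping, so you would normally just call `from_pretrained`):

```python
from tokenizers import Tokenizer, models
from transformers import PreTrainedTokenizerFast

# Fresh BPE backend; registering the special tokens first assigns IDs 0-4.
backend = Tokenizer(models.BPE(unk_token="[UNK]"))
backend.add_special_tokens(["[UNK]", "[PAD]", "[BOS]", "[EOS]", "[MASK]"])

# Wrap the backend so Transformers knows which token plays which role.
wrapped = PreTrainedTokenizerFast(
    tokenizer_object=backend,
    unk_token="[UNK]",
    pad_token="[PAD]",
    bos_token="[BOS]",
    eos_token="[EOS]",
    mask_token="[MASK]",
)
print(wrapped.pad_token_id)  # 1, matching the table above
```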
## Usage
```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("KienCAS/vietnamese-bpe-32k")

# Encode a sentence to PyTorch tensors of token IDs
encoded = tokenizer("Tiếng Việt là ngôn ngữ của người Việt.", return_tensors="pt")

# Decode back to text, dropping special tokens
decoded = tokenizer.decode(encoded["input_ids"][0], skip_special_tokens=True)
```