---
license: apache-2.0
datasets:
- VTSNLP/vietnamese_curated_dataset
language:
- vi
tags:
- tokenizer
- vietnamese
- byte-bpe
- causal-lm
- nlp
---

# Vietnamese Tokenizer

This repository contains a **ByteLevel BPE tokenizer** trained **from scratch** specifically for the **Vietnamese language**, designed for **decoder-only language model pretraining**.

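As a rough illustration of why a byte-level vocabulary suits Vietnamese, the sketch below uses plain Python only; it is not this tokenizer's own code. Accented Vietnamese characters occupy several UTF-8 bytes each, and a byte-level BPE starts from the 256 possible byte values, so any input string can be represented without an unknown token. The learned merges then compress frequent byte sequences into single tokens. The helper name `utf8_bytes` and the sample string are assumptions made for this example.

```python
def utf8_bytes(text: str) -> list[int]:
    """Return the UTF-8 byte values that a byte-level tokenizer starts from."""
    return list(text.encode("utf-8"))


# Illustrative sample string, not taken from the training data.
sample = "Tiếng Việt"
byte_values = utf8_bytes(sample)

# Accented characters need more than one byte, so there are more bytes than characters.
print("characters:", len(sample))
print("bytes:", len(byte_values))

# Round trip: decoding the bytes recovers the original text exactly.
restored = bytes(byte_values).decode("utf-8")
print("lossless round trip:", restored == sample)
```

Because every byte value is a valid base symbol, rare diacritic combinations, emoji, or mixed-language text still encode and decode losslessly, only with more tokens.
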

---

## 🚀 Usage

### Load tokenizer

```python
from transformers import AutoTokenizer

# Load the fast tokenizer from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained(
    "tranhuyHoang/mini_VN_decoder_tokenizer",
    use_fast=True,
)