Commit 0901c40 (verified), committed by tranhuyHoang · 1 parent: fb3b6e2

Update README.md

Files changed (1): README.md (+30 -3)
@@ -1,3 +1,30 @@
- ---
- license: apache-2.0
- ---
+ ---
+ license: apache-2.0
+ datasets:
+ - VTSNLP/vietnamese_curated_dataset
+ language:
+ - vi
+ tags:
+ - tokenizer
+ - vietnamese
+ - byte-bpe
+ - causal-lm
+ - nlp
+ ---
+ # Vietnamese Tokenizer
+
+ This repository contains a **ByteLevel BPE tokenizer** trained **from scratch** specifically for the **Vietnamese language**, designed for **decoder-only language model pretraining**.
+
+ ---
+
+ ## 🚀 Usage
+
+ ### Load tokenizer
+
+ ```python
+ from transformers import AutoTokenizer
+
+ tokenizer = AutoTokenizer.from_pretrained(
+     "tranhuyHoang/mini_VN_decoder_tokenizer",
+     use_fast=True
+ )