gsaltintas committed on
Commit 8a356bf · verified · 1 Parent(s): 1719c64

Upload folder using huggingface_hub

Files changed (6)
  1. README.md +49 -0
  2. merges.txt +0 -0
  3. special_tokens_map.json +5 -0
  4. tokenizer.json +0 -0
  5. tokenizer_config.json +0 -0
  6. vocab.json +0 -0
README.md ADDED
@@ -0,0 +1,49 @@
+ ---
+ license: mit
+ language:
+ - cmn # ISO 639-3 code or "und" if not identifiable
+ tags:
+ - tokenizer
+ - bpe
+ - flexitok
+ - fineweb2
+ ---
+
+ # Byte-Level BPE Tokenizer: cmn_Hani (16K)
+
+ A **Byte-Level BPE** tokenizer trained on **cmn_Hani** data from Fineweb-2-HQ.
+
+ ## Training Details
+
+ | Parameter | Value |
+ |-----------|-------|
+ | Algorithm | Byte-Level BPE |
+ | Language | `cmn_Hani` |
+ | Target Vocab Size | 16,000 |
+ | Final Vocab Size | 19,127 |
+ | Pre-tokenizer | custom:cmn_Hani |
+ | Number handling | ltr_3digit |
+ | Contraction handling | True |
+ | Normalizer | NFC |
+ | Special Tokens | `<s>`, `</s>`, `<pad>`, `<unk>` |
+ | Training Shards | 2 |
+
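Two rows of the Training Details table, `Normalizer` (NFC) and `Number handling` (`ltr_3digit`), describe text transformations that can be illustrated without the tokenizer itself. The sketch below is a standalone approximation, not the tokenizer's actual pre-tokenization code. It assumes that `ltr_3digit` means "split each run of digits into groups of at most three, starting from the left", which matches the sample row where `12345` becomes `123, 45`. The function names `normalize_nfc` and `split_digits_ltr` are hypothetical.

```python
import re
import unicodedata
from typing import List


def normalize_nfc(text: str) -> str:
    """Apply Unicode NFC normalization (canonical composition)."""
    return unicodedata.normalize("NFC", text)


def split_digits_ltr(text: str, group: int = 3) -> List[str]:
    """Split text into non-digit runs and digit chunks of at most `group` digits.

    Assumed reading of `ltr_3digit`: chunks are cut left to right, so any
    shorter remainder ends up in the last chunk of a digit run.
    """
    # The regex scans left to right; the greedy {1,group} takes full groups first.
    pattern = rf"[0-9]{{1,{group}}}|[^0-9]+"
    return re.findall(pattern, text)


if __name__ == "__main__":
    print(split_digits_ltr("12345"))            # digit run cut into groups of three
    print(normalize_nfc("e\u0301") == "\u00e9")  # combining accent composed by NFC
```

Grouping from the left keeps the split independent of the total number length, at the cost of the final chunk being the short one (`45` in the sample) rather than the first.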
+ ## Usage
+
+ ```python
+ from transformers import AutoTokenizer
+
+ # Load the tokenizer files from the Hugging Face Hub.
+ tokenizer = AutoTokenizer.from_pretrained("flexitok/bpe_script_CmJp_16000")
+ # encode() returns a list of integer token IDs, not token strings.
+ token_ids = tokenizer.encode("Hello, world!")
+ ```
+
+ ## Files
+
+ - `tokenizer.json`: Full HuggingFace tokenizer
+ - `vocab.json`: Vocabulary mapping
+ - `merges.txt`: BPE merge rules
+
+ ## Sample Encoding
+ | Text | Tokens | Token IDs |
+ |------|--------|-----------|
+ | `Hello, world! 12345 This is a test. こんにちは` | `H, ell, o, ,, Ġw, orld, !, Ġ, 123, 45, ĠTh, is, Ġis, Ġa, Ġt, est, ., Ġ, ãģ, ĵ` | `42, 3452, 81, 14, 2927, 10693, 3, 223, 12071, 3557, 9838, 1476, 6231, 3791, 4071, 3540, 16, 223, 3577, 244` |
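The `Ġ` and `ãģ` pieces in the Tokens column are not encoding damage: byte-level BPE tokenizers in the GPT-2 style display every raw byte as a visible Unicode character. The sketch below rebuilds that byte-to-character table and maps token strings back to text. It assumes this tokenizer follows the GPT-2 mapping, which the `Ġ` marker for spaces suggests but the README does not state. `bytes_to_unicode` and `tokens_to_text` are illustrative names, not part of this repository.

```python
def bytes_to_unicode() -> dict:
    """Build the GPT-2 style table mapping each byte (0-255) to one printable character."""
    printable = (
        list(range(0x21, 0x7F))      # "!" through "~"
        + list(range(0xA1, 0xAD))    # "¡" through "¬"
        + list(range(0xAE, 0x100))   # "®" through "ÿ"
    )
    # Printable bytes map to themselves.
    mapping = {b: chr(b) for b in printable}
    # Remaining bytes (controls, space, etc.) are shifted to code points from 256 upward.
    offset = 0
    for b in range(256):
        if b not in mapping:
            mapping[b] = chr(256 + offset)
            offset += 1
    return mapping


# Inverse table: displayed character -> original byte value.
CHAR_TO_BYTE = {ch: b for b, ch in bytes_to_unicode().items()}


def tokens_to_text(tokens) -> str:
    """Concatenate token strings, recover the raw bytes, and decode them as UTF-8."""
    raw = bytes(CHAR_TO_BYTE[ch] for tok in tokens for ch in tok)
    # Incomplete multi-byte sequences are shown as the replacement character.
    return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    print(tokens_to_text(["H", "ell", "o", ",", "Ġw", "orld", "!"]))
    print(tokens_to_text(["ãģ", "ĵ"]))
```

Under this mapping, `Ġ` stands for the space byte, and the last two tokens of the sample row, `ãģ` and `ĵ`, correspond to the bytes E3 81 93, the UTF-8 encoding of `こ`.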
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
special_tokens_map.json ADDED
@@ -0,0 +1,5 @@
+ {
+   "bos_token": "<s>",
+   "eos_token": "</s>",
+   "pad_token": "<pad>"
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
The diff for this file is too large to render. See raw diff
 
vocab.json ADDED
The diff for this file is too large to render. See raw diff