gsaltintas committed on
Commit 9e9f82f · verified · 1 Parent(s): 9a26c96

Upload folder using huggingface_hub

Files changed (5)
  1. README.md +8 -7
  2. merges.txt +2 -2
  3. tokenizer.json +0 -0
  4. tokenizer_config.json +0 -0
  5. vocab.json +0 -0
README.md CHANGED
@@ -1,7 +1,8 @@
 ---
 license: mit
 language:
-- cmn # ISO 639-3 code or "und" if not identifiable
+- cmn #
+- jpn #['cmn_Hani', 'jpn_Jpan'] # ISO 639-3 code or "und" if not identifiable
 tags:
 - tokenizer
 - bpe
@@ -9,24 +10,24 @@ tags:
 - fineweb2
 ---
 
-# Byte-Level BPE Tokenizer: cmn_Hani (16K)
+# Byte-Level BPE Tokenizer: ['cmn_Hani', 'jpn_Jpan'] (16K)
 
-A **Byte-Level BPE** tokenizer trained on **cmn_Hani** data from Fineweb-2-HQ.
+A **Byte-Level BPE** tokenizer trained on **['cmn_Hani', 'jpn_Jpan']** data from Fineweb-2-HQ.
 
 ## Training Details
 
 | Parameter | Value |
 |-----------|-------|
 | Algorithm | Byte-Level BPE |
-| Language | `cmn_Hani` |
+| Language | `['cmn_Hani', 'jpn_Jpan']` |
 | Target Vocab Size | 16,000 |
-| Final Vocab Size | 19,127 |
-| Pre-tokenizer | custom:cmn_Hani |
+| Final Vocab Size | 17,900 |
+| Pre-tokenizer | custom:jpn_Jpan |
 | Number handling | ltr_3digit |
 | Contraction handling | True |
 | Normalizer | NFC |
 | Special Tokens | `<s>`, `</s>`, `<pad>`, `<unk>` |
-| Training Shards | 2 |
+| Training Shards | 4 |
 
 ## Usage
 
merges.txt CHANGED
@@ -5926,9 +5926,9 @@
 ['æĮĩ', 'åįĹ']
 ['个', 'ä½ĵ']
 ['èĥ', 'ļ']
+['Â', '°']
 ['å°¼', 'æĸ¯']
 ['å¿ĺ', 'è®°']
-['Â', '°']
 ['èĬ±', 'è´¹']
 ['æİ¥', 'ç§į']
 ['ç»Ī', 'æŃ¢']
@@ -10661,8 +10661,8 @@
 ['r', 'ic']
 ['åı²', 'å¯Ĩ']
 ['åIJ¯', 'åıij']
-['å·²', 'çŁ¥']
 ['Â', '®']
+['å·²', 'çŁ¥']
 ['å¤Ħ', 'åľ¨']
 ['çļĦ', 'æĿ¡ä»¶']
 ['äºĮ', '级']
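
Both merges.txt hunks add and remove nothing; they only move existing merge rules to different lines. Assuming the file is read in the usual BPE way, where an earlier line means a higher merge priority, the effect of the first hunk can be checked with a small sketch like the one below. The helper name `rank_changes` is hypothetical and not part of this repository; the pairs are copied from lines 5929-5931 of the hunk.

```python
def rank_changes(old_merges, new_merges, first_line):
    """Return {pair: (old_line, new_line)} for pairs whose line number moved.

    Hypothetical helper for illustration; assumes each list holds the merge
    pairs of consecutive lines starting at `first_line`.
    """
    old_rank = {pair: first_line + i for i, pair in enumerate(old_merges)}
    new_rank = {pair: first_line + i for i, pair in enumerate(new_merges)}
    return {
        pair: (old_rank[pair], new_rank[pair])
        for pair in old_merges
        if pair in new_rank and old_rank[pair] != new_rank[pair]
    }


# Lines 5929-5931 of merges.txt before and after this commit (first hunk).
old_hunk = [("å°¼", "æĸ¯"), ("å¿ĺ", "è®°"), ("Â", "°")]
new_hunk = [("Â", "°"), ("å°¼", "æĸ¯"), ("å¿ĺ", "è®°")]

changes = rank_changes(old_hunk, new_hunk, first_line=5929)
for pair, (before, after) in changes.items():
    print(pair, before, "->", after)
```

Under that assumption, the `('Â', '°')` rule moves up two lines and the two rules it passes each move down one; the second hunk is a plain swap of lines 10664 and 10665.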
tokenizer.json CHANGED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json CHANGED
The diff for this file is too large to render. See raw diff
 
vocab.json CHANGED
The diff for this file is too large to render. See raw diff