Commit 19e52cd (verified), committed by gsaltintas
1 parent: 36f035b

Upload folder using huggingface_hub

Files changed (5)
  1. README.md +10 -3
  2. merges.txt +0 -0
  3. tokenizer.json +0 -0
  4. tokenizer_config.json +3 -2
  5. vocab.json +0 -0
README.md CHANGED
@@ -1,7 +1,7 @@
  ---
  license: mit
  language:
- - arb
+ - arb # ISO 639-3 code or "und" if not identifiable
  tags:
  - tokenizer
  - bpe
@@ -20,8 +20,10 @@ A **Byte-Level BPE** tokenizer trained on **arb_Arab** data from Fineweb-2-HQ.
  | Algorithm | Byte-Level BPE |
  | Language | `arb_Arab` |
  | Target Vocab Size | 16,000 |
- | Final Vocab Size | 0 |
- | Pre-tokenizer | ByteLevel |
+ | Final Vocab Size | 16,000 |
+ | Pre-tokenizer | gpt4 |
+ | Number handling | individual |
+ | Contraction handling | True |
  | Normalizer | NFC |
  | Special Tokens | `<s>`, `</s>`, `<pad>`, `<unk>` |
  | Training Shards | 2 |
@@ -40,3 +42,8 @@ tokens = tokenizer.encode("Hello, world!")
  - `tokenizer.json` — Full HuggingFace tokenizer
  - `vocab.json` — Vocabulary mapping
  - `merges.txt` — BPE merge rules
+
+ ## Sample Encoding
+ | Text | Tokens | Token IDs |
+ |------|--------|-----------|
+ | `Hello, world! 12345 This is a test. こんにちは` | `H, ell, o, ,, Ġw, orld, !, Ġ, 12, 3, 45, ĠTh, is, Ġis, Ġa, Ġt, est, ., Ġ, ãģ` | `43, 4047, 82, 15, 1168, 9534, 4, 178, 1228, 22, 4101, 5756, 853, 3818, 1153, 605, 2611, 17, 178, 11047` |
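Note on the Sample Encoding row: the tokens `Ġw` and `ãģ` are byte-level display forms, not corrupted text. The sketch below rebuilds the GPT-2 style byte-to-character table that byte-level BPE tokenizers commonly use and maps those two tokens back to raw bytes. It assumes this tokenizer follows that standard table, which the commit does not state; `bytes_to_unicode` and `token_to_bytes` are illustrative helper names, not part of this repository.

```python
def bytes_to_unicode():
    """Build a GPT-2 style table mapping each byte 0..255 to a printable character."""
    # Bytes that already correspond to printable characters map to themselves.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    # Every remaining byte (controls, space, etc.) is shifted to 256 and above.
    for b in range(256):
        if b not in printable:
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}


def token_to_bytes(token: str) -> bytes:
    """Map a displayed byte-level token back to the raw bytes it stands for."""
    return bytes(CHAR_TO_BYTE[ch] for ch in token)


# "\u0120" is the character displayed as Ġ; it stands for the space byte 0x20.
space_w = token_to_bytes("\u0120w")
# "\u00e3\u0123" is the token displayed as ãģ in the sample row.
partial = token_to_bytes("\u00e3\u0123")
# UTF-8 encoding of the first Japanese character in the sample text.
first_kana = "こ".encode("utf-8")

print(space_w, partial, first_kana)
```

Under that assumption, `ãģ` covers only the first two of the three UTF-8 bytes of `こ`, so the token list in the sample row appears to stop partway through the Japanese text rather than cover it fully.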
merges.txt CHANGED
The diff for this file is too large to render. See raw diff
 
tokenizer.json CHANGED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json CHANGED
@@ -40,5 +40,6 @@
  "model_max_length": 1000000000000000019884624838656,
  "pad_token": "<pad>",
  "tokenizer_class": "PreTrainedTokenizerFast",
- "unk_token": "<unk>"
- }
+ "unk_token": "<unk>",
+ "number_handling": "individual"
+ }
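Note on this hunk: the `unk_token` line changes only because it gains a trailing comma so the new `number_handling` key can follow it. The sketch below parses the new-side lines as a standalone JSON object to check that they are valid and to read the added key. The fragment is wrapped in braces for illustration; the real file has further keys before line 40 that the diff does not show. Treating `number_handling` as a project-specific key that consumers read themselves is an assumption, not something the commit confirms.

```python
import json

# New-side lines of the hunk, wrapped in braces so they parse on their own.
# The real tokenizer_config.json contains more keys above these.
fragment = """
{
  "model_max_length": 1000000000000000019884624838656,
  "pad_token": "<pad>",
  "tokenizer_class": "PreTrainedTokenizerFast",
  "unk_token": "<unk>",
  "number_handling": "individual"
}
"""

config = json.loads(fragment)

# The large model_max_length is exactly int(1e30), i.e. effectively "no limit".
is_unbounded = config["model_max_length"] == int(1e30)

# Read the added key defensively, since older revisions of the file lack it.
number_handling = config.get("number_handling", "unspecified")

print(is_unbounded, number_handling)
```

Reading the key with `.get` and a default keeps the same code working against the parent commit 36f035b, where the key is absent.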
vocab.json CHANGED
The diff for this file is too large to render. See raw diff