Upload folder using huggingface_hub
- README.md +2 -2
- merges.txt +0 -0
- tokenizer.json +0 -0
- tokenizer_config.json +1 -1
- vocab.json +0 -0
README.md
CHANGED
@@ -21,9 +21,9 @@ A **Byte-Level BPE** tokenizer trained on **['ell_Grek']** data from Fineweb-2-H
 | Algorithm | Byte-Level BPE |
 | Language | `['ell_Grek']` |
 | Target Vocab Size | 2,000 |
-| Final Vocab Size |
+| Final Vocab Size | 383 |
 | Pre-tokenizer | custom:addition |
-| Number handling |
+| Number handling | ltr_3digit |
 | Contraction handling | False |
 | Normalizer | NFC |
 | Special Tokens | `<s>`, `</s>`, `<pad>`, `<unk>` |
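The `ltr_3digit` value is not defined anywhere in this diff. As a rough illustration only, the sketch below assumes the name means "split each run of digits into groups of at most three, starting from the left". The function `split_numbers_ltr` and its `group` parameter are hypothetical and are not taken from the repository's actual pre-tokenizer.

```python
import re

ASCII_DIGITS = "0123456789"


def split_numbers_ltr(text, group=3):
    """Split digit runs into chunks of at most `group` digits, left to right.

    Hypothetical helper: an assumed reading of "ltr_3digit", not the
    tokenizer's real implementation.
    """
    pieces = []
    # Alternate between runs of ASCII digits and runs of everything else.
    for run in re.findall(r"[0-9]+|[^0-9]+", text):
        if run[0] in ASCII_DIGITS:
            # Chunk from the left, so any short remainder ends up last.
            pieces.extend(run[i:i + group] for i in range(0, len(run), group))
        else:
            pieces.append(run)
    return pieces


print(split_numbers_ltr("Το 1234567 είναι"))
```

Under this reading, grouping from the left puts the short remainder at the end of a number (`1234567` becomes `123`, `456`, `7`), whereas a right-to-left scheme would align the groups with thousands separators instead.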
merges.txt
CHANGED
The diff for this file is too large to render.
tokenizer.json
CHANGED
The diff for this file is too large to render.
tokenizer_config.json
CHANGED
@@ -53,5 +53,5 @@
 "pad_token": "<pad>",
 "tokenizer_class": "PreTrainedTokenizerFast",
 "unk_token": null,
-"number_handling": "
+"number_handling": "ltr_3digit"
 }
vocab.json
CHANGED
The diff for this file is too large to render.